KR102503895B1

KR102503895B1 - Audio signal processing method and apparatus

Info

Publication number: KR102503895B1
Application number: KR1020200170663A
Authority: KR
Inventors: 김의성; 전재진
Original assignee: 주식회사 카카오엔터프라이즈
Priority date: 2020-12-08
Filing date: 2020-12-08
Publication date: 2023-02-27
Also published as: KR20220081129A

Abstract

A method and apparatus for processing an acoustic signal that includes a voice signal and noise are disclosed. An acoustic signal processing method according to an embodiment includes: obtaining an input acoustic signal including a voice signal and noise; generating a first signal by removing the noise from the input acoustic signal; obtaining a parameter that is input through a user interface for noise control and that controls the intensity of the noise; and generating an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

Description

Acoustic signal processing method and apparatus {AUDIO SIGNAL PROCESSING METHOD AND APPARATUS}

The present disclosure relates to a method and apparatus for processing an acoustic signal, and more specifically to a method and apparatus for processing an acoustic signal that includes a voice signal and noise.

Noise removal is a technique for removing unintended distortions of an input signal, including interference from other signals, from an acoustic signal. For example, when a voice signal captured through a microphone contains noise caused by sounds in the surrounding environment, a background noise removal technique can remove the noise other than the voice. In other words, noise removal is a kind of voice preprocessing that eliminates or mitigates the influence of unwanted signals entering the microphone. Noise removal can be used to improve call quality by removing background noise that interferes with communication during a call, and it can also be used to improve the efficiency of voice recognition in a voice recognition system.

Much research has been conducted on noise removal. Traditional noise removal algorithms were developed using statistical techniques, but many algorithms based on deep learning have been developed recently. When a deep learning model trained to output a noise-removed signal is used, adjusting the strength of noise removal requires holding the weights of several models. This demands excessive software capacity to store the weights of those models, and latency increases because memory must be allocated for the many weight values.

An aspect may provide a voice signal processing technique that controls the degree of noise removal according to a set background noise intensity, and a technique that lets a user adjust the degree of noise removal through a user interface in an environment in which the user receives a voice signal.

An aspect may provide a technique for synthesizing an output signal in which the transition between sections that contain speech and sections that do not sounds natural, by adjusting the mixing ratio between the noise-removed voice signal and the input voice signal in each kind of section.

An aspect may provide a technique that determines when noise removal is needed in a call environment and, in such a situation, guides the user to control background noise removal appropriately through an interface.

However, the technical problems are not limited to those described above, and other technical problems may exist.

An acoustic signal processing method performed by a processor according to one aspect includes: obtaining an input acoustic signal including a voice signal and noise; generating a first signal by removing the noise from the input acoustic signal; obtaining a parameter that is input through a user interface for noise control and that controls the intensity of the noise; and generating an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

The acoustic signal processing method may further include: obtaining a signal-to-noise ratio (SNR) value of the input acoustic signal; and determining, based on the SNR value, whether to activate the user interface.
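As a minimal sketch of the SNR-based decision, the following Python example estimates an SNR from the input acoustic signal and the first signal and compares it with a threshold. The text does not specify how the SNR is obtained or in which direction the decision goes, so the noise estimate (input minus first signal), the function names, the 10 dB threshold, and the rule of activating the interface when the SNR is low are all assumptions made for illustration.

```python
import math


def mean_power(samples):
    """Average power of a list of samples."""
    return sum(s * s for s in samples) / len(samples)


def estimate_snr_db(x, y):
    """Estimate SNR in dB, treating y as the voice and (x - y) as the noise.

    x: input acoustic signal, y: first signal (noise removed).
    Returns None when either power is zero and the ratio is undefined.
    """
    noise = [xi - yi for xi, yi in zip(x, y)]
    p_signal = mean_power(y)
    p_noise = mean_power(noise)
    if p_signal == 0.0 or p_noise == 0.0:
        return None
    return 10.0 * math.log10(p_signal / p_noise)


def should_activate_ui(x, y, threshold_db=10.0):
    """Hypothetical rule: show the noise control UI when the SNR is low."""
    snr = estimate_snr_db(x, y)
    if snr is None:
        return False  # no usable estimate, leave the UI inactive
    return snr < threshold_db
```

With this rule, an input that is dominated by noise yields a low SNR estimate and activates the interface, while a nearly clean input leaves it inactive.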

The generating of the output acoustic signal may include: obtaining a speech presence probability for each section of the first signal; and computing, section by section, a weighted sum of the input acoustic signal and the first signal based on the speech presence probability and the parameter.

The computing of the weighted sum may include: in a section in which the speech presence probability is high, computing the weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and in a section in which the speech presence probability is low, computing the weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.
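The section-dependent weighting can be sketched as follows. This is only an illustration: the mapping from the parameter to the weights (`0.5 + 0.5 * alpha`), the 0.5 probability threshold, and the function name are hypothetical choices, since the text states only which signal receives the higher weight in each kind of section.

```python
def sectionwise_mix(x_sections, y_sections, probs, alpha, threshold=0.5):
    """Mix input (x) and first-signal (y) sections by speech presence.

    x_sections, y_sections: lists of equally sized sample lists.
    probs: one speech presence probability per section.
    alpha: user parameter in [0, 1]; larger values widen the gap
           between the two weights (assumed mapping).
    """
    dominant = 0.5 + 0.5 * alpha  # weight of the favored signal, in [0.5, 1]
    out = []
    for xs, ys, p in zip(x_sections, y_sections, probs):
        # Speech likely: favor the input; speech unlikely: favor the first signal.
        w_input = dominant if p >= threshold else 1.0 - dominant
        out.append([w_input * xi + (1.0 - w_input) * yi
                    for xi, yi in zip(xs, ys)])
    return out
```

Favoring the input in speech sections limits any distortion the noise removal may have introduced into the voice, while favoring the first signal elsewhere suppresses the background.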

The computing of the weighted sum may include: generating, by computing weighted sums of the input acoustic signal and the first signal based on the parameter, a second signal in which the input acoustic signal has the larger share and a third signal in which the first signal has the larger share; and computing, section by section, a weighted sum of the second signal and the third signal based on the speech presence probability.
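A two-stage version of this mixing might look like the sketch below. The way the parameter sets the larger and smaller shares (`max(alpha, 1 - alpha)`) and the use of the probability itself as the second-stage weight are assumptions for illustration; the text does not give these formulas.

```python
def two_stage_mix(x, y, probs, frame_len, alpha):
    """Blend an input-heavy and a first-signal-heavy mix by speech presence.

    x: input acoustic signal, y: first signal (same length).
    probs: one speech presence probability per frame of frame_len samples.
    alpha: user parameter in [0, 1] (assumed to set the larger share).
    """
    hi = max(alpha, 1.0 - alpha)  # larger share, at least 0.5
    lo = 1.0 - hi                 # smaller share
    second = [hi * xi + lo * yi for xi, yi in zip(x, y)]  # input-heavy
    third = [lo * xi + hi * yi for xi, yi in zip(x, y)]   # first-signal-heavy
    out = []
    for i, (s2, s3) in enumerate(zip(second, third)):
        p = probs[i // frame_len]  # probability of the frame holding sample i
        out.append(p * s2 + (1.0 - p) * s3)
    return out
```

Because the probability varies continuously, the output moves gradually between the two mixes instead of switching abruptly at section boundaries.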

The obtaining of the speech presence probability may include obtaining, for each of the frames generated by dividing the first signal into specific time units, the probability that the frame contains the voice signal, based on a voice activity detection algorithm.
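The frame-wise computation can be illustrated with a deliberately simple stand-in. The text does not name a particular voice activity detection algorithm; the energy-based logistic score below, its threshold and slope values, and the function names are assumptions chosen only to show the per-frame structure.

```python
import math


def split_frames(signal, frame_len):
    """Divide a signal into consecutive frames of frame_len samples."""
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]


def speech_presence_probabilities(first_signal, frame_len,
                                  threshold_db=-40.0, slope=0.5):
    """Return one pseudo-probability per frame from the frame energy.

    Stand-in detector: frames whose level is well above threshold_db
    score near 1, frames well below it score near 0.
    """
    probs = []
    for frame in split_frames(first_signal, frame_len):
        power = sum(s * s for s in frame) / len(frame)
        level_db = 10.0 * math.log10(power + 1e-12)  # floor avoids log(0)
        probs.append(1.0 / (1.0 + math.exp(-slope * (level_db - threshold_db))))
    return probs
```

Running the detector on the first signal rather than on the input means the energy is measured after noise removal, so background noise is less likely to be mistaken for speech.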

The generating of the first signal may include generating the first signal by performing a noise removal algorithm in the time domain of the input acoustic signal.

The generating of the first signal may include generating the first signal based on a deep learning model trained to output a noise-removed acoustic signal from an acoustic signal containing noise.

An acoustic signal processing method according to one aspect includes: generating a first signal by removing noise from an input acoustic signal including a voice signal and the noise; obtaining a speech presence probability corresponding to the first signal; obtaining a parameter relating to the degree of removal of the noise; and controlling the degree of removal of the noise by combining the input acoustic signal and the first signal based on at least one of the speech presence probability and the parameter.

The controlling may include: in a section in which the speech presence probability is high, controlling the degree of removal of the noise by computing a weighted sum with a higher weight on the input acoustic signal than on the first signal, based on the parameter; and in a section in which the speech presence probability is low, controlling the degree of removal of the noise by computing a weighted sum with a higher weight on the first signal than on the input acoustic signal, based on the parameter.

The controlling may further include: dividing the first signal into a voice section and a non-voice section by applying a voice activity detection algorithm to the first signal; computing, based on the parameter, a weighted sum of the first signal in the voice section and the input acoustic signal in the voice section with a higher weight on the input acoustic signal than on the first signal; and computing, based on the parameter, a weighted sum of the first signal in the non-voice section and the input acoustic signal in the non-voice section with a higher weight on the first signal than on the input acoustic signal.

An acoustic signal processing apparatus according to one aspect includes at least one processor that obtains an input acoustic signal including a voice signal and noise, generates a first signal by removing the noise from the input acoustic signal, obtains a parameter that is input through a user interface for noise control and that controls the intensity of the noise, and generates an output acoustic signal by computing a weighted sum of the input acoustic signal and the first signal based on the parameter.

The processor may obtain a signal-to-noise ratio (SNR) value of the input acoustic signal and determine, based on the SNR value, whether to activate the user interface.

In generating the output acoustic signal, the processor may obtain a speech presence probability for each section of the first signal and compute, section by section, a weighted sum of the input acoustic signal and the first signal based on the speech presence probability and the parameter.

In computing the weighted sum, the processor may, in a section in which the speech presence probability is high, place a higher weight on the input acoustic signal than on the first signal based on the parameter, and, in a section in which the speech presence probability is low, place a higher weight on the first signal than on the input acoustic signal based on the parameter.

In computing the weighted sum, the processor may generate, by computing weighted sums of the input acoustic signal and the first signal based on the parameter, a second signal in which the input acoustic signal has the larger share and a third signal in which the first signal has the larger share, and may compute, section by section, a weighted sum of the second signal and the third signal based on the speech presence probability.

In generating the first signal, the processor may generate the first signal by performing a noise removal algorithm in the time domain of the input acoustic signal.

In generating the first signal, the processor may generate the first signal based on a deep learning model trained to output a noise-removed acoustic signal from an acoustic signal containing noise.

An acoustic signal processing apparatus according to one aspect includes at least one processor that generates a first signal by removing noise from an input acoustic signal including a voice signal and the noise, obtains a speech presence probability corresponding to the first signal, obtains a parameter relating to the degree of removal of the noise, and controls the degree of removal of the noise by combining the input acoustic signal and the first signal based on at least one of the speech presence probability and the parameter.

By providing a user interface that can control the intensity of background noise in an environment in which a voice signal is received, the user can control the degree of noise removal of the voice signal in real time while receiving it, and can intuitively and appropriately control background noise removal in situations where noise removal is needed.

By synthesizing an output signal in which the transition between sections that contain speech and sections that do not is natural, a noise-removed signal of improved quality can be output.

FIG. 1 is an exemplary diagram of a voice signal processing system according to an embodiment.
FIGS. 2A and 2B are exemplary diagrams of a noise control interface configuration provided on a user's terminal.
FIG. 3 is an exemplary diagram of a noise control interface screen provided during a group video call.
FIG. 4 is an exemplary diagram of a voice signal processing system according to an embodiment.
FIGS. 5 and 6 are flowcharts of an acoustic signal processing method according to an embodiment.

Specific structural or functional descriptions of the embodiments are disclosed for illustrative purposes only, and the embodiments may be modified and implemented in various forms. Accordingly, actual implementations are not limited to the specific embodiments disclosed, and the scope of this specification includes modifications, equivalents, and substitutes that fall within the technical idea described through the embodiments.

Although terms such as "first" and "second" may be used to describe various components, these terms should be interpreted only as distinguishing one component from another. For example, a first component may be termed a second component, and similarly, a second component may be termed a first component.

When a component is described as being "connected" to another component, it may be directly connected or coupled to that other component, but it should be understood that intervening components may also be present.

Singular expressions include plural expressions unless the context clearly indicates otherwise. In this specification, terms such as "include" or "have" are intended to indicate that the described features, numbers, steps, operations, components, parts, or combinations thereof exist, and should be understood as not precluding the existence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the relevant art. Terms such as those defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in this specification.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. In the description, the same components are given the same reference numerals regardless of the drawing in which they appear, and redundant descriptions of them are omitted.

FIG. 1 is an exemplary diagram of a voice signal processing system according to an embodiment.

Referring to FIG. 1, a voice signal processing system 100 according to an embodiment may correspond to a system that controls the degree of noise removal of an acoustic signal.

The system 100 according to an embodiment may include an apparatus 110 that receives an acoustic signal containing noise and generates an output acoustic signal whose degree of noise removal is controlled according to a set noise intensity. In other words, the apparatus 110 controls the degree of noise removal by generating an output acoustic signal with a low degree of noise removal from the input acoustic signal when the noise intensity is set high, and an output acoustic signal with a high degree of noise removal when the noise intensity is set low. The apparatus 110 performs the acoustic signal processing method according to an embodiment, that is, it processes an input acoustic signal according to that method to generate an output acoustic signal. The apparatus 110 may include at least one processor, a memory, and an input/output device, and may correspond to, for example, a server or a user's device (for example, a mobile phone or a computer).

The memory of the apparatus 110 according to an embodiment may be a volatile memory or a non-volatile memory, and may store information related to the acoustic signal processing method or a program that implements the method.

The at least one processor of the apparatus 110 according to an embodiment may perform the acoustic signal processing method. The processor may execute a program and control the apparatus 110. The code of the program executed by the processor may be stored in the memory of the apparatus 110.

The apparatus 110 according to an embodiment may be connected to an external device (for example, a personal computer or a network) through the input/output device to exchange data, and may receive input data from a user or provide output data to the user.

Referring to FIG. 1, the apparatus 110 may include a noise removal unit 111 and a noise control unit 112. Each component of the voice signal processing apparatus 110 may correspond to a module, and each component may perform a process for acoustic signal processing. The configuration of the apparatus 110 shown in FIG. 1 is divided into functional units in order to describe the acoustic signal processing method according to an embodiment, and does not limit the configuration of the apparatus. As described above, the acoustic signal processing method according to an embodiment may be performed by the at least one processor of the apparatus 110.

The noise removal unit 111 according to an embodiment may correspond to a module that receives an acoustic signal 101 containing a voice signal and noise and outputs a noise-removed signal. The noise included in the input acoustic signal 101 may include background noise that arises when sounds other than the voice signal, generated in the surrounding environment, are picked up by the input device. The noise removal unit 111 may remove the noise from the input acoustic signal 101 using a noise removal algorithm. The noise removal algorithm may include any of various algorithms for removing noise from an acoustic signal, such as an algorithm that uses statistical techniques or an algorithm that uses a deep learning model.

According to an embodiment, the noise removal algorithm may include an algorithm performed in the time domain of the acoustic signal, that is, an algorithm that removes noise from an acoustic signal expressed in the time domain. A noise removal algorithm performed in the time domain of an acoustic signal is distinct from one performed in the frequency domain.

For example, the noise removal algorithm may include estimating a noise-removed signal by applying a one-dimensional convolution to an acoustic signal expressed in the time domain, using a deep learning model trained by supervised learning to output a noise-removed acoustic signal from an acoustic signal containing noise. Here, training the deep learning model may include mapping the result of applying the one-dimensional convolution to a noisy acoustic signal expressed in the time domain onto the corresponding noise-removed acoustic signal. Hereinafter, the acoustic signal obtained by removing noise, by means of the noise removal algorithm, from the acoustic signal 101 that is input to the system and contains noise and a voice signal is referred to as the first signal.
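To make the basic operation concrete, the sketch below applies a single one-dimensional convolution directly to time-domain samples. It is not the trained model: a real model would stack many such layers with learned kernels and nonlinearities, whereas the fixed kernels and the function name here are hypothetical and serve only to show how a kernel slides over the waveform.

```python
def conv1d_same(signal, kernel):
    """Apply a 1-D convolution with zero padding so output length equals input.

    As in deep learning frameworks, the kernel is applied without flipping.
    The kernel length is assumed to be odd.
    """
    half = len(kernel) // 2
    padded = [0.0] * half + list(signal) + [0.0] * half
    return [sum(k * padded[i + j] for j, k in enumerate(kernel))
            for i in range(len(signal))]


# Hypothetical smoothing kernel standing in for learned weights.
smoothing_kernel = [1.0 / 3.0] * 3
smoothed = conv1d_same([3.0, 3.0, 3.0, 3.0], smoothing_kernel)
```

Working on samples directly, as here, is what distinguishes a time-domain approach from one that first transforms the signal into the frequency domain.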

The noise control unit 112 according to an embodiment may correspond to a module that combines the input acoustic signal 101 with the first signal, from which the noise of the input acoustic signal 101 has been removed, to generate an output acoustic signal 102 with a controlled degree of noise removal. The noise control unit 112 may generate the output acoustic signal 102 using a noise control algorithm. The degree of noise removal may be controlled based on input received through a noise control interface 120.

According to an embodiment, the noise control algorithm that generates the output acoustic signal 102 may include computing a weighted sum of the first signal and the input acoustic signal 101. The weight for the first signal and the weight for the input acoustic signal 101 may be determined based on a parameter, obtained through the noise control interface 120, that controls the intensity of the noise. According to an embodiment, this parameter may be input from a user terminal through the noise control interface 120.

For example, the output acoustic signal 102 may be expressed as Equation 1 below.
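Equation 1 is not reproduced in this text. Under the weight assignment stated in this description, where α is the weight of the first signal and (1-α) is the weight of the input acoustic signal, one consistent form would be the following. This is a reconstruction offered as an assumption; the frame-wise example elsewhere in this description places α on the input acoustic signal instead, so the direction of α should be checked against the original publication.

```latex
z(t) = (1 - \alpha)\, x(t) + \alpha\, y(t), \qquad 0 \le \alpha \le 1
```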

Here, x(t), y(t), and z(t) are acoustic signals expressed in the time domain, that is, as functions of time t, and correspond to the input acoustic signal, the first signal, and the output signal, respectively. α is a parameter relating to the intensity of the noise that is input from the user terminal through the noise control interface 120, and may take a real value from 0 to 1 inclusive.

According to an embodiment, the parameter α relating to the intensity of the noise is used as the weight of the first signal and (1-α) as the weight of the input acoustic signal 101, and the output acoustic signal 102 may be generated by computing the weighted sum of the first signal and the input acoustic signal 101 with these weights.
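The weighted sum with these weights can be sketched in a few lines of Python. The function name and the range check are illustrative additions; the weights follow the assignment described above (α on the first signal, 1-α on the input acoustic signal).

```python
def weighted_sum(x, y, alpha):
    """Combine the input signal x and the first signal y sample by sample.

    alpha weights the first signal (noise removed) and (1 - alpha) weights
    the input signal, so a single noise removal result can be rendered at
    any strength without running additional models.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(x) != len(y):
        raise ValueError("signals must have the same length")
    return [(1.0 - alpha) * xi + alpha * yi for xi, yi in zip(x, y)]
```

With `alpha = 0.0` the output equals the input acoustic signal, and with `alpha = 1.0` it equals the first signal; intermediate values interpolate between the two.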

According to an embodiment, the weighted sum of the input acoustic signal and the first signal may be computed section by section or frame by frame. A frame is a section obtained by dividing a signal expressed in the time domain into fixed time units. Computing the weighted sum frame by frame means combining a first frame of the input acoustic signal with a first frame of the first signal according to the weights, where the two first frames correspond to the same time interval. For example, when the frame size is 1 s, the input acoustic signal and the first signal of the 0 to 1 s frame are weighted and summed, then those of the 1 to 2 s frame, and so on, so that the entire input acoustic signal and the entire first signal are combined. The weights of the input acoustic signal and the first signal may differ from frame to frame. For example, when the frame size is 1 s, the input acoustic signal x(0) and the first signal y(0) of the 0 to 1 s frame may be combined as α₀x(0)+(1-α₀)y(0), and the input acoustic signal x(1) and the first signal y(1) of the 1 to 2 s frame may be combined as α₁x(1)+(1-α₁)y(1), where α₀ and α₁ may be determined independently of each other.
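The frame-by-frame combination can be sketched as follows. The example follows the formula as written in the paragraph above, with each frame's weight α applied to the input acoustic signal and (1-α) to the first signal; the function name and the use of a sample count for the frame length are assumptions.

```python
def framewise_weighted_sum(x, y, frame_len, alphas):
    """Combine x and y frame by frame with an independent weight per frame.

    x: input acoustic signal, y: first signal (same length).
    frame_len: number of samples per frame.
    alphas: one weight per frame, applied to x; (1 - weight) applies to y.
    """
    out = []
    for k, start in enumerate(range(0, len(x), frame_len)):
        a = alphas[k]
        x_frame = x[start:start + frame_len]
        y_frame = y[start:start + frame_len]
        out.extend(a * xi + (1.0 - a) * yi for xi, yi in zip(x_frame, y_frame))
    return out
```

For instance, with two-sample frames and `alphas = [1.0, 0.5]`, the first frame passes the input through unchanged and the second frame is an even mix of the two signals.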

According to an embodiment, the noise control unit 112 may generate the output sound signal 102 taking the speech presence probability into account. This is described in detail below with reference to FIG. 4.

The system 100 according to an embodiment may include a noise control interface 120. The noise control interface 120 may include a user interface (UI) that supports interaction between a user and the system 100 for noise control. The noise control interface 120 may be installed on a device other than the device 110 that performs the sound signal processing method, and may be connected to the device 110 through a network to exchange data. For example, the device 110 may be a server for sound signal processing, and the noise control interface 120 may be installed on a user terminal, transmit data received at the user terminal to the device 110 through the network, and receive data from the device 110 through the network and display it on the user terminal.

Unlike the configuration shown in FIG. 1, the noise control interface 120 according to an embodiment may instead be installed on the device 110 that performs the sound signal processing method. For example, the device 110 may be a user terminal, and the noise control interface 120 may be a user interface provided on that terminal.

The sound signal processing device 110 according to an embodiment may obtain a noise-control input entered through the noise control interface 120. The noise-control input may include an input that controls the noise intensity. For example, a user may enter a parameter controlling the noise intensity through the interface 120, and the sound signal processing device 110 may obtain the entered parameter through the network.

The noise intensity refers to the level of noise contained in the output sound signal 102, and the degree to which noise is removed from the input sound signal 101 may be determined by the noise intensity setting. For example, a low noise intensity setting means that the output sound signal should contain little noise, so the degree of noise removal may be set high. Conversely, when the noise intensity is set high, the degree of noise removal may be set low.

For example, the configuration of the noise control interface provided on the user terminal is illustrated in FIGS. 2A and 2B. Referring to FIG. 2A, the user may set the intensity of the noise contained in the output sound signal 102 by moving the displayed bar 201 along the reference line 202 of the provided noise control interface. The further to the left the bar 201 is on the reference line 202, the lower the noise intensity it indicates, and the further to the right, the higher the noise intensity. Referring to FIGS. 2A and 2B, the bar 211 shown in FIG. 2B is located further to the right on the reference line than the bar 201 shown in FIG. 2A, so the noise intensity is set higher in FIG. 2B than in FIG. 2A. The noise control interface according to an embodiment may receive a user input that changes the position of the bar 201 and may obtain a noise intensity parameter by parameterizing the point on the reference line 202 at which the bar 201 is located.

The interface configuration shown in FIGS. 2A and 2B is only one example of the noise control interface 120 according to an embodiment. The noise control interface 120 may also include an interface for entering the noise intensity as a numerical value, and may include interfaces that support various inputs such as touch input on a touch screen, key input on a keyboard, and click input with a mouse.

By adjusting the intensity of the noise contained in the output sound signal 102 through the noise control interface 120, the user may send a noise-control input to the system. Based on the noise-control input received through the interface, the system may adjust the degree of noise removal and generate the output sound signal 102.

According to an embodiment, the noise control interface 120 may be provided in an environment in which a voice signal is received. For example, the noise control interface 120 may be provided in an environment in which the voice signal of the other party is received during a voice call or a video call.

For example, FIG. 3 illustrates a noise control interface provided during a group video call. Referring to FIG. 3, when a group video call is conducted through a user terminal, the noise control interface may be provided in an environment in which video and audio transmitted from the terminals of the other parties are received. In a group video call, a noise control interface may be provided for the voice signal of each of the plurality of call parties, and the user may control the noise intensity of the voice signal received from a given party through the noise control interface displayed below the area in which that party's video is shown. By controlling the noise intensity of the voice signal received from each party through the interface corresponding to that party, the user may receive each party's voice signal with an adjusted degree of noise removal. In addition, the noise intensities corresponding to the plurality of call parties may be set differently from one another, so the sound signals may be output with the degree of noise removal adjusted differently according to each setting.

Referring again to FIG. 1, whether the noise control interface 120 according to an embodiment is activated may be determined based on the signal-to-noise ratio (SNR) of the input sound signal 101.

The signal-to-noise ratio (SNR) expresses the ratio of the received signal to the noise. When the strength of the received signal is V_s and the strength of the noise is V_n, the SNR is defined as a·log(V_s/V_n) (dB), where a is a positive constant. A larger SNR means that the signal is stronger relative to the noise. For example, speech sounds louder and noise quieter when the SNR of a speech signal is 20 dB than when it is 7 dB; 0 dB means that the speech and the noise have the same level; and a negative SNR such as -10 dB means that the noise is louder than the speech.
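As a rough illustration of the definition above, the SNR in decibels can be computed as in the following sketch. The description only states that a is a positive constant; the base-10 logarithm and the default a = 10 (the usual choice for power ratios, with a = 20 for amplitude ratios) are assumptions, as is the function name `snr_db`.

```python
import math


def snr_db(signal_strength: float, noise_strength: float, a: float = 10.0) -> float:
    """Return a * log10(V_s / V_n) in dB.

    a = 10 is the conventional constant for power ratios and a = 20 for
    amplitude ratios; the description itself leaves a unspecified.
    """
    if signal_strength <= 0 or noise_strength <= 0:
        raise ValueError("strengths must be positive")
    return a * math.log10(signal_strength / noise_strength)
```

With equal signal and noise strengths the result is 0 dB, and a noise strength larger than the signal strength yields a negative value, matching the interpretation given in the paragraph above.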

The device 110 according to an embodiment may obtain the SNR value of the input sound signal 101 and may determine, based on the SNR value, whether to activate the noise control interface 120. For example, when the SNR value is below a predetermined reference value, the input sound signal 101 may be judged to contain a large amount of noise, and it may be determined that the noise control interface 120 is to be activated.
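The activation decision in this example amounts to a single comparison, as in the sketch below. The function name `should_activate_interface` and the 10 dB default reference value are hypothetical; the description only says that the reference value is predetermined.

```python
def should_activate_interface(snr_db_value: float, reference_db: float = 10.0) -> bool:
    """Activate the noise control interface when the SNR is below the reference.

    reference_db is a hypothetical placeholder for the predetermined
    reference value mentioned in the description.
    """
    return snr_db_value < reference_db
```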

According to an embodiment, when it is determined that the interface 120 is to be activated, the interface 120 is activated, and the noise removal algorithm and the noise control algorithm may be performed in the device 110.

Referring again to FIG. 3, in a group video call a noise control interface may be provided for the voice signal of each of the plurality of call parties, and whether each noise control interface is activated may be determined based on the SNR value of the input sound signal received from the corresponding party. Whether a noise control interface is activated may be indicated on the interface in various ways. For example, it may be indicated by the difference in color between the figures 301 and 302 and between the figures 303 and 304 displayed on the interface. When the noise control interface is activated, the user can move the bar 305 representing the noise intensity through an input device in order to enter a parameter controlling the noise intensity. When the noise control interface is not activated, the user cannot move the bar 306 representing the noise intensity through the input device.

FIG. 4 is an exemplary diagram of a voice signal processing system according to an embodiment.

Referring to FIG. 4, the device that performs the sound signal processing method according to an embodiment may further include a voice activity detection unit for obtaining a speech presence probability. As in FIG. 1, the components shown in FIG. 4 are divided by function only to explain the sound signal processing method according to an embodiment, and this division does not limit the configuration of the device.

The voice activity detection unit 410 according to an embodiment may be a module that obtains the speech presence probability of the first signal, that is, the input sound signal from which noise has been removed. The voice activity detection unit 410 may obtain the speech presence probability of the first signal by using a voice activity detection (VAD) algorithm. A voice activity detection algorithm distinguishes speech sections from silent (or non-speech) sections in a sound signal, and may be any of various algorithms that detect voice activity. The algorithm may output a speech presence probability, which is the probability that the sound signal given as its input is a speech signal.

The voice activity detection unit 410 according to an embodiment may obtain a speech presence probability for each frame of the first signal according to the voice activity detection algorithm.

The noise control unit 420 according to an embodiment may generate the output sound signal by weighted-summing the input sound signal and the first signal based on the speech presence probability and the parameter controlling the noise intensity.

The noise control unit 420 according to an embodiment may weighted-sum the input sound signal and the first signal based on the parameter controlling the noise intensity to generate a second signal, in which the input sound signal has the larger share, and a third signal, in which the first signal has the larger share. It may then generate the output sound signal by weighted-summing the second signal and the third signal based on the speech presence probability of the first signal.

For example, when the parameter controlling the noise intensity entered through the noise control interface is set to α₁ (where α₁ is any real number of at least 0 and less than 0.5), the input sound signal x(t) and the first signal y(t) may be weighted-summed as in Equation 2 below to synthesize a second signal z₀(t) in which the input sound signal x(t) has the larger share.
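The equation image is not reproduced in this text. Judging from the weights named in the following paragraph, (1-α₁) on the input sound signal x(t) and α₁ on the first signal y(t), Equation 2 can presumably be written as:

```latex
z_0(t) = (1 - \alpha_1)\, x(t) + \alpha_1\, y(t), \qquad 0 \le \alpha_1 < 0.5
```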

Here, since α₁ is any real number of at least 0 and less than 0.5, the weight (1-α₁) of the input sound signal x(t) is greater than the weight α₁ of the first signal y(t). Accordingly, a second signal z₀(t) in which the input sound signal x(t) has a larger share than the first signal y(t) may be generated.

In addition, the input sound signal x(t) and the first signal y(t) may be weighted-summed as in Equation 3 below to synthesize a third signal z₁(t) in which the first signal y(t) has the larger share.
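The equation image is not reproduced in this text. Judging from the weights named in the surrounding paragraphs, (1-α₂) on the input sound signal x(t) and α₂ on the first signal y(t), Equation 3 can presumably be written as:

```latex
z_1(t) = (1 - \alpha_2)\, x(t) + \alpha_2\, y(t)
```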

Here, α₂ is a parameter determined from the α₁ entered through the noise control interface, and may be a parameter that becomes smaller as α₁ becomes larger. For example, α₂ may be determined as (1-α₁).

When α₂ is determined as (1-α₁), α₂ is a real number greater than 0.5 and at most 1 for any α₁ of at least 0 and less than 0.5, so the weight (1-α₂) of the input sound signal x(t) is smaller than the weight α₂ of the first signal y(t). Accordingly, a third signal z₁(t) in which the first signal y(t) has a larger share than the input sound signal x(t) may be generated.

The output sound signal z(t) may be generated by weighted-summing the second signal z₀(t) and the third signal z₁(t) based on the speech presence probability p of the first signal, as in Equation 4 below.
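The equation image is not reproduced in this text. The description states that a higher speech presence probability gives the second signal a larger share of the output, and later that p=1 and p=0 select the speech and non-speech cases, so Equation 4 can presumably be written as:

```latex
z(t) = p\, z_0(t) + (1 - p)\, z_1(t), \qquad 0 \le p \le 1
```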

According to an embodiment, the first signal and the input sound signal are synthesized so that, when the speech presence probability is high, the output sound signal contains more of the second signal, in which the input sound signal has the larger share, and, when the speech presence probability is low, it contains more of the third signal, in which the first signal has the larger share. In this way an output sound signal whose degree of noise removal is controlled section by section may be generated. By adjusting the weights α₁ and α₂ used to synthesize the second signal and the third signal, an output sound signal may be generated in which the parts that contain speech and the parts that do not are joined naturally.
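The two-stage blend described above can be sketched for a single sample as follows. The equation images are not reproduced in this text, so the formulas in the sketch are inferred from the weights stated in the description; the function name `soft_blend` and the choice α₂ = 1-α₁ (given in the description only as one example) are assumptions.

```python
def soft_blend(x: float, y: float, alpha1: float, p: float) -> float:
    """Blend one sample of the input signal x and the noise-removed signal y.

    alpha1: noise intensity parameter, assumed to satisfy 0 <= alpha1 < 0.5.
    p:      speech presence probability of the first signal, 0 <= p <= 1.
    """
    if not 0.0 <= alpha1 < 0.5:
        raise ValueError("alpha1 is assumed to lie in [0, 0.5)")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    alpha2 = 1.0 - alpha1                     # example choice from the description
    z0 = (1.0 - alpha1) * x + alpha1 * y      # second signal: mostly input
    z1 = (1.0 - alpha2) * x + alpha2 * y      # third signal: mostly noise-removed
    return p * z0 + (1.0 - p) * z1            # inferred form of the final mix
```

With p close to 1 the result follows the second signal and keeps more of the original input, while with p close to 0 it follows the third signal and keeps more of the noise-removed signal, so the transition between speech and non-speech parts is gradual rather than abrupt.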

According to an embodiment, when the speech presence probability p of the first signal is obtained for each frame, the noise control unit 420 may generate the output sound signal by weighted-summing the input sound signal and the first signal frame by frame based on the speech presence probability and the parameter controlling the noise intensity. For example, when the speech presence probability in frame t₁ of the first signal is p₁, the output sound signal z(t₁) corresponding to frame t₁ may be expressed as in Equation 5 below.
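The equation image is not reproduced in this text. Assuming it applies the same probability-weighted sum of the second and third signals to frame t₁, Equation 5 can presumably be written as:

```latex
z(t_1) = p_1\, z_0(t_1) + (1 - p_1)\, z_1(t_1)
```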

According to an embodiment, when the speech presence probability p of the first signal is obtained for each frame, the noise control unit 420 may generate the output sound signal by determining whether each frame of the first signal is a frame with a high speech presence probability or a frame with a low speech presence probability.

Whether a given frame of the first signal has a high or a low speech presence probability may be determined by comparing the speech presence probability obtained for that frame with a predetermined threshold. For example, with a threshold of 0.5, a frame whose speech presence probability is 0.5 or higher may be determined to be a frame with a high speech presence probability, and a frame whose probability is below 0.5 a frame with a low speech presence probability. According to an embodiment, frames with a high speech presence probability may be classified as speech sections and frames with a low speech presence probability as non-speech sections. In other words, the frames of the first signal may be divided into speech sections and non-speech sections based on the per-frame speech presence probability output by applying the voice activity detection algorithm to the first signal.

For frames of the first signal with a high speech presence probability, the noise control unit 420 according to an embodiment may compute the weighted sum with a higher weight on the input sound signal than on the first signal, based on the parameter controlling the noise intensity. For frames of the first signal with a low speech presence probability, the noise control unit 420 may compute the weighted sum with a higher weight on the first signal than on the input sound signal, based on the parameter controlling the noise intensity.

For example, when the parameter controlling the noise intensity entered through the noise control interface is set to α₁ (where α₁ is any real number from 0 to 0.5 inclusive), in a frame t₁ of the first signal with a high speech presence probability, the first signal y(t₁) and the input sound signal x(t₁) may be synthesized as in Equation 6 below.
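The equation image is not reproduced in this text. Judging from the weights named in the following paragraph, (1-α₁) on the input sound signal x(t₁) and α₁ on the first signal y(t₁), Equation 6 can presumably be written as:

```latex
z(t_1) = (1 - \alpha_1)\, x(t_1) + \alpha_1\, y(t_1), \qquad 0 \le \alpha_1 \le 0.5
```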

Here, since α₁ is any real number from 0 to 0.5 inclusive, the weight (1-α₁) of the input sound signal x(t₁) is greater than or equal to the weight α₁ of the first signal y(t₁). Accordingly, in a frame with a high speech presence probability, an output sound signal z(t₁) in which the input sound signal x(t₁) has a share at least as large as that of the first signal y(t₁) may be generated.

On the other hand, in a frame t₂ of the first signal with a low speech presence probability, the first signal y(t₂) and the input sound signal x(t₂) may be synthesized as in Equation 7 below.
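The equation image is not reproduced in this text. Judging from the weights named in the following paragraph, (1-α₂) on the input sound signal x(t₂) and α₂ on the first signal y(t₂), Equation 7 can presumably be written as:

```latex
z(t_2) = (1 - \alpha_2)\, x(t_2) + \alpha_2\, y(t_2)
```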

When α₂ is determined as (1-α₁), α₂ is a real number from 0.5 to 1 inclusive for any α₁ from 0 to 0.5 inclusive, so the weight (1-α₂) of the input sound signal x(t₂) is less than or equal to the weight α₂ of the first signal y(t₂). Accordingly, in a frame with a low speech presence probability, an output sound signal z(t₂) in which the first signal has a share at least as large as that of the input sound signal may be generated.

According to an embodiment, generating the output sound signal by determining whether each frame of the first signal has a high or a low speech presence probability (or whether it is a speech section or a non-speech section) can be understood as generating the output sound signal by weighted-summing the second signal and the third signal in Equation 4 above with p=1 for frames with a high speech presence probability and p=0 for frames with a low speech presence probability.
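The hard-decision variant can be sketched as below. The equation images are not reproduced in this text, so the two branch formulas are inferred from the weights stated in the description; the function name `hard_blend`, the choice α₂ = 1-α₁, and the 0.5 threshold (both given in the description only as examples) are assumptions.

```python
def hard_blend(
    x: float, y: float, alpha1: float, p: float, threshold: float = 0.5
) -> float:
    """Per-frame blend that treats each frame as either speech or non-speech.

    alpha1: noise intensity parameter, assumed to satisfy 0 <= alpha1 <= 0.5.
    p:      speech presence probability of the frame.
    """
    if not 0.0 <= alpha1 <= 0.5:
        raise ValueError("alpha1 is assumed to lie in [0, 0.5]")
    alpha2 = 1.0 - alpha1                      # example choice from the description
    if p >= threshold:
        # Speech frame: same result as the probability-weighted mix with p = 1.
        return (1.0 - alpha1) * x + alpha1 * y
    # Non-speech frame: same result as the probability-weighted mix with p = 0.
    return (1.0 - alpha2) * x + alpha2 * y
```

Compared with weighting by the raw probability, this variant switches between the two mixes at the threshold, which is simpler but can make the change between adjacent frames more abrupt.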

FIGS. 5 and 6 are flowcharts of a sound signal processing method according to an embodiment.

More specifically, FIGS. 5 and 6 illustrate the operation of the sound signal processing system in an embodiment in which the sound signal processing device is a server and the noise control interface is installed on a user terminal.

Referring to FIG. 5, the server may receive (510), through a network, an input sound signal captured by an input device such as the microphone of a user terminal. The server may obtain the first signal by performing (530) the noise removal algorithm described above on the received input sound signal. The user terminal may receive a noise-control input from the user through the noise control interface and forward it to the server, and the server may obtain (520) the noise-control input received through the noise control interface. According to an embodiment, obtaining the noise-control input and performing the noise removal algorithm may be carried out in parallel.

The server may generate the output sound signal by performing (540) a noise control algorithm that weighted-sums the first signal and the input sound signal based on the obtained noise-control input. The server may transmit (550) the output sound signal to the user terminal through the network, and the user terminal that receives the output sound signal may play it through an output device such as a speaker.

According to an embodiment, the input sound signal received at the server may be a sound signal on which a noise removal algorithm has already been performed at the user terminal. In other words, the user terminal may perform a process of removing noise from the sound signal received through its input device.

Referring to FIG. 6, before performing the subsequent operations shown in FIG. 5, the server that has received the input sound signal from the user terminal may obtain (610) an SNR value from the received input sound signal and determine whether to activate the noise control interface. The SNR value may be calculated by a processor of the server, or may be calculated by an external device and obtained by the server.

When the obtained SNR value meets a predetermined criterion, the server may activate (620) the noise control interface of the user terminal. The predetermined criterion may be an SNR-related criterion for deciding whether to perform the noise removal and noise control algorithms on the input sound signal. For example, the SNR value may be compared with a predetermined threshold or threshold range, and it may be determined that the noise control interface is to be activated when the SNR value is greater than or less than the threshold, or when the SNR value falls within the threshold range.

When the server activates the noise control interface, the interface may be activated on the user terminal, for example by being displayed, and the user may exchange data with the server through the noise control interface.

When the server determines, based on the SNR value, that the noise control interface is to be activated, the noise removal algorithm may be performed. In addition, because the server can obtain a noise-control input through the noise control interface, the noise control algorithm may be performed.

On the other hand, when the SNR value does not meet the predetermined criterion, the server may not activate the noise control interface, and the noise removal algorithm may not be performed. When the noise control interface is not activated on the user terminal, the user cannot enter a noise-control parameter through the interface. When the noise removal algorithm is not performed on the server and no noise-control parameter entered through the noise control interface is received, the noise control algorithm is not performed either. In this case, the sound signal input to the user terminal may be output through the output device without undergoing any separate noise removal or noise control processing on the server.

FIGS. 5 and 6 illustrate the case in which the user terminal that transmits the input sound signal to the server and the user terminal that receives the output sound signal from the server are the same terminal. According to an embodiment, however, these may be different terminals. For example, in a video call or voice call environment, a sound signal input through one user's terminal is output through the terminal of the other party, so the user terminal that transmits the input sound signal to the server through the network and the user terminal that receives the output sound signal from the server through the network may be different terminals.

The embodiments described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the devices, methods, and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but those of ordinary skill in the art will appreciate that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on a computer-readable recording medium.

The method according to the embodiments may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that a computer can execute using an interpreter or the like.

The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of drawings, those of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or the components of the described systems, structures, devices, circuits, and the like are combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims that follow.

Claims

A sound signal processing method performed by a processor, the method comprising:
obtaining an input sound signal including a voice signal and noise;
generating a first signal by removing the noise from the input sound signal;
obtaining a voice presence probability corresponding to the first signal;
obtaining a parameter that is input through a user interface for noise control and that controls the strength of the noise;
generating, based on the parameter, a second signal and a third signal that have different synthesis ratios of the input sound signal and the first signal; and
generating an output sound signal by computing a weighted sum of the second signal and the third signal based on the voice presence probability.
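
By way of non-limiting illustration only, the two-stage mixing recited above can be sketched in Python as follows. The function names, the mapping from the parameter `alpha` to the two synthesis ratios (`0.4 * alpha` and `0.6 + 0.4 * alpha`), and the use of the voice presence probability `p` directly as the weight of the second signal are assumptions chosen for this sketch; the claim does not prescribe them.

```python
from typing import List, Sequence


def blend(x: Sequence[float], d: Sequence[float], share_of_d: float) -> List[float]:
    """Mix input x and denoised first signal d; share_of_d is d's proportion."""
    return [(1.0 - share_of_d) * xi + share_of_d * di for xi, di in zip(x, d)]


def mix_output(
    x: Sequence[float],  # input sound signal
    d: Sequence[float],  # first signal (noise removed)
    alpha: float,        # user parameter in [0, 1]; larger means less noise
    p: float,            # voice presence probability in [0, 1]
) -> List[float]:
    if len(x) != len(d):
        raise ValueError("x and d must have the same length")
    if not (0.0 <= alpha <= 1.0 and 0.0 <= p <= 1.0):
        raise ValueError("alpha and p must lie in [0, 1]")
    # Hypothetical ratios: the second signal keeps mostly the input
    # (d's share stays at or below 0.4), the third keeps mostly the
    # denoised signal (d's share stays at or above 0.6).
    second = blend(x, d, 0.4 * alpha)
    third = blend(x, d, 0.6 + 0.4 * alpha)
    # Weighted sum of the second and third signals driven by p.
    return [p * s + (1.0 - p) * t for s, t in zip(second, third)]
```

With this mapping, raising `alpha` shifts both intermediate signals toward the denoised signal, while a high `p` keeps more of the original input in sections likely to contain voice.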

The method of claim 1, further comprising:
obtaining a signal-to-noise ratio (SNR) value of the input sound signal; and
determining, based on the SNR value, whether to activate the user interface.

The method of claim 1, wherein
the voice presence probability is obtained for each section of the first signal, and
the output sound signal is generated by computing, for each section, a weighted sum of the second signal and the third signal based on the parameter and the voice presence probability obtained for that section of the first signal.

delete

The method of claim 1, wherein generating the output sound signal comprises:
when the voice presence probability is greater than or equal to a predetermined threshold, computing a weighted sum of the second signal and the third signal for each section by assigning, based on the voice presence probability, a higher weight to the second signal than to the third signal; and
when the voice presence probability is less than the predetermined threshold, computing a weighted sum of the second signal and the third signal for each section by assigning, based on the voice presence probability, a higher weight to the third signal than to the second signal.
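
By way of non-limiting illustration only, the thresholded per-section weighting recited above can be sketched in Python as follows. The weight formulas (`0.5 + 0.5 * p` at or above the threshold and `0.5 * p` below it), the default threshold of 0.5, and the fixed `section_len` are assumptions chosen so that the required ordering of the two weights always holds; the claim does not specify them.

```python
from typing import List, Sequence, Tuple


def section_weights(p: float, threshold: float = 0.5) -> Tuple[float, float]:
    """Return (weight_second, weight_third) for one section."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must lie in (0, 1]")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    if p >= threshold:
        w_second = 0.5 + 0.5 * p  # above 0.5, so the second signal dominates
    else:
        w_second = 0.5 * p        # below 0.5, so the third signal dominates
    return w_second, 1.0 - w_second


def mix_sections(
    second: Sequence[float],
    third: Sequence[float],
    probs: Sequence[float],  # one voice presence probability per section
    section_len: int,        # samples per section (hypothetical unit)
    threshold: float = 0.5,
) -> List[float]:
    if len(second) != len(third):
        raise ValueError("second and third must have the same length")
    if section_len <= 0:
        raise ValueError("section_len must be positive")
    out: List[float] = []
    for i, (s, t) in enumerate(zip(second, third)):
        k = i // section_len  # index of the section this sample belongs to
        if k >= len(probs):
            raise ValueError("not enough probabilities for the signal length")
        w2, w3 = section_weights(probs[k], threshold)
        out.append(w2 * s + w3 * t)
    return out
```

For example, a section with probability 0.8 weights the second signal at 0.9, whereas a section with probability 0.2 weights it at 0.1.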

The method of claim 1, wherein obtaining the voice presence probability comprises:
obtaining, for each of the frames generated by dividing the first signal into specific time units, a probability that the voice signal is included in that frame, based on a voice activity detection algorithm.
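
By way of non-limiting illustration only, the frame-wise probability recited above can be sketched in Python as follows. The claim does not name a particular voice activity detection algorithm; this sketch substitutes a simple energy-based stand-in (a logistic function of frame energy in dB), and `frame_len`, `ref_db`, and `slope` are hypothetical tuning values rather than values taken from the embodiments.

```python
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int) -> List[List[float]]:
    """Split the first signal into consecutive frames of frame_len samples."""
    if frame_len <= 0:
        raise ValueError("frame_len must be positive")
    return [list(samples[i:i + frame_len]) for i in range(0, len(samples), frame_len)]


def frame_voice_probabilities(
    samples: Sequence[float],
    frame_len: int,
    ref_db: float = -40.0,  # hypothetical energy level mapped to probability 0.5
    slope: float = 0.5,     # hypothetical steepness of the logistic curve
) -> List[float]:
    """Return one probability in (0, 1) per frame, from frame energy alone."""
    probs: List[float] = []
    for frame in frame_signal(samples, frame_len):
        energy = sum(s * s for s in frame) / len(frame)
        # Floor the energy so that silent frames do not produce log10(0).
        energy_db = 10.0 * math.log10(max(energy, 1e-12))
        probs.append(1.0 / (1.0 + math.exp(-slope * (energy_db - ref_db))))
    return probs
```

A trained detector would normally replace the energy heuristic; the sketch only shows where a per-frame probability fits in the flow.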

The method of claim 1, wherein generating the first signal comprises:
generating the first signal by performing a noise removal algorithm on the input sound signal in the time domain.

The method of claim 1, wherein generating the first signal comprises:
generating the first signal based on a deep learning model trained to receive a sound signal containing noise and to output a sound signal from which the noise has been removed.

delete

A computer program stored in a medium in order to execute, in combination with hardware, the method of any one of claims 1 to 3 and 5 to 8.

A sound signal processing device comprising at least one processor configured to:
obtain an input sound signal including a voice signal and noise;
generate a first signal by removing the noise from the input sound signal;
obtain a voice presence probability corresponding to the first signal;
obtain a parameter that is input through a user interface for noise control and that controls the strength of the noise;
generate, by computing weighted sums of the input sound signal and the first signal based on the parameter, a second signal in which the proportion of the input sound signal is higher than that of the first signal and a third signal in which the proportion of the first signal is higher than that of the input sound signal; and
generate an output sound signal by computing a weighted sum of the second signal and the third signal based on the voice presence probability.

The device of claim 13, wherein the processor is further configured to:
obtain a signal-to-noise ratio (SNR) value of the input sound signal; and
determine, based on the SNR value, whether to activate the user interface.

The device of claim 13, wherein
the voice presence probability is obtained for each section of the first signal, and
the second signal and the third signal are generated by computing, for each section, weighted sums of the input sound signal and the first signal based on the parameter and the voice presence probability obtained for that section of the first signal.

delete

The device of claim 13, wherein the processor is configured to, in generating the output sound signal:
when the voice presence probability is greater than or equal to a predetermined threshold, compute a weighted sum of the second signal and the third signal for each section by assigning, based on the voice presence probability, a higher weight to the second signal than to the third signal; and
when the voice presence probability is less than the predetermined threshold, compute a weighted sum of the second signal and the third signal for each section by assigning, based on the voice presence probability, a higher weight to the third signal than to the second signal.

The device of claim 13, wherein the processor is configured to, in generating the first signal:
generate the first signal by performing a noise removal algorithm on the input sound signal in the time domain.

The device of claim 13, wherein the processor is configured to, in generating the first signal:
generate the first signal based on a deep learning model trained to receive a sound signal containing noise and to output a sound signal from which the noise has been removed.

delete