KR20230085772A - Pre-processing system of speech communication based on deep learning

Info

Publication number: KR20230085772A
Application number: KR1020210174265A
Authority: KR
Inventors: 김남수; 김정훈; 안성환; 김지환; 우범준
Original assignee: 서울대학교산학협력단
Priority date: 2021-12-07
Filing date: 2021-12-07
Publication date: 2023-06-14

Abstract

The present invention relates to a deep learning-based voice communication pre-processing system. The system comprises: a multi-channel voice pre-processing module that extracts spatial information about the space and voice information about the voice from multi-channel voice received through a multi-channel microphone; a voice distortion correction module that receives the spatial information and voice information extracted by the multi-channel voice pre-processing module, corrects the distortion of the voice from which the spatial information has been extracted, and outputs enhanced voice; a composite voice separation module that receives the distortion-corrected, enhanced composite voice from the voice distortion correction module and separates it into individual voices; and a voice codec module that compresses each individually separated voice, transmits it over a voice channel, and restores the compressed voice on receipt.
According to the proposed system, combining the multi-channel voice pre-processing module, the voice distortion correction module, the composite voice separation module, and the voice codec module not only restores distorted voice but also extracts spatial information from the voice received through the multi-channel microphone, separates the voices when two or more are present, and removes residual noise a second time through the codec. Clean voices can therefore be restored individually, and the system can be used in a variety of services built on voice communication.
In addition, the system extracts the spatial information of voice input from a multi-channel microphone and integrates restoration of distorted voice, separation of mixed voice, and codec-based restoration of individual voices into a single deep learning-based pre-processing model. Because the modules form one pipeline, all of them can be trained at once, which makes the model easy to maintain and update. Because every stage is handled by deep learning, the model obtains weights better suited to real-world distorted voice data than conventional formulas provide, and the interaction between internal modules can improve performance.

Description

Deep learning-based voice communication pre-processing system {PRE-PROCESSING SYSTEM OF SPEECH COMMUNICATION BASED ON DEEP LEARNING}

The present invention relates to a deep learning-based voice communication pre-processing system. More specifically, it relates to a system that not only restores distorted voice but also extracts spatial information from voice received through a multi-channel microphone, separates the voices when two or more are present, and removes residual noise a second time through a codec so that clean voices can be restored individually.

The Covid-19 pandemic that hit the world in the early 2020s caused an explosive increase in demand for non-face-to-face gatherings such as meetings and classes. Zoom, a leading provider of video conferencing solutions, is one of the companies that benefited most from the pandemic: according to data from the statistics firm Statista, its net profit increased roughly 30-fold over the period. Compared with conventional face-to-face gatherings, non-face-to-face gatherings have several advantages, which stand out especially in corporate meetings: freedom of location, savings in travel time, and convenient use of materials. Demand for non-face-to-face gatherings is therefore expected to continue even after the pandemic situation improves.

Non-face-to-face gatherings also have a drawback: because the participants are in different spaces, each participant's voice is distorted differently. Minimizing the speech distortion caused by each participant's environment is regarded as an important technology for successful remote gatherings and a better user experience. As such gatherings become common, the underlying technology of platforms that support remote meetings is becoming more important. In particular, because users are inconvenienced by poor voice quality, it is essential to develop technologies that can improve voice quality in various ways.

Voice enhancement is the task of estimating clean voice from voice mixed with noise. It is an important technology for many voice applications: in voice communication it improves intelligibility, and in speech recognition it serves as a pre-processing step. Early research relied mainly on statistical methods that estimate the noise in non-speech segments (segments containing only noise) and remove noise based on that estimate. The performance of these methods tends to degrade when the noise changes over time (non-stationary noise) or when the noise is heavily mixed in (low signal-to-noise ratio, SNR).

Recently, with the progress of deep learning, various deep learning techniques have been applied to voice enhancement. In deep learning-based voice enhancement, the model is trained with noisy voice as the input and the clean voice before noise was added as the target, which is a typical regression learning problem.

Existing voice pre-processing systems aim to produce clean voice by removing unwanted components from distorted single-channel voice. However, when voice is recorded with a multi-channel microphone during a video conference, the recording contains reverberation that reflects the characteristics of the room, and ambient noise is mixed in as well. In addition, the utterances of two speakers frequently overlap during meetings and conversations. When the distorted voice is sent through the communication system as-is and restored on the receiving side, the restored voice is often difficult to understand, which is a limitation of existing systems.

The present invention has been proposed to solve the above problems of previously proposed methods. Its object is to provide a deep learning-based voice communication pre-processing system comprising: a multi-channel voice pre-processing module that extracts spatial information about the space and voice information about the voice from multi-channel voice received through a multi-channel microphone; a voice distortion correction module that receives the extracted spatial information and voice information, corrects the distortion of the voice from which the spatial information has been extracted, and outputs enhanced voice; a composite voice separation module that receives the distortion-corrected, enhanced composite voice and separates it into individual voices; and a voice codec module that compresses each separated voice, transmits it over a voice channel, and restores the compressed voice on receipt. With this configuration, the system not only restores distorted voice but also extracts spatial information from the multi-channel input, separates the voices when two or more are present, and removes residual noise a second time through the codec, so that clean voices can be restored individually and used in a variety of services built on voice communication.

Another object of the present invention is to provide a deep learning-based voice communication pre-processing system that extracts the spatial information of voice input from a multi-channel microphone and integrates restoration of distorted voice, separation of mixed voice, and codec-based restoration of individual voices into a single deep learning-based model. Because the modules form one pipeline, all of them can be trained at once, which makes the model easy to maintain and update. Because every stage is handled by deep learning, the model obtains weights better suited to real-world distorted voice data than conventional formulas provide, and the interaction between internal modules can improve performance.

A deep learning-based voice communication pre-processing system according to a feature of the present invention for achieving the above objects is,

as a deep learning-based voice communication pre-processing system, characterized in that it comprises:

a multi-channel voice pre-processing module that extracts spatial information about the space and voice information about the voice from multi-channel voice received through a multi-channel microphone;

a voice distortion correction module that receives the spatial information and voice information extracted by the multi-channel voice pre-processing module, corrects the distortion of the voice from which the spatial information has been extracted, and outputs enhanced voice;

a composite voice separation module that receives the distortion-corrected, enhanced composite voice from the voice distortion correction module and separates the enhanced composite voice into individual voices; and

a voice codec module that compresses each voice individually separated by the composite voice separation module, transmits it over a voice channel, and restores the compressed voice on receipt.

Preferably, the multi-channel voice pre-processing module

may be configured as a network that gathers the information of the voices input on each channel, according to the arrangement of the microphone array, and extracts a usable spatial embedding.

Preferably, the voice distortion correction module

receives the spatial information and voice information extracted by the multi-channel voice pre-processing module, corrects the distortion of the voice from which the spatial information has been extracted, and outputs the enhanced voice to the composite voice separation module, and may further include a speaker count measurement module that measures the number of speakers for voice separation and outputs the measured number to the composite voice separation module.

Preferably, the composite voice separation module may comprise:

an encoder that receives the distortion-corrected, enhanced composite voice from the voice distortion correction module and extracts features of the voice;

a voice separation network that separates the voice features extracted by the encoder; and

a decoder that restores the voice features separated by the voice separation network back into voice.

More preferably, the voice codec module may comprise:

an encoder that compresses each voice individually separated by the composite voice separation module; and

a neural decoder that restores the compressed voice received over the voice channel.

Even more preferably, the encoder of the voice codec module

compresses each voice individually separated by the composite voice separation module, and may do so by encoding the voice so that the voice information is compressed into a bit stream.
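The idea of turning a separated voice into a bit stream and restoring it on the receiving side can be illustrated with a very small sketch. This is not the codec of the invention: the patent describes an encoder paired with a neural decoder, whereas the sketch below substitutes a plain uniform quantizer, and the function names `encode_to_bitstream` and `decode_from_bitstream` are hypothetical.

```python
from typing import List


def encode_to_bitstream(samples: List[float], bits: int = 8) -> bytes:
    """Uniformly quantize samples in [-1, 1] and store one code per byte.

    Assumption: a uniform quantizer stands in for the learned encoder.
    """
    if not 1 <= bits <= 8:
        raise ValueError("this sketch stores one code per byte, so bits must be 1..8")
    levels = (1 << bits) - 1
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))                 # clip to the valid range
        codes.append(round((x + 1.0) / 2.0 * levels))
    return bytes(codes)


def decode_from_bitstream(stream: bytes, bits: int = 8) -> List[float]:
    """Map each received code back to a sample value in [-1, 1]."""
    levels = (1 << bits) - 1
    return [code / levels * 2.0 - 1.0 for code in stream]
```

With 8 bits, each restored sample differs from the clipped input by at most one part in 255. A learned codec would aim for a far lower bit rate at comparable perceived quality.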

Still more preferably, the voice communication pre-processing system

may configure the multi-channel voice pre-processing module, the voice distortion correction module, the composite voice separation module, and the voice codec module as a single integrated deep learning-based model.

According to the deep learning-based voice communication pre-processing system proposed in the present invention, combining the multi-channel voice pre-processing module, the voice distortion correction module, the composite voice separation module, and the voice codec module described above not only restores distorted voice but also extracts spatial information from the voice received through the multi-channel microphone, separates the voices when two or more are present, and removes residual noise a second time through the codec. Clean voices can therefore be restored individually, and the system can be used in a variety of services built on voice communication.

In addition, the system extracts the spatial information of voice input from a multi-channel microphone and integrates restoration of distorted voice, separation of mixed voice, and codec-based restoration of individual voices into a single deep learning-based pre-processing model. Because the modules form one pipeline, all of them can be trained at once, which makes the model easy to maintain and update. Because every stage is handled by deep learning, the model obtains weights better suited to real-world distorted voice data than conventional formulas provide, and the interaction between internal modules can improve performance.

FIG. 1 is a functional block diagram showing the configuration of a deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 2 is a functional block diagram showing the configuration of the voice distortion correction module of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 3 is a functional block diagram showing the configuration of the composite voice separation module of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 4 is a functional block diagram showing the configuration of the voice codec module of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 5 is a diagram schematically showing the overall configuration process of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 6 is a diagram schematically showing the voice restoration process after voice communication in the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 7 is a diagram schematically showing the pre-communication compression pre-processing process of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.
FIG. 8 is a diagram schematically showing the post-communication restoration process of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art to which the present invention pertains can readily practice it. In describing the preferred embodiments, detailed descriptions of related known functions or configurations are omitted where they would unnecessarily obscure the gist of the present invention. The same reference numerals are used throughout the drawings for parts with similar functions and operations.

Throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Likewise, 'including' a component means that other components may be further included, not that they are excluded, unless specifically stated otherwise.

FIG. 1 is a functional block diagram showing the configuration of a deep learning-based voice communication pre-processing system according to an embodiment of the present invention, and FIGS. 2 to 4 are functional block diagrams showing the configurations of its voice distortion correction module, composite voice separation module, and voice codec module, respectively. As shown in FIGS. 1 to 4, the deep learning-based voice communication pre-processing system 100 according to an embodiment of the present invention may comprise a multi-channel voice pre-processing module 110 that extracts spatial information and voice information, a voice distortion correction module 120 that corrects voice distortion and outputs enhanced voice, a composite voice separation module 130 that separates the voice into individual voices, and a voice codec module 140 that compresses and transmits each separated voice and restores the compressed voice on receipt. The specific configuration of the system is described below with reference to the accompanying drawings.
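The data flow through modules 110 to 140 can be sketched as a chain of callables. This is only an illustration of the hand-offs described in this specification, under assumptions: the class name `PreprocessingPipeline`, its field names, and the type aliases are hypothetical, and each module is left as a pluggable function rather than a trained network.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Waveform = List[float]          # samples of one audio signal
MultiChannel = List[Waveform]   # one waveform per microphone
Embedding = List[float]         # spatial information (hypothetical representation)


@dataclass
class PreprocessingPipeline:
    # Module 110: multi-channel input -> (spatial information, voice information)
    preprocess: Callable[[MultiChannel], Tuple[Embedding, Waveform]]
    # Module 120: -> (enhanced composite voice, measured number of speakers)
    correct: Callable[[Embedding, Waveform], Tuple[Waveform, int]]
    # Module 130: enhanced composite voice + speaker count -> individual voices
    separate: Callable[[Waveform, int], List[Waveform]]
    # Module 140, sending side: one voice -> compressed bit stream
    encode: Callable[[Waveform], bytes]
    # Module 140, receiving side: bit stream -> restored voice
    decode: Callable[[bytes], Waveform]

    def send(self, mics: MultiChannel) -> List[bytes]:
        """Run pre-processing, correction, separation, then per-voice compression."""
        spatial, voice = self.preprocess(mics)
        enhanced, n_speakers = self.correct(spatial, voice)
        voices = self.separate(enhanced, n_speakers)
        return [self.encode(v) for v in voices]

    def receive(self, streams: List[bytes]) -> List[Waveform]:
        """Restore each received bit stream into an individual voice."""
        return [self.decode(s) for s in streams]
```

Keeping every stage behind one object mirrors the single-pipeline design the specification argues for. In a trained system the five callables would be differentiable networks optimized together.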

FIG. 5 is a diagram schematically showing the overall configuration process of the deep learning-based voice communication pre-processing system according to an embodiment of the present invention, FIG. 6 is a diagram schematically showing its voice restoration process after voice communication, FIG. 7 is a diagram schematically showing its pre-communication compression pre-processing process, and FIG. 8 is a diagram schematically showing its post-communication restoration process.

The multi-channel voice pre-processing module 110 extracts spatial information about the space and voice information about the voice from multi-channel voice received through a multi-channel microphone. The module 110 may be configured as a network that gathers the information of the voices input on each channel, according to the arrangement of the microphone array, and extracts a usable spatial embedding. Here, the module 110 may extract the spatial information and voice information by taking into account the time delay between the per-channel voice inputs of the multi-channel microphone.
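The inter-channel time delay mentioned above is a classical spatial cue, and a minimal sketch of how it can be measured may help. The patent does not say how module 110 uses the delay internally; the function below is a plain cross-correlation search, and the name `estimate_delay` and its parameters are hypothetical.

```python
from typing import List


def estimate_delay(reference: List[float], delayed: List[float], max_lag: int) -> int:
    """Return the lag in samples at which `delayed` best matches `reference`.

    A positive result means `delayed` arrives later than `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Cross-correlation at this lag, ignoring samples that fall outside the signal.
        score = 0.0
        for n, x in enumerate(reference):
            m = n + lag
            if 0 <= m < len(delayed):
                score += x * delayed[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

For a pair of microphones, the estimated lag divided by the sampling rate gives the time difference of arrival, which depends on the direction of the source relative to the array.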

The multi-channel voice pre-processing module 110 may also provide beamforming as a multi-channel pre-processing technique that improves the performance of the voice distortion correction module 120, the composite voice separation module 130, and the voice codec module 140 described below. Specifically, by using a U-Net-based generative model, voice enhancement with short computation time and high performance can be achieved.

In general, multi-channel voice contains information about the space and information about the voice at the same time. Beamforming is a technique for extracting the required information from multi-channel voice in a usable form. Conventionally, multi-channel voice pre-processing has been performed with beamforming techniques such as delay-and-sum, minimum variance distortionless response (MVDR), and generalized sidelobe canceller (GSC), which have been used for sound source localization, voice enhancement, and similar tasks.
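Of the conventional techniques listed above, delay-and-sum is the simplest and can be sketched in a few lines. The version below is an illustration under assumptions: delays are whole numbers of samples and are already known, and the function name `delay_and_sum` is hypothetical. It is not the learned pre-processing of the invention.

```python
from typing import List


def delay_and_sum(channels: List[List[float]], delays: List[int]) -> List[float]:
    """Align each channel by its integer delay in samples, then average.

    delays[c] is how many samples channel c lags the reference channel;
    the channel is advanced by that amount before summing.
    """
    if len(channels) != len(delays):
        raise ValueError("one delay per channel is required")
    length = len(channels[0])
    output = [0.0] * length
    for signal, delay in zip(channels, delays):
        for n in range(length):
            m = n + delay
            if 0 <= m < len(signal):      # samples shifted past the end are dropped
                output[n] += signal[m]
    return [value / len(channels) for value in output]
```

After alignment the target voice adds coherently across channels, while noise that is uncorrelated between microphones is averaged down.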

With the recent progress of deep learning, there have been several attempts to apply machine learning to multi-channel voice beamforming. A representative example is the neural beamformer, which uses deep learning to estimate the masks used in conventional techniques such as MVDR and GSC. This approach showed performance similar to or slightly better than conventional statistics-based techniques, but it increased the computation time needed for beamforming. More recent research addresses end-to-end methods that take multi-channel voice directly as input, without extracting voice features separately, to perform multi-channel voice enhancement, direction estimation, and similar tasks.

The voice distortion correction module 120 receives the spatial information and voice information extracted by the multi-channel voice pre-processing module 110, corrects the distortion of the voice from which the spatial information has been extracted, and outputs enhanced voice. As shown in FIG. 2, the voice distortion correction module 120 outputs the enhanced voice to the composite voice separation module 130, and may further include a speaker count measurement module 121 that measures the number of speakers for voice separation and outputs the measured number to the composite voice separation module 130. Here, the voice distortion correction module 120 corrects the distortion of a given voice by removing the noise and reverberation that lower its intelligibility. Even after voice distortion has been corrected using the spatial information, a mixture of two or more voices remains a mixture, and this still impairs intelligibility. To solve this, the enhanced voice is passed to the composite voice separation module 130, an end-to-end voice separation module, for further processing.

In addition, the voice distortion correction module 120 is configured end-to-end and is trained with an L2 loss between the clean voice and the voice predicted by the model. If the predicted voice is a mixture, the number of voices in it is determined and passed to the composite voice separation module 130, which separates the individual voices. The module that measures the number of speakers, however, may be trained separately and then incorporated into the module.
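The L2 loss mentioned above can be sketched as follows. This is a minimal illustration rather than the patent's training code: the text does not say whether the loss is taken on waveforms or spectrograms, or whether it is summed or averaged, so the sketch assumes a mean squared error over waveform samples. The name `l2_loss` and the sample values are hypothetical.

```python
from typing import Sequence


def l2_loss(clean: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean squared error between a clean waveform and a predicted one.

    Assumption: the L2 loss is averaged over samples; a summed variant
    differs only by a constant factor.
    """
    if len(clean) != len(predicted) or len(clean) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    return sum((c - p) ** 2 for c, p in zip(clean, predicted)) / len(clean)


clean = [0.0, 0.5, -0.5, 1.0]
predicted = [0.0, 0.5, 0.0, 0.5]
loss = l2_loss(clean, predicted)
```

During training, the model parameters would be updated to reduce this value, which reaches zero only when the predicted voice matches the clean voice sample for sample.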

As for voice distortion correction, when uttered voice travels to a microphone array it is mixed with various kinds of noise and, depending on the space in which the speaker is located, is reflected off obstacles such as walls, producing reverberation. This noise and reverberation distort the voice and are the main cause of reduced clarity. Much research is under way to remove such distortion and obtain clean voice, and various deep learning-based models have recently been proposed. The most widely used voice distortion correction technique is the U-Net structure, which restores the voice waveform, or the spectrogram obtained from it by the short-time Fourier transform, through a connected encoder and decoder.

The composite voice separation module 130 receives, from the voice distortion correction module 120, the enhanced composite voice in which voice distortion has been corrected, and separates it into individual voices. As shown in FIG. 3, the composite voice separation module 130 may include an encoder 131 that receives the enhanced composite voice from the voice distortion correction module 120 and extracts voice features, a voice separation network 132 that separates the voice features extracted by the encoder 131, and a decoder 133 that restores the voice features separated by the voice separation network 132 back into voice. Here, the composite voice separation module 130 may be trained to maximize the SI-SNR of the individual voices.
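For reference, the SI-SNR (scale-invariant signal-to-noise ratio) objective can be sketched in Python as below. The patent names the metric but does not give a formula, so this follows the commonly used definition: both signals are made zero-mean, the estimate is projected onto the target, and the energy of the projection is compared with the energy of the residual. The function name `si_snr` and the stabilizing constant `eps` are assumptions for the example.

```python
import math
from typing import List, Sequence


def si_snr(estimate: Sequence[float], target: Sequence[float],
           eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a target voice."""
    if len(estimate) != len(target) or len(target) == 0:
        raise ValueError("signals must be non-empty and of equal length")
    # Remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e: List[float] = [x - mean_e for x in estimate]
    t: List[float] = [x - mean_t for x in target]
    # Project the estimate onto the target: the part that "is" the target.
    scale = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [scale * b for b in t]
    # Whatever is left over is treated as noise.
    e_noise = [a - b for a, b in zip(e, s_target)]
    signal_energy = sum(x * x for x in s_target)
    noise_energy = sum(x * x for x in e_noise) + eps
    return 10.0 * math.log10(signal_energy / noise_energy + eps)


target = [1.0, -1.0, 1.0, -1.0]
rough = si_snr([1.0, -1.0, 1.0, 0.0], target)
close = si_snr([1.0, -1.0, 1.0, -0.9], target)
louder = si_snr([2.0, -2.0, 2.0, 0.0], target)  # rough estimate, doubled
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate leaves the score essentially unchanged, so the network is not rewarded or penalized for output volume. A training loss would typically be the negative of this value, averaged over the separated speakers.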

In addition, the composite voice separation module 130 uses end-to-end voice separation, which is the task of separating two or more voices that arrive mixed together. Existing voice separation systems converted the mixed voice into a mel spectrogram and then separated the voices by finding the fundamental frequency of each speaker's voice. Because human voices occupy similar frequency bands, there is a limit to how well voices can be separated analytically. With the recent development of deep learning, end-to-end voice separation has shown good separation performance.

The voice codec module 140 compresses each of the voices individually separated by the composite voice separation module 130, transmits them through a voice channel, and restores the received compressed voice. As shown in FIG. 4, the voice codec module 140 may include an encoder 141 that compresses each of the voices individually separated by the composite voice separation module 130, and a neural decoder 142 that restores the compressed voice received through the voice channel. Here, the encoder 141 of the voice codec module 140 may encode each separated voice and compress the voice information into a bit stream. That is, the voice codec module 140 compresses the voice information into a bit stream, the bit stream is transmitted over the communication link and received at the receiving end, and noise in the voice can be removed once more in the process of reconstructing clean voice.

In addition, the voice codec module 140 is trained with an L2 loss, so that distortion components introduced by the communication, and those not removed in the earlier stages, are removed in a second stage, allowing clean voice to be heard.

The voice communication process of the voice codec module 140 is as follows. First, the voice is encoded into a bit stream and transmitted. When the receiving end then receives the bit stream sent through the communication channel, it restores the voice by decoding. Because the amount of information a communication channel can carry is limited, the voice should be compressed as much as possible during encoding while still being restored as close to the original as possible during decoding. In conventional voice codecs, the encoder extracts hand-designed voice features and the decoder restores the voice analytically from those features, but because the design is done by hand there has been a limit to developing better-performing codecs. Recently, therefore, deep learning models that learn a better encoder or decoder on their own have been actively studied, and cases of performance superior to traditional codecs have been reported.
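The encode, transmit, decode flow described above can be illustrated with a deliberately simple hand-designed codec. The sketch below is not the neural codec of the invention: it is a uniform scalar quantizer chosen only to show how fewer bits per sample trade off against reconstruction accuracy. The function names, the one-byte-per-sample layout (codes are not bit-packed), and the sample values are assumptions for the example.

```python
from typing import List, Sequence


def encode(samples: Sequence[float], bits: int = 8) -> bytes:
    """Quantize samples in [-1, 1] to `bits`-bit codes, one byte per code."""
    if not 1 <= bits <= 8:
        raise ValueError("bits must be between 1 and 8")
    levels = (1 << bits) - 1  # highest code value
    codes = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the representable range
        codes.append(round((x + 1.0) / 2.0 * levels))
    return bytes(codes)


def decode(stream: bytes, bits: int = 8) -> List[float]:
    """Map each received code back to a sample value in [-1, 1]."""
    levels = (1 << bits) - 1
    return [code / levels * 2.0 - 1.0 for code in stream]


def max_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest absolute sample difference between two signals."""
    return max(abs(x - y) for x, y in zip(a, b))


voice = [0.0, 0.3, -0.7, 1.0, -1.0]
restored_8 = decode(encode(voice, bits=8), bits=8)  # finer quantization
restored_4 = decode(encode(voice, bits=4), bits=4)  # coarser quantization
err_8 = max_error(voice, restored_8)
err_4 = max_error(voice, restored_4)
```

With this quantizer the reconstruction error is bounded by half a quantization step, so halving the bit depth enlarges the worst-case error. A learned encoder and decoder aim to achieve a better trade-off than such a fixed rule at the same bit rate.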

The voice communication pre-processing system 100, which consists of voice communication pre-processing technology that corrects voice distortion caused by various factors in addition to noise, may thus configure the multi-channel voice pre-processing module 110, the voice distortion correction module 120, the composite voice separation module 130, and the voice codec module 140 as a single deep learning-based integrated model. Because the voice communication pre-processing system 100 uses a deep learning model, it can readily be used on smartphones, computers, and other devices capable of handling a minimum amount of computation. It can also be implemented as a deep learning-based voice communication pre-processing system that integrates extraction of spatial information from a multi-channel microphone, restoration of distorted voice, separation of mixed voice, and restoration of individual voices through a codec.

In addition, because the system is implemented as a single integrated system and processed as a single pipeline, all modules can be trained at once, which makes the model very easy to maintain and update. Because every step is handled by deep learning, the system obtains weights better suited to real-world distorted voice data than conventional formulas, and interaction between the internal modules can improve performance. The system can be provided and used as the most suitable distorted voice restoration solution for non-face-to-face meeting environments.

In addition, the voice communication pre-processing system 100 of the present invention not only transmits and receives voice efficiently but also effectively removes distortion from the input voice, which gives it an advantage over other voice communication systems, and it can be used as a direct replacement for the traditional voice codecs used to date.

As described above, the deep learning-based voice communication pre-processing system according to an embodiment of the present invention includes a multi-channel voice pre-processing module that extracts spatial information and voice information from multi-channel voice received through a multi-channel microphone; a voice distortion correction module that receives the spatial information and voice information extracted by the multi-channel voice pre-processing module, corrects the distortion of the voice from which the spatial information has been extracted, and outputs enhanced voice; a composite voice separation module that receives the enhanced composite voice in which voice distortion has been corrected from the voice distortion correction module and separates it into individual voices; and a voice codec module that compresses each of the voices individually separated by the composite voice separation module, transmits them through a voice channel, and restores the received compressed voice. With this configuration, the system not only restores distorted voice but also extracts spatial information from the voice received through the multi-channel microphone, separates the voices when two or more are present, and removes unnecessary noise a second time through the codec, so that clean voice can be restored individually and used in various services applied to voice communication. In particular, the system can be implemented as a deep learning-based voice communication pre-processing model that integrates extraction of spatial information from the voice received through the multi-channel microphone, restoration of distorted voice, separation of mixed voice, and restoration of individual voices through a codec. Because it is proposed as a single integrated system and processed as a single pipeline, all modules can be trained at once, which makes the model very easy to maintain and update. Because every step is handled by deep learning, the system obtains weights better suited to real-world distorted voice data than conventional formulas, and interaction between the internal modules makes performance improvement possible.

The present invention described above may be variously modified or applied by those of ordinary skill in the art to which it belongs, and the scope of the technical idea of the present invention should be determined by the claims below.

100: voice communication pre-processing system according to an embodiment of the present invention
110: multi-channel voice pre-processing module
120: voice distortion correction module
121: speaker count measurement module
130: composite voice separation module
131: encoder
132: voice separation network (neural network)
133: decoder
140: voice codec module
141: encoder
142: neural decoder

Claims

A deep learning-based voice communication pre-processing system 100, comprising:
a multi-channel voice pre-processing module 110 that extracts spatial information about the space and voice information about the voice from multi-channel voice received through a multi-channel microphone;
a voice distortion correction module 120 that receives the spatial information and voice information extracted by the multi-channel voice pre-processing module 110, corrects distortion of the voice from which the spatial information has been extracted, and outputs enhanced voice;
a composite voice separation module 130 that receives, from the voice distortion correction module 120, the enhanced composite voice in which voice distortion has been corrected, and separates the enhanced composite voice into individual voices; and
a voice codec module 140 that compresses each of the voices individually separated by the composite voice separation module 130, transmits them through a voice channel, and restores the received compressed voice.

The system of claim 1, wherein the multi-channel voice pre-processing module 110
consists of a network that aggregates the per-channel voice information of the voice received through the multi-channel microphone, according to the microphone arrangement, and extracts a usable spatial embedding.

The system of claim 1, wherein the voice distortion correction module 120
receives the spatial information and voice information extracted by the multi-channel voice pre-processing module 110, corrects distortion of the voice from which the spatial information has been extracted, and outputs the enhanced voice to the composite voice separation module 130, and further comprises a speaker count measurement module 121 that measures the number of speakers for voice separation and outputs the measured number of speakers to the composite voice separation module 130.

The system of any one of claims 1 to 3, wherein the composite voice separation module 130 comprises:
an encoder 131 that receives, from the voice distortion correction module 120, the enhanced composite voice in which voice distortion has been corrected, and extracts voice features;
a voice separation network 132 that separates the voice features extracted by the encoder 131; and
a decoder 133 that restores the voice features separated by the voice separation network 132 back into voice.

The system of claim 4, wherein the voice codec module 140 comprises:
an encoder 141 that compresses each of the voices individually separated by the composite voice separation module 130; and
a neural decoder 142 that restores the compressed voice received through the voice channel.

The system of claim 5, wherein the encoder 141 of the voice codec module 140
compresses each of the voices individually separated by the composite voice separation module 130 by encoding the voice and compressing the voice information into a bit stream.

The system of claim 6, wherein the voice communication pre-processing system 100
configures the multi-channel voice pre-processing module 110, the voice distortion correction module 120, the composite voice separation module 130, and the voice codec module 140 as a single deep learning-based integrated model.