KR102628500B1

KR102628500B1 - Apparatus for face-to-face recording and method for using the same

Info

Publication number: KR102628500B1
Application number: KR1020210129163A
Authority: KR
Inventors: 김종엽; 정재훈
Original assignee: 주식회사 케이티
Priority date: 2021-09-29
Filing date: 2021-09-29
Publication date: 2024-01-24
Also published as: KR20230046095A

Abstract

The present invention relates to a face-to-face recording terminal device and a face-to-face recording method using the same. A face-to-face recording terminal device according to an embodiment of the present invention is placed between two speakers and may include: a first microphone array that faces the first speaker and generates a first microphone signal; a second microphone array that faces the second speaker and generates a second microphone signal; a speaker that outputs a preset audio signal; and a multi-channel signal processing unit that separates, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker, a second voice signal corresponding to the voice of the second speaker, and a third voice signal corresponding to the audio signal output from the speaker.

Description

Face-to-face recording terminal device and face-to-face recording method using the same {Apparatus for face-to-face recording and method for using the same}

The present invention relates to a face-to-face recording terminal device capable of separating the voice of each speaker during recording, and to a face-to-face recording method using the same.

Recently, incomplete sales of financial products, such as selling a risky financial product without fully explaining the relevant information, have become a problem. To prevent this, it has been proposed to record the consultation between a counselor and a customer, convert the recording into text, and monitor it. For accurate monitoring, however, the multiple speakers contained in the recording need to be separated.

Conventionally, multiple speakers have been separated by training a speaker model in advance and enrolling the voice of a specific speaker. Specifically, feature extraction methods such as cepstrum, LPC (Linear Predictive Coding), MFCC (Mel-Frequency Cepstral Coefficients), and PLP (Perceptual Linear Predictive analysis) were used to extract the voice features of a specific speaker, and speakers were distinguished by comparing those features.

However, the conventional method requires speakers to be enrolled in advance and separates speakers based on the voice characteristics of the enrolled speakers. As a result, the voices of customers, who change every day, cannot be enrolled and learned in advance, and it is also difficult to separate and process the voices of the counselor and the customer in real time.

Korean Registered Patent Publication No. 10-1750338

An object of the present invention is to provide a face-to-face recording terminal device capable of accurately separating the voice of each speaker contained in a recording from the audio signal played through a speaker, and a face-to-face recording method using the same.

Another object of the present invention is to provide a face-to-face recording terminal device capable of accurately separating the voice of each speaker contained in a recording from the audio signal played through a speaker, and thereby generating STT (Speech-to-Text) data for determining whether an incomplete sale has occurred, and a face-to-face recording method using the same.

Another object of the present invention is to provide a face-to-face recording terminal device capable of identifying the voice of each speaker contained in a recording by adjusting characteristic parameters of the microphone signals in combination with beamforming, and a face-to-face recording method using the same.

Another object of the present invention is to provide a face-to-face recording terminal device capable of effectively removing noise contained in a recording, and a face-to-face recording method using the same.

A face-to-face recording terminal device according to an embodiment of the present invention is placed between two speakers and may include: a first microphone array that faces the first speaker and generates a first microphone signal; a second microphone array that faces the second speaker and generates a second microphone signal; a speaker that outputs a preset audio signal; and a multi-channel signal processing unit that separates, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker, a second voice signal corresponding to the voice of the second speaker, and a third voice signal corresponding to the audio signal output from the speaker.

Here, the multi-channel signal processing unit may apply DOA (Direction of Arrival) masking to the first microphone signal based on a set angle determined according to the position of the first speaker, and may preprocess the DOA-masked first microphone signal to extract the first voice signal from the first microphone signal.

Here, the multi-channel signal processing unit may preprocess the first microphone signal using at least one of DRC (Dynamic Range Control), EQ (Audio Equalizer), NR (Noise Reduction), and AGC (Automatic Gain Control).

Here, the multi-channel signal processing unit may use the DRC to attenuate those components of the DOA-masked first microphone signal whose magnitude is below a threshold and amplify those whose magnitude is at or above the threshold, thereby separating the first voice signal from noise.

Here, the multi-channel signal processing unit may limit the input gain of the received first microphone signal or second microphone signal to less than a limit value.

Here, the multi-channel signal processing unit may apply VAD (Voice Activity Detection) to the first microphone signal or the second microphone signal to remove non-speech noise contained in the first microphone signal.

Here, the first microphone array and the second microphone array may be provided inside an internal sealing part that includes a sound insulation member, which blocks the audio signal output from the speaker from reaching the arrays through the interior of the face-to-face recording terminal device.

Here, the multi-channel signal processing unit may apply VAD to the first microphone signal to identify the audio signal output from the speaker and remove it from the first microphone signal.

Here, the multi-channel signal processing unit may remove the audio signal from the second microphone signal by applying AEC (Acoustic Echo Cancellation) with the audio signal as the echo reference signal.
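As an illustration only, echo cancellation with a known echo reference can be sketched with a normalized LMS (NLMS) adaptive filter. The specification does not state which adaptive algorithm is used, so NLMS is an assumption here, and the function name `nlms_echo_cancel` and the parameters `taps`, `step`, and `eps` are hypothetical.

```python
def nlms_echo_cancel(mic, reference, taps=8, step=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of `reference` from `mic`.

    mic       -- microphone samples (talker voice plus loudspeaker echo)
    reference -- samples sent to the loudspeaker (the echo reference)
    Returns the residual signal after the estimated echo is removed.
    NLMS is an assumed choice; the source only names AEC in general.
    """
    weights = [0.0] * taps          # estimated echo path (FIR filter)
    history = [0.0] * taps          # most recent reference samples, newest first
    residual = []
    for m, r in zip(mic, reference):
        history = [r] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        err = m - echo_estimate     # what remains after removing the echo
        norm = sum(x * x for x in history) + eps
        # Normalized LMS update: step size scaled by reference energy.
        weights = [w + step * err * x / norm for w, x in zip(weights, history)]
        residual.append(err)
    return residual
```

With a microphone signal that contains only a delayed, scaled copy of the reference, the residual should decay toward zero as the filter converges; any remaining signal then corresponds to sound that did not come from the speaker.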

Here, the multi-channel signal processing unit may transmit the echo reference to a server so that STT (Speech-to-Text) is performed on it.

A face-to-face recording method according to an embodiment of the present invention uses a face-to-face recording terminal device including a first microphone array, a second microphone array, and a speaker, and may include: a step in which the first microphone array facing the first speaker generates a first microphone signal, the second microphone array facing the second speaker generates a second microphone signal, and the speaker outputs a preset audio signal; and a step of separating, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker, a second voice signal corresponding to the voice of the second speaker, and a third voice signal corresponding to the audio signal output from the speaker.

The means for solving the problem described above do not enumerate all features of the present invention. The various features of the present invention, together with their advantages and effects, can be understood in more detail with reference to the specific embodiments below.

According to the face-to-face recording terminal device and the face-to-face recording method using the same according to an embodiment of the present invention, the voice of each speaker contained in a recording can be accurately separated from the audio signal played through the speaker.

According to the face-to-face recording terminal device and the face-to-face recording method using the same according to an embodiment of the present invention, noise contained in a recording can be effectively removed.

According to the face-to-face recording terminal device and the face-to-face recording method using the same according to an embodiment of the present invention, the voice of each speaker contained in a recording can be accurately separated from the audio signal played through the speaker to generate STT (Speech-to-Text) data for determining whether an incomplete sale has occurred, and this data can be used to monitor for incomplete sales.

FIG. 1 is a schematic diagram showing a face-to-face recording system according to an embodiment of the present invention.
FIG. 2 is a block diagram showing a face-to-face recording terminal device according to an embodiment of the present invention.
FIG. 3 is a schematic diagram showing a face-to-face recording terminal device according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram showing channel separation performance when each speaker speaks, using a face-to-face recording terminal device according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram showing channel separation performance when the loudspeaker and a speaker produce sound simultaneously, using a face-to-face recording terminal device according to an embodiment of the present invention.
FIG. 6 is an exemplary diagram showing channel separation performance when the speakers speak simultaneously, using a face-to-face recording terminal device according to an embodiment of the present invention.
FIG. 7 is a flowchart showing a face-to-face recording method using a face-to-face recording terminal device according to an embodiment of the present invention.

Hereinafter, preferred embodiments are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice it. In describing the preferred embodiments, detailed descriptions of related known functions or configurations are omitted where they could unnecessarily obscure the gist of the present invention. The same reference numerals are used throughout the drawings for parts that perform similar functions and operations.

In addition, throughout the specification, when a part is said to be 'connected' to another part, this includes not only the case where it is 'directly connected' but also the case where it is 'indirectly connected' with another element in between. Further, 'including' a component means that other components may additionally be included, not that they are excluded, unless specifically stated otherwise. Terms such as "unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

FIG. 1 is a schematic diagram showing a face-to-face recording system according to an embodiment of the present invention.

Referring to FIG. 1, a face-to-face recording system according to an embodiment of the present invention may include a face-to-face recording terminal device 100 and a server 200.

Hereinafter, a face-to-face recording system according to an embodiment of the present invention is described with reference to FIG. 1.

The face-to-face recording terminal device 100 may be placed between two speakers 1 and 2, and may record the conversation between the speakers 1 and 2 with the microphone arrays 110 and 120 or output a preset audio signal or the like through the speaker 130. The face-to-face recording terminal device 100 may be used during consultations between a customer and a counselor at a bank, securities firm, insurance company, or the like, and the recorded consultation may be used to prevent incomplete sales of financial products. That is, the face-to-face recording terminal device 100 may output the guidance that is essential for a complete sale of a financial product through the speaker 130 to the customer, and may provide a recording of the consultation so that it can be confirmed that this guidance was given to the customer.

The server 200 may receive a recording from the face-to-face recording terminal device 100 and perform STT (Speech-to-Text) on the recording. The server 200 may then monitor whether a complete sale took place based on the text generated through STT, and thereby automatically verify whether a financial product was incompletely sold.

However, when the server 200 converts a recording into text to check for an incomplete sale, it is difficult to perform accurate STT unless the voice of each speaker in the recording has been separated. In particular, when guidance on a financial product is output through the speaker 130, the content output from the speaker 130 may also be picked up by each of the microphone arrays 110 and 120, which can make it even harder to distinguish the speakers.

In contrast, the face-to-face recording terminal device 100 according to an embodiment of the present invention can perform accurate speaker separation and noise removal through beamforming and preprocessing for speaker separation, and can therefore provide recordings suitable for high-quality STT recognition. In addition, the face-to-face recording terminal device 100 according to an embodiment of the present invention can provide a noise-canceling function that records only the voices of the counselor and the customer and the terms-and-conditions explanation TTS (Text-to-Speech) output from the speaker 130 while removing the various surrounding noises, and it can distinguish the voices of the counselor and the customer in real time while recording. Hereinafter, a face-to-face recording terminal device 100 according to an embodiment of the present invention is described.

FIG. 2 is a block diagram, and FIG. 3 is a schematic diagram, showing a face-to-face recording terminal device according to an embodiment of the present invention. Referring to FIGS. 2 and 3, the face-to-face recording terminal device 100 according to an embodiment of the present invention may include a first microphone array 110, a second microphone array 120, a speaker 130, and a multi-channel signal processing unit 140.

The first microphone array 110 may be installed facing the first speaker 1, and may receive the voice uttered by the first speaker 1 and generate the first microphone signal. The first microphone array 110 may contain a plurality of microphones for beamforming; depending on the embodiment, as shown in FIG. 2, it may contain two microphones, a left first microphone 110a and a right first microphone 110b. However, the invention is not limited to this, and the number and arrangement of microphones in the first microphone array 110 may vary depending on the embodiment. Here, the first speaker 1 may be, for example, a counselor selling a financial product.

The second microphone array 120 may be installed facing the second speaker 2, and may receive the voice uttered by the second speaker 2 and generate the second microphone signal. The second microphone array 120 may contain a plurality of microphones for beamforming; depending on the embodiment, as shown in FIG. 2, it may contain two microphones, a left second microphone 120a and a right second microphone 120b. However, the invention is not limited to this, and the number and arrangement of microphones in the second microphone array 120 may vary depending on the embodiment. Here, the second speaker 2 may be, for example, a customer purchasing a financial product.

Here, when the first microphone array 110 and the second microphone array 120 are arranged as shown in FIG. 2, the face-to-face recording terminal device 100 may be positioned to the side of the first speaker 1 and the second speaker 2 to perform face-to-face recording.

The speaker 130 may output a preset audio signal. The speaker 130 may be installed on the side of the second speaker 2, who is the customer, and the audio signal output from the speaker 130 may contain guidance that is essential for a complete sale, such as an explanation of the terms and conditions of the financial product the counselor intends to sell. Here, the audio signal may be a TTS signal obtained by converting text, such as the explanation of the terms and conditions of the financial product, into speech.

Meanwhile, depending on the embodiment, the first microphone array 110 and the second microphone array 120 may each be provided inside an internal sealing part (not shown). The internal sealing part (not shown) may include a sound insulation member, which blocks the audio signal output from the speaker 130 from reaching the first microphone array 110 and the second microphone array 120 through the interior of the face-to-face recording terminal device 100.

The multi-channel signal processing unit 140 may separate, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker 1, a second voice signal corresponding to the voice of the second speaker 2, and a third voice signal corresponding to the audio signal output from the speaker 130. Specifically, the phase or amplitude differences measured between the plurality of first microphones 110a and 110b can be used to obtain the direction of arrival (DOA) of each acoustic signal entering the first microphone array 110, and the multi-channel signal processing unit 140 may be implemented to accept only acoustic signals arriving at a preset angle of incidence corresponding to the position of the first speaker 1. In this case, the DOA of the first microphone array 110 may be designed in advance by assuming the position of the first speaker 1, and the DOA of the second microphone array 120 may be set in the same manner.
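As a rough illustration of how an angle of arrival can be obtained from a two-microphone pair, the sketch below estimates the inter-microphone delay by cross-correlation and converts it to an angle with the far-field relation sin(θ) = c·τ/d. The specification does not disclose the estimator actually used; the function names, the speed-of-sound constant, and the far-field assumption are all illustrative.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def estimate_delay_samples(left, right, max_lag):
    """Return the lag (in samples) by which `right` trails `left`.

    A positive result means the sound reached the left microphone first.
    Plain time-domain cross-correlation is an assumed, simple estimator.
    """
    best_lag, best_score = 0, float("-inf")
    n = len(left)
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += left[i] * right[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def doa_degrees(delay_samples, sample_rate, mic_spacing):
    """Convert a delay to an angle from broadside (far-field assumption).

    Positive angles point toward the left microphone under the sign
    convention of estimate_delay_samples.
    """
    tau = delay_samples / sample_rate
    s = SPEED_OF_SOUND * tau / mic_spacing
    s = max(-1.0, min(1.0, s))  # clamp against rounding and noise
    return math.degrees(math.asin(s))
```

In practice the delay would be estimated per short frame (or per frequency band from phase differences), giving one DOA value per frame that later stages can compare against the preset angle.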

The multi-channel signal processing unit 140 may then use each DOA to separate the first voice signal and the second voice signal corresponding to the voices of the first speaker 1 and the second speaker 2, respectively. That is, DOA masking may be applied to the first microphone signal and the second microphone signal based on the set angles (α1, α2) and set angle ranges (β1, β2) determined according to the positions of the first speaker 1 and the second speaker 2. As a result, among the various acoustic signals entering the first microphone array 110 and the second microphone array 120, all acoustic signals other than those arriving within the set angles (α1, α2) and set angle ranges (β1, β2) can be treated as noise.
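A minimal sketch of DOA masking on a per-frame basis follows. It assumes that a DOA estimate is already available for each frame, that the accepted window is the set angle α plus or minus half the set angle range β, and that rejected frames are simply zeroed. None of these details are stated in the specification, and the name `doa_mask` is hypothetical.

```python
def doa_mask(frames, doas_deg, set_angle_deg, angle_range_deg):
    """Keep frames whose DOA lies within set_angle +/- angle_range / 2.

    frames          -- list of sample lists, one per analysis frame
    doas_deg        -- estimated direction of arrival per frame, in degrees
    set_angle_deg   -- preset angle toward the expected talker (alpha)
    angle_range_deg -- total width of the accepted window (beta, assumed)
    Frames outside the window are treated as noise and zeroed (assumed).
    """
    if len(frames) != len(doas_deg):
        raise ValueError("one DOA estimate is required per frame")
    half_width = angle_range_deg / 2.0
    masked = []
    for frame, doa in zip(frames, doas_deg):
        inside = abs(doa - set_angle_deg) <= half_width
        masked.append(list(frame) if inside else [0.0] * len(frame))
    return masked
```

For example, with a set angle of 0 degrees and a range of 30 degrees, a frame arriving from 25 degrees is rejected while frames arriving from 0 or -10 degrees pass through unchanged.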

After DOA masking, the multi-channel signal processing unit 140 may preprocess the DOA-masked first microphone signal and thereby extract the first voice signal from the first microphone signal. For this preprocessing, the multi-channel signal processing unit 140 may use DRC (Dynamic Range Control), EQ (Audio Equalizer), NR (Noise Reduction), AGC (Automatic Gain Control), and the like. That is, because it may be difficult to extract only the voice of the first speaker 1 with DOA masking alone, additional preprocessing for speaker separation may be applied to the first microphone signal to extract the first voice signal. The second microphone signal may be preprocessed in the same manner. In this way, the first microphone array 110 and the second microphone array 120 can each capture only the voice of the first speaker 1 or the second speaker 2 speaking directly in front of it and generate the first voice signal and the second voice signal, respectively. This minimizes the situation in which the voices of the customer and the counselor are mixed and cannot be separated.

Here, the multi-channel signal processing unit 140 may set parameters for optimal speaker-separation preprocessing.

First, the multi-channel signal processing unit 140 may set the set angle and the set angle range for DOA masking. These may be set with reference to the reverberation of the indoor environment in which the face-to-face recording terminal device 100 is installed. Reverberation differs from one indoor environment to another, and the speaker separation achieved by DOA masking can vary with it. Therefore, the reverberation of each indoor environment may be measured first, and the set angle and set angle range for DOA masking may then be set accordingly. Depending on the embodiment, reverberation may be measured using RT60 (reverberation time).
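One common way to obtain RT60 from a measured room impulse response is Schroeder backward integration followed by extrapolation of the decay slope. The sketch below uses the -5 dB to -25 dB portion of the decay curve (a T20-style estimate) as an assumption; the specification does not say how RT60 is measured or how it maps to the set angle and set angle range, and the function name is hypothetical.

```python
import math


def rt60_from_impulse_response(h, sample_rate):
    """Estimate RT60 in seconds from a room impulse response `h`.

    Uses Schroeder backward integration of the squared response and
    extrapolates the -5 dB to -25 dB decay to 60 dB (assumed method).
    """
    energy = [x * x for x in h]
    # Energy decay curve: energy remaining from each sample to the end.
    edc = [0.0] * len(energy)
    running = 0.0
    for i in range(len(energy) - 1, -1, -1):
        running += energy[i]
        edc[i] = running
    total = edc[0] if edc else 0.0
    if total <= 0.0:
        raise ValueError("impulse response has no energy")

    def first_index_at_or_below(level_db):
        for i, e in enumerate(edc):
            if e <= 0.0 or 10.0 * math.log10(e / total) <= level_db:
                return i
        raise ValueError("decay curve never reaches %.1f dB" % level_db)

    i5 = first_index_at_or_below(-5.0)
    i25 = first_index_at_or_below(-25.0)
    # A 20 dB drop took (i25 - i5) samples; scale by 3 to reach 60 dB.
    return 3.0 * (i25 - i5) / sample_rate
```

A longer RT60 indicates a more reverberant room, in which reflections arrive from many directions; how the set angle and set angle range should be adjusted in response is left to the implementer.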

Next, the multi-channel signal processing unit 140 may configure the DRC. When DOA masking is applied to the first microphone signal, the voice signal of the first speaker is generally picked up at a relatively high level, while the remaining ambient noise is picked up at a low level. Therefore, when DOA masking alone does not separate the speakers sufficiently, the DRC may be used to attenuate those components of the DOA-masked first microphone signal whose magnitude is below a threshold and amplify those whose magnitude is at or above the threshold. In this way, the first voice signal and the noise can be separated by a sufficient margin (for example, 10 dB or more). The amplification can later be compensated by reducing the output gain at the final stage of preprocessing.
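The threshold behavior described above can be sketched as a frame-level gain rule: frames whose RMS level is at or above the threshold are boosted and the rest are cut, which widens the level gap between voice and residual noise. The frame-based RMS detector, the instantaneous gain switching, and the default values of `boost_db` and `cut_db` are illustrative assumptions, not values from the specification.

```python
import math


def frame_rms(frame):
    """Root-mean-square level of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def apply_drc(frames, threshold, boost_db=6.0, cut_db=12.0):
    """Amplify frames at or above `threshold`, attenuate frames below it.

    threshold -- linear RMS level separating voice from residual noise
    boost_db  -- gain applied to loud frames (hypothetical default)
    cut_db    -- attenuation applied to quiet frames (hypothetical default)
    Gain switches instantly per frame; no attack/release smoothing.
    """
    boost = 10.0 ** (boost_db / 20.0)
    cut = 10.0 ** (-cut_db / 20.0)
    processed = []
    for frame in frames:
        gain = boost if frame_rms(frame) >= threshold else cut
        processed.append([x * gain for x in frame])
    return processed
```

With these defaults the gap between a loud frame and a quiet frame grows by 18 dB (6 dB boost plus 12 dB cut), which would exceed the 10 dB margin mentioned as an example; a final output gain stage can then bring the boosted voice back to its target level.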

Meanwhile, because of room reverberation, where people actually sit, how they move while speaking, and similar factors, applying the DRC may cause problems such as noise being amplified along with the voice. In this case, the speaker separation can be improved by minimizing the attack time of the DRC.

Additionally, if sufficient speaker separation is still not achieved after adjusting the setting angle and setting angle range of the DOA masking or the threshold or attack time of the DRC, the input gain of the first microphone signal or the second microphone signal may be adjusted. That is, limiting the input gain to less than a limit value reduces the level of the noise that passes through DOA masking, so the separation performance of DOA masking can be improved. However, the input gain may be kept at or above the minimum input level that the automatic speech recognition (ASR) engine can recognize.
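
The two bounds on the input gain can be expressed as a simple clamp. In this sketch the parameter names and the use of decibel values are assumptions; in particular, `asr_minimum_db` stands for a hypothetical, separately measured gain below which the ASR engine no longer recognizes the speech.

```python
def limit_input_gain(requested_gain_db, upper_limit_db, asr_minimum_db):
    """Keep the input gain between the ASR minimum and the upper limit."""
    if asr_minimum_db > upper_limit_db:
        raise ValueError("ASR minimum must not exceed the upper limit")
    # Cap the gain first, then make sure it stays usable for ASR.
    return max(asr_minimum_db, min(requested_gain_db, upper_limit_db))
```

Note that the clamp allows the gain to equal the upper limit, whereas the description speaks of keeping it below the limit value; a strict inequality would need a small margin.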

Meanwhile, depending on the embodiment, while a customer and an agent are in consultation, voices or noise from a consultation at an adjacent counter, or voices or noise from other customers waiting in the branch, may be picked up. To remove such noise, the multi-channel signal processing unit 140 may classify each noise as voice noise or non-voice noise and remove it accordingly.

Specifically, the multi-channel signal processing unit 140 may apply VAD (Voice Activity Detection) to the first microphone signal. VAD is an algorithm that distinguishes voice signals from non-voice signals based on a DNN (Deep Neural Network); with VAD, voice contained in the first microphone signal can be detected. Accordingly, the multi-channel signal processing unit 140 may apply VAD to detect non-voice noise contained in the first microphone signal and remove the detected non-voice noise from the first microphone signal. Voice noise may be removed by applying DRC or by performing preprocessing that controls the EQ, gain, and the like of the first microphone signal.
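
The removal of non-voice noise based on a VAD decision can be sketched as frame gating. The specification describes a DNN-based VAD; the detector below is only a crude energy and zero-crossing placeholder standing in for it, and all names and thresholds are assumptions for the example.

```python
import math


def placeholder_is_speech(frame, min_rms=0.01, max_zero_crossing_rate=0.25):
    """Crude stand-in for a DNN-based VAD: loud enough and not hiss-like."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0.0)
    rate = crossings / (len(frame) - 1)
    return rms >= min_rms and rate <= max_zero_crossing_rate


def remove_non_speech(frames, is_speech=placeholder_is_speech):
    """Replace every frame the detector rejects with silence."""
    return [list(f) if is_speech(f) else [0.0] * len(f) for f in frames]
```

Because the detector is passed in as `is_speech`, a trained model could be substituted for the placeholder without changing the gating step.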

According to another embodiment, the audio signal reproduced by the speaker 130 may enter both the first microphone array 110 and the second microphone array 120. That is, when a TTS audio signal, for example one explaining terms and conditions, is output from the speaker 130, that audio signal is contained in the first microphone signal and the second microphone signal, which may make it difficult to separate the first voice signal and the second voice signal accurately. In this case, the multi-channel signal processing unit 140 may remove the audio signal by applying different algorithms to the first microphone array 110 and the second microphone array 120.

Specifically, for the first microphone array 110, the audio signal may first be removed from the first microphone signal using DOA masking. VAD may then be applied to identify the audio signal contained in the first microphone signal, and the identified audio signal may be removed from the first microphone signal. In addition, preprocessing such as DRC may be performed to generate the first voice signal from which the audio signal has been removed.

For the second microphone array 120, the audio signal may be removed using AEC (Acoustic Echo Cancellation). The audio signal output from the speaker 130 may be stored in the face-to-face recording terminal device 100 or provided by an external server 200 or the like. The multi-channel signal processing unit 140 can therefore extract the audio signal to be output from the speaker 130 and apply AEC with that signal set as the echo reference. By using AEC to remove the echo reference from the second microphone signal received by the second microphone array 120, the multi-channel signal processing unit 140 can remove noise such as the audio signal, leaving the voice of the second speaker 2 contained in the second microphone signal.
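
The use of the speaker's audio signal as an echo reference can be illustrated with a normalized LMS (NLMS) adaptive filter, a widely used basis for AEC. The specification does not say which AEC algorithm is used, so NLMS, the filter length, and the step size here are assumptions for the example.

```python
def nlms_echo_cancel(mic_signal, echo_reference, taps=8, step_size=0.5, eps=1e-8):
    """Subtract an adaptively estimated echo of the reference from the mic signal."""
    weights = [0.0] * taps
    history = [0.0] * taps  # most recent reference samples, newest first
    residual = []
    for mic_sample, ref_sample in zip(mic_signal, echo_reference):
        history = [ref_sample] + history[:-1]
        echo_estimate = sum(w * x for w, x in zip(weights, history))
        error = mic_sample - echo_estimate
        # Normalize the update by the reference energy in the filter window.
        norm = sum(x * x for x in history) + eps
        weights = [w + step_size * error * x / norm
                   for w, x in zip(weights, history)]
        residual.append(error)
    return residual
```

The returned residual is the microphone signal with the estimated speaker contribution removed. This sketch has no double-talk handling, so when the second speaker talks while the speaker 130 is playing, adaptation would in practice need to be slowed or frozen.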

Additionally, the multi-channel signal processing unit 140 may transmit the first voice signal, the second voice signal, and the echo reference to the server 200, and the server 200 may perform STT on the first voice signal, the second voice signal, and the echo reference using a speech recognition engine (not shown). That is, a transcript may be generated so that it can be confirmed that the consultation between the customer and the agent and the guidance output through the speaker 130 were actually delivered and explained to the customer. The server 200 may then monitor for incomplete sales based on each transcript. Because the multi-channel signal processing unit 140 performs speaker separation in a simple manner instead of using a conventional, complex source separation algorithm, the first voice signal, the second voice signal, and the echo reference can each be provided to the server 200 in real time, and the server 200 can monitor for incomplete sales in real time on that basis.

Meanwhile, depending on the embodiment, one of the speakers 1 and 2 may be registered in advance through deep learning-based feature extraction, and DOA-based speaker separation may be performed for the remaining speaker. For example, in a bank the agent stays the same while customers change, so the agent's voice characteristics may be registered to identify the agent's voice signal accurately, while the customer's voice signal is extracted based on DOA, enabling more accurate voice separation.

FIGS. 4 to 6 are exemplary diagrams showing the channel separation performance obtained with a face-to-face recording terminal device according to an embodiment of the present invention.

Referring to FIG. 4, when the face-to-face recording terminal device according to an embodiment of the present invention is used, channel separation with a 23 dB difference is achieved when the employee (first speaker) and the customer (second speaker) each speak. FIG. 5 shows the case in which the customer speaks while the speaker is outputting audio; even then, channel separation of 42 dB is achieved. FIG. 6 shows the case in which the customer and the employee speak simultaneously; even with simultaneous speech, channel separation of at least 17 dB can be achieved. It can thus be confirmed that the voices of the customer and the employee can be clearly separated even when they speak at the same time.
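
For reference, a decibel figure of this kind can be related to signal amplitudes as in the sketch below. It assumes that channel separation is measured as the RMS level difference between the channel carrying the wanted voice and the channel into which that voice leaks; the specification does not state how the figures were measured, and the function names are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def channel_separation_db(wanted_channel, leakage_channel):
    """Level difference in dB between the wanted channel and the leakage."""
    return 20.0 * math.log10(rms(wanted_channel) / rms(leakage_channel))


def db_to_amplitude_ratio(level_db):
    """Convert a level difference in dB to an amplitude ratio."""
    return 10.0 ** (level_db / 20.0)
```

Under this assumption, 23 dB corresponds to an amplitude ratio of about 14, 42 dB to about 126, and 17 dB to about 7.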

FIG. 7 is a flowchart showing a face-to-face recording method using a face-to-face recording terminal device according to an embodiment of the present invention. Here, the face-to-face recording terminal device may include a first microphone array, a second microphone array, and a speaker.

Referring to FIG. 7, a first microphone array facing the first speaker generates a first microphone signal, a second microphone array facing the second speaker generates a second microphone signal, and a speaker outputs a preset audio signal (S10). The first microphone array and the second microphone array may each include a plurality of microphones, and the speaker may output, as an audio signal, TTS of the guidance essential for a complete sale, such as an explanation of the terms and conditions of the financial product the agent intends to sell. Depending on the embodiment, the first microphone array and the second microphone array may each be provided within an internal sealing portion. The internal sealing portion may contain a sound insulation member, which can block the audio signal output from the speaker from reaching the first microphone array and the second microphone array through the interior of the face-to-face recording terminal device.

Thereafter, the face-to-face recording terminal device may separate, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker, a second voice signal corresponding to the voice of the second speaker, and a third voice signal corresponding to the audio signal output from the speaker (S20). That is, the face-to-face recording terminal device may apply DOA masking to the first microphone signal and the second microphone signal based on a setting angle and setting angle range preset according to the position of each speaker. Of the various acoustic signals entering the first microphone array and the second microphone array, all except those incident within the setting angle range can thereby be treated as noise. After DOA masking, preprocessing may be performed on the DOA-masked first microphone signal to extract the first voice signal from it. The face-to-face recording terminal device may perform this preprocessing using DRC (Dynamic Range Control), EQ (Audio Equalizer), NR (Noise Reduction), AGC (Automatic Gain Control), and the like. The second microphone signal may be preprocessed in the same way.
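
The idea of keeping only sound that arrives within the setting angle range can be sketched for a single pair of microphones. This is a simplified, frame-level gate and not the device's DOA masking: it estimates one arrival angle per frame from the inter-microphone delay found by cross-correlation, and the two-microphone geometry, the function names, and the sign convention (positive angles toward `mic_a`) are all assumptions for the example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0


def estimate_doa_deg(mic_a, mic_b, sample_rate_hz, spacing_m):
    """Arrival angle from broadside in degrees; positive means nearer mic_a."""
    n = len(mic_a)
    max_lag = int(spacing_m * sample_rate_hz / SPEED_OF_SOUND_M_S)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlation of mic_a with mic_b shifted by `lag` samples.
        score = sum(mic_a[i] * mic_b[i + lag]
                    for i in range(max(0, -lag), min(n, n - lag)))
        if score > best_score:
            best_lag, best_score = lag, score
    sin_theta = best_lag * SPEED_OF_SOUND_M_S / (sample_rate_hz * spacing_m)
    return math.degrees(math.asin(max(-1.0, min(1.0, sin_theta))))


def doa_gate(mic_a, mic_b, sample_rate_hz, spacing_m,
             setting_angle_deg, setting_range_deg):
    """Pass the frame only if its arrival angle lies within the set range."""
    angle = estimate_doa_deg(mic_a, mic_b, sample_rate_hz, spacing_m)
    if abs(angle - setting_angle_deg) <= setting_range_deg:
        return list(mic_a)
    return [0.0] * len(mic_a)
```

The angular resolution of this sketch is limited by the integer sample delay, and a practical implementation would typically estimate the direction per time-frequency bin across the whole array rather than per frame.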

Meanwhile, depending on the embodiment, while a customer and an agent are in consultation, voices or noise from a consultation at an adjacent counter, or voices or noise from other customers waiting in the branch, may be picked up. To remove such noise, each noise may be classified as voice noise or non-voice noise and removed accordingly. Specifically, the face-to-face recording terminal device may apply VAD (Voice Activity Detection) to the first microphone signal. With VAD, voice contained in the first microphone signal can be detected. Accordingly, VAD may be applied to detect non-voice noise contained in the first microphone signal, and the detected non-voice noise may be removed from the first microphone signal. Voice noise may be removed by applying DRC for removing voice noise and the like, or by performing preprocessing that controls the EQ, gain, and the like of the first microphone signal.

According to another embodiment, the audio signal reproduced by the speaker may enter both the first microphone array and the second microphone array. In this case, the face-to-face recording terminal device may remove the audio signal by applying different algorithms to the first microphone array and the second microphone array.

Specifically, for the first microphone array, the audio signal may first be removed from the first microphone signal using DOA masking. VAD may then be applied to identify the audio signal contained in the first microphone signal, and the identified audio signal may be removed from the first microphone signal. In addition, preprocessing such as DRC may be performed to generate the first voice signal from which the audio signal has been removed.

For the second microphone array, the audio signal may be removed using AEC (Acoustic Echo Cancellation). That is, the face-to-face recording terminal device can extract the audio signal to be output from the speaker and apply AEC with that signal set as the echo reference. By using AEC to remove the echo reference from the second microphone signal received by the second microphone array, noise such as the audio signal can be removed, leaving the voice of the second speaker contained in the second microphone signal.

Additionally, the face-to-face recording terminal device may transmit the first voice signal, the second voice signal, and the echo reference to the server, and the server may perform STT on the first voice signal, the second voice signal, and the echo reference. That is, a transcript may be generated so that it can be confirmed that the consultation between the customer and the agent and the guidance output through the speaker were actually delivered and explained to the customer. The server may then monitor for incomplete sales based on each transcript.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium may store a computer-executable program continuously, or store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples include recording or storage media managed by app stores that distribute applications, or by sites and servers that supply or distribute various other software. Accordingly, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

The present invention is not limited by the embodiments described above or by the accompanying drawings. It will be apparent to those of ordinary skill in the art to which the present invention pertains that the components according to the present invention may be substituted, modified, and changed without departing from the technical spirit of the present invention.

100: Face-to-face recording terminal device 110: First microphone array
120: Second microphone array 130: Speaker
140: Multi-channel signal processing unit

Claims

A face-to-face recording terminal device provided between two speakers, comprising:
a first microphone array that faces a first speaker and generates a first microphone signal;
a second microphone array that faces a second speaker and generates a second microphone signal;
a speaker that faces the second speaker and outputs a preset audio signal; and
a multi-channel signal processing unit that separates, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker, a second voice signal corresponding to the voice of the second speaker, and a third voice signal corresponding to the audio signal output from the speaker,
wherein the multi-channel signal processing unit
transmits the third voice signal to a server so that STT (Speech-to-Text) is performed,
removes the audio signal contained in the first microphone signal and the second microphone signal by applying different algorithms to the first microphone signal and the second microphone signal,
applies VAD (Voice Activity Detection) to the first microphone signal to identify the audio signal output from the speaker and remove it from the first microphone signal, and
removes the audio signal from the second microphone signal by applying AEC (Acoustic Echo Cancellation) that uses the audio signal as an echo reference signal.

The face-to-face recording terminal device of claim 1, wherein the multi-channel signal processing unit
applies DOA (Direction of Arrival) masking to the first microphone signal based on a setting angle set according to the position of the first speaker, and performs preprocessing on the DOA-masked first microphone signal to extract the first voice signal from the first microphone signal.

The face-to-face recording terminal device of claim 2, wherein the multi-channel signal processing unit
performs the preprocessing on the first microphone signal using at least one of DRC (Dynamic Range Control), EQ (Audio Equalizer), NR (Noise Reduction), and AGC (Automatic Gain Control).

The face-to-face recording terminal device of claim 3, wherein the multi-channel signal processing unit
uses the DRC to attenuate those components of the DOA-masked first microphone signal whose level is below a threshold and to amplify those whose level is at or above the threshold, thereby separating the first voice signal from noise.

The face-to-face recording terminal device of claim 3, wherein the multi-channel signal processing unit
limits the input gain of the received first microphone signal or second microphone signal to less than a threshold value.

The face-to-face recording terminal device of claim 1, wherein the multi-channel signal processing unit
removes non-voice noise contained in the first microphone signal by applying VAD (Voice Activity Detection) to the first microphone signal or the second microphone signal.

The face-to-face recording terminal device of claim 1, wherein the first microphone array and the second microphone array are
provided in an internal sealing portion that includes a sound insulation member, so that the audio signal output from the speaker is blocked from entering through the interior of the face-to-face recording terminal device.

(Deleted)

The face-to-face recording terminal device of claim 1, wherein the multi-channel signal processing unit
transmits the echo reference to a server so that STT (Speech-to-Text) is performed.

A face-to-face recording method using a face-to-face recording terminal device including a first microphone array, a second microphone array, and a speaker, the method comprising:
a step in which the first microphone array, facing a first speaker, generates a first microphone signal, the second microphone array, facing a second speaker, generates a second microphone signal, and the speaker, facing the second speaker, outputs a preset audio signal; and
a step of separating, from the first microphone signal or the second microphone signal, a first voice signal corresponding to the voice of the first speaker, a second voice signal corresponding to the voice of the second speaker, and a third voice signal corresponding to the audio signal output from the speaker,
wherein the separating step includes
transmitting the third voice signal to a server so that STT (Speech-to-Text) is performed,
removing the audio signal contained in the first microphone signal and the second microphone signal by applying different algorithms to the first microphone signal and the second microphone signal,
applying VAD (Voice Activity Detection) to the first microphone signal to identify the audio signal output from the speaker and remove it from the first microphone signal, and
removing the audio signal from the second microphone signal by applying AEC (Acoustic Echo Cancellation) that uses the audio signal as an echo reference signal.