KR102158739B1

KR102158739B1 - System, device and method of automatic translation

Info

Publication number: KR102158739B1
Application number: KR1020180043284A
Authority: KR
Inventors: 최무열; 이민규; 김상훈; 윤승
Original assignee: 한국전자통신연구원
Priority date: 2017-08-03
Filing date: 2018-04-13
Publication date: 2020-09-22
Also published as: KR20190015081A

Abstract

An automatic interpretation device using a two-channel voice signal according to the present invention includes: a communication module that transmits and receives data to and from an earset device including a speaker and first and second microphones; a memory storing a program for generating an interpretation result using the two-channel voice signal; and a processor that executes the program stored in the memory. By executing the program, the processor compares a first voice signal, which is the speaker's voice received through the first microphone, with a second voice signal, which is received through the second microphone and includes a noise signal and the speaker's voice. Based on the comparison result, the processor extracts the speaker's voice contained in the first and second voice signals, using both signals or one of them selectively, and performs automatic interpretation.

Description

Automatic interpretation system, device and method {SYSTEM, DEVICE AND METHOD OF AUTOMATIC TRANSLATION}

The present invention relates to an automatic interpretation system, device, and method using a two-channel voice signal.

Thanks to recent improvements in speech recognition technology, automatic interpretation on smartphones has become widely used, and many applications support multilingual services. However, although improved speech recognition performance and multilingual support have raised expectations for automatic interpretation, these applications still fail to work smoothly in situations where interpretation is actually needed.

In particular, automatic interpretation using a smartphone may suffer from the following problems.

First, users usually share a single device to interpret between two languages. Because they take turns speaking with the device placed between them, the exchange stays at the level of simple questions and answers made up of short utterances, rather than becoming a natural conversation.

Second, when the users each run the same application on their own devices, the inconvenience of sharing a device and taking turns is reduced, but the translated text and synthesized speech in the target language are not delivered to the other party unless the two devices are paired.

Third, when the users' devices are paired with each other and hands-free earsets are used, a more natural conversation becomes possible. However, when the two users talk at close range, a crosstalk problem arises in which each user's speech is picked up by the microphone of the other user's hands-free earset.

The present invention aims to provide an automatic interpretation system, device, and method using a two-channel voice signal that block speech interference between speakers and enable speech recognition that is robust to external noise, so that multiple users can converse at a natural level in an automatic interpretation situation.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a technical means for solving the above technical problem, an automatic interpretation device using a two-channel voice signal according to a first aspect of the present invention includes: a communication module that transmits and receives data to and from an earset device including a speaker and first and second microphones; a memory storing a program for generating an interpretation result using the two-channel voice signal; and a processor that executes the program stored in the memory. By executing the program, the processor compares a first voice signal, which is the speaker's voice received through the first microphone, with a second voice signal, which is received through the second microphone and includes a noise signal and the speaker's voice. Based on the comparison result, the processor extracts the speaker's voice contained in the first and second voice signals, using both signals or one of them selectively, and performs automatic interpretation.

The first microphone may be located inside an earbud of the earset device, and the second microphone may be located outside the earbud, within a preset distance from the speaker's mouth.

The first voice signal and the second voice signal may be converted into a one-channel signal in the earset device and received through the communication module, and the processor may decode the one-channel signal to extract the first and second voice signals, which constitute the two-channel voice signal.

The processor may detect the speaker's voice sections and pause sections from the first voice signal, detect the noise signal in the sections of the second voice signal that correspond to the detected pause sections, and extract the speaker's voice based on the signal-to-noise ratio of the detected noise signal.
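The pause-based noise measurement described in the preceding paragraph can be illustrated with a minimal Python sketch. The document does not specify how voice sections are detected or how the signal-to-noise ratio is computed, so everything here is an assumption made for illustration: the function names, the fixed frame length, the energy threshold, and the use of a simple frame-energy detector on the first (in-ear) channel.

```python
import math


def frame_energies(samples, frame_len):
    """Mean squared amplitude of each complete frame."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_speech_frames(first_signal, frame_len, threshold):
    """Hypothetical detector: a frame counts as a voice section when the
    first (in-ear) channel exceeds an assumed energy threshold."""
    return [e > threshold for e in frame_energies(first_signal, frame_len)]


def estimate_snr_db(first_signal, second_signal, frame_len=160, threshold=1e-3):
    """Estimate the SNR of the second (outer) channel in dB.

    Pause sections are located on the first channel; the same frames of
    the second channel are treated as noise only. The power measured in
    voice frames still contains noise, so this is a rough estimate.
    """
    is_speech = detect_speech_frames(first_signal, frame_len, threshold)
    outer = frame_energies(second_signal, frame_len)
    speech = [e for e, s in zip(outer, is_speech) if s]
    noise = [e for e, s in zip(outer, is_speech) if not s]
    if not speech or not noise:
        return None  # both voice and pause frames are needed
    noise_power = sum(noise) / len(noise)
    speech_power = sum(speech) / len(speech)
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(speech_power / noise_power)
```

The key design point taken from the text is that the pause decision is made on the first channel, which is largely shielded from outside sound, while the noise level is read from the second channel.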

When the signal-to-noise ratio of the noise signal falls within a preset range, the processor may compute a weighted sum of the first and second voice signals and extract the speaker's voice from the weighted sum.

When the noise signal contains a plurality of noises, the processor may generate an acoustic model for each type of noise and perform the automatic interpretation using the acoustic models.

When the signal-to-noise ratio of the noise signal exceeds the preset range, the processor may extract the speaker's voice contained in the second voice signal and perform the automatic interpretation.

When the signal-to-noise ratio of the noise signal is below the preset range, the processor may extract the speaker's voice contained in the first voice signal and perform the automatic interpretation.
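The three cases above (within, above, and below the preset range) can be sketched as a single selection function. This is only an illustration under stated assumptions: the range limits `low_db` and `high_db` are hypothetical values, and the linear interpolation used to pick the weights is an assumption, since the document says only that the two signals are weighted and summed.

```python
def combine_channels(first_signal, second_signal, snr_db,
                     low_db=5.0, high_db=20.0):
    """Choose or blend the two channels according to the measured SNR.

    low_db and high_db are hypothetical bounds of the preset range.
    """
    if snr_db > high_db:
        # Little noise: the outer (second) microphone is used alone.
        return list(second_signal)
    if snr_db < low_db:
        # Heavy noise: the in-ear (first) microphone is used alone.
        return list(first_signal)
    # Within the range: weighted sum. The linear weight is an assumption.
    w = (snr_db - low_db) / (high_db - low_db)
    return [(1.0 - w) * a + w * b for a, b in zip(first_signal, second_signal)]
```

With this choice of weights the output moves smoothly from the first signal at the lower bound to the second signal at the upper bound, which avoids an abrupt switch at the range limits.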

In addition, an automatic interpretation method using a two-channel voice signal in an automatic interpretation device according to a second aspect of the present invention includes: receiving, from an earset device, first and second voice signals converted into a one-channel signal; decoding the one-channel signal into the first and second voice signals, which constitute a two-channel voice signal; comparing the first and second voice signals; extracting, based on the comparison result, the speaker's voice contained in the first and second voice signals, using both signals or one of them selectively; and performing automatic interpretation on the extracted speaker's voice. Here, the first voice signal includes the speaker's voice received through a first microphone of the earset device, and the second voice signal includes a noise signal and the speaker's voice received through a second microphone of the earset device.

The step of comparing the first and second voice signals may include detecting the speaker's voice sections and pause sections from the first voice signal, and detecting the noise signal in the sections of the second voice signal that correspond to the detected pause sections.

In the step of extracting the speaker's voice from both or one of the first and second voice signals based on the comparison result, the speaker's voice may be extracted based on the signal-to-noise ratio of the detected noise signal.

The step of extracting the speaker's voice from both or one of the first and second voice signals based on the comparison result may include: determining whether the signal-to-noise ratio of the noise signal falls within a preset range; computing a weighted sum of the first and second voice signals when the ratio falls within the preset range; and extracting the speaker's voice from the weighted sum.

In the automatic interpretation method according to the present invention, when the noise signal contains a plurality of noises, an acoustic model may be generated for each type of noise.

The step of extracting the speaker's voice from both or one of the first and second voice signals based on the comparison result may include: determining whether the signal-to-noise ratio of the noise signal falls within a preset range; and, when the ratio exceeds the preset range, extracting the speaker's voice contained in the second voice signal.

The step of extracting the speaker's voice from both or one of the first and second voice signals based on the comparison result may include: determining whether the signal-to-noise ratio of the noise signal falls within a preset range; and, when the ratio is below the preset range, extracting the speaker's voice contained in the first voice signal.

In addition, an automatic interpretation system using a two-channel voice signal according to a third aspect of the present invention includes an earset device and an automatic interpretation device. The earset device receives a first voice signal, which is the speaker's voice, through a first microphone located inside an earbud, receives a second voice signal including a noise signal and the speaker's voice through a second microphone located outside the earbud, and converts the first and second voice signals into a one-channel signal for transmission. The automatic interpretation device receives the one-channel signal, extracts the first and second voice signals, which constitute the two-channel voice signal, and, based on the result of comparing the first and second voice signals, extracts the speaker's voice contained in them, using both signals or one of them selectively, to perform automatic interpretation.

The automatic interpretation device may detect the speaker's voice sections and pause sections from the first voice signal, detect the noise signal in the sections of the second voice signal that correspond to the detected pause sections, and extract the speaker's voice contained in the first and second voice signals based on the signal-to-noise ratio of the detected noise signal.

When the signal-to-noise ratio of the noise signal falls within a preset range, the automatic interpretation device may compute a weighted sum of the first and second voice signals and extract the speaker's voice from the weighted sum to perform the automatic interpretation.

When the signal-to-noise ratio of the noise signal exceeds the preset range, the automatic interpretation device may extract the speaker's voice contained in the second voice signal to perform the automatic interpretation; when the ratio is below the preset range, it may extract the speaker's voice contained in the first voice signal to perform the automatic interpretation.

According to any one of the above means for solving the problem, the crosstalk problem can be resolved in automatic interpretation situations that use free-form conversational speech recognition.

In addition, by using the two-channel voice signal to generate and apply acoustic models specialized for the noise sections and noise types, an automatic interpretation service whose speech recognition performance is robust to external noise can be provided.

FIG. 1 is a diagram for explaining an automatic interpretation system according to an embodiment of the present invention.
FIG. 2 is a diagram showing an embodiment of an earset device.
FIG. 3 is a block diagram of the earset device.
FIG. 4 is a diagram for explaining a one-channel signal.
FIG. 5 is a block diagram of an automatic interpretation device according to an embodiment of the present invention.
FIG. 6 is a diagram showing a sample of voice waveforms restored as a two-channel voice signal.
FIG. 7 is a flowchart of an automatic interpretation method according to an embodiment of the present invention.
FIG. 8 is a flowchart for explaining how the speaker's voice is extracted by applying both or one of the first and second voice signals.

Hereinafter, embodiments of the present application are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present application pertains can easily practice them. The present application may, however, be implemented in many different forms and is not limited to the embodiments described herein. Parts unrelated to the description are omitted from the drawings for clarity, and similar reference numerals denote similar parts throughout the specification.

Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

The present invention relates to an automatic interpretation system 1, a device 200, and a method.

Most smartphone applications that currently offer automatic interpretation are designed for two people who speak different languages to share one device.

Taking Korean and English as an example, when the Korean speaker speaks, the English interpretation is displayed or spoken aloud; conversely, when the foreign speaker speaks English, the Korean interpretation is spoken aloud.

This is a minimal form of communication rather than a natural conversation, and the manner of speaking is inevitably unnatural. To enable interpretation in the form of a natural conversation between two people, it is therefore necessary to move away from the existing usage pattern of holding the device in hand and looking at the interpretation app.

First, the user's hands must be freed from the smartphone device that performs the interpretation. This requires an earset that replaces the device's microphone and speaker. When two people converse while each wearing such an earset, natural speech becomes possible.

In this case, however, a problem arises that must be solved from the standpoint of speech recognition. When one device is shared, the device's microphone is used and the two participants take turns speaking, so apart from external noise, only the voice of the person currently speaking enters the microphone.

When two people converse while each wearing an earset, however, each earset microphone picks up not only its wearer's speech but also the other person's.

For a natural interpreted conversation, the two people come within 1 meter of each other, and the high microphone sensitivity of recent earsets means the microphone can respond to the other party's voice as well as the wearer's own.

For example, in Korean-English interpretation, the Korean speaker's speech is input to the Korean recognizer through his or her own earset and, at the same time, to the English recognizer through the earset worn by the other party. The reverse is also true. In this case both speech recognizers malfunction, which disrupts the interpreted conversation.

Solving this critical problem requires a function that prevents crosstalk.

One existing approach to preventing crosstalk is language identification technology. This technology identifies the speaker's language and accepts only a specific language as input to the corresponding speech recognizer, so each speech recognizer receives only one language. However, when a language identification error occurs, interpretation becomes impossible.

To allow for such language identification failures, a method of simultaneously running recognizers for a plurality of preset languages is also used. However, because it is difficult to preset the users' languages, the interpretation function is limited.

To resolve this crosstalk problem, the automatic interpretation system 1, device 200, and method according to an embodiment of the present invention use a speaker-utterance detection function, by which speech recognition is performed on the input of the microphone 120 of the earset device 100 only when the wearer is the one speaking. This allows two or more users to converse at a natural level in an automatic interpretation situation.

Hereinafter, the automatic interpretation system 1 and the device 200 according to an embodiment of the present invention are described with reference to FIGS. 1 to 6.

FIG. 1 is a diagram for explaining the automatic interpretation system 1 according to an embodiment of the present invention.

The automatic interpretation system 1 according to an embodiment of the present invention includes an earset device 100 and an automatic interpretation device 200.

The earset device 100 may be a hands-free earset device and may be paired with the automatic interpretation device 200 through Bluetooth Low Energy (BLE), although the pairing method is not necessarily limited to Bluetooth.

When the automatic interpretation device 200 wirelessly receives the two-channel voice signal from the earset device 100, it decodes the signal, extracts the speaker's voice, and then performs automatic interpretation.

Here, the automatic interpretation device 200 is an intelligent terminal that adds computer-support functions such as Internet communication and information retrieval to a portable terminal. It may be a mobile phone, smartphone, pad, smart watch, wearable terminal, or other mobile communication terminal on which the user can install and run a number of desired application programs (i.e., applications).

FIG. 2 is a diagram showing an embodiment of the earset device 100. FIG. 3 is a block diagram of the earset device 100. FIG. 4 is a diagram for explaining the one-channel signal.

As shown in FIG. 2, the earset device 100 according to an embodiment of the present invention includes a speaker and a first microphone 120 located inside an earbud 110, a second microphone 130 located outside the earbud 110, a Bluetooth chipset 140, and a battery 150.

The earset device 100 may be, for example, a Bluetooth neckband-type earset, but is not limited thereto; any device equipped with a speaker and microphones may be used.

The neckband-type earset device 100 is worn around the user's neck and has two earbuds 110 that are inserted into the ears, each provided with a speaker. At least one of the earbuds 110 contains the first microphone 120, which receives the first voice signal, that is, the speaker's voice. Because the first microphone 120 is located inside the earbud 110, most external noise is shielded, and only the voice of the speaker wearing the earset device 100 is received relatively cleanly.

On the other hand, the first voice signal acquired through the first microphone 120 inside the earbud 110 has markedly lower sound quality than a signal acquired with an ordinary microphone.

To compensate for this, the earset device 100 in an embodiment of the present invention uses a two-channel voice signal, obtained through the first microphone 120 together with the second microphone 130 located outside the earbud 110, and can thereby improve speech recognition performance.

The second microphone 130 receives the second voice signal, which includes a noise signal and the speaker's voice. Here, the noise signal is a broad concept: it covers the external noise captured along with the voice of the speaker wearing the earset device 100, as well as the voices of other speakers.

The second microphone 130 is located outside the earbud 110, preferably within a preset distance from the speaker's mouth, which is a position suitable for acquiring the speaker's voice. As such a position, the second microphone 130 may be located on the body of the neckband in a neckband-type device, or on the wire in an ordinary wired earset device.

Meanwhile, the first voice signal acquired by the first microphone 120 may be used to authenticate the user wearing the earset device 100. The second microphone 130 is an ordinary microphone, so the second voice signal is an external signal of the kind acquired by an ordinary microphone.

Here, the first voice signal used for authenticating the user refers to a signal acquired by a microphone that is attached to the speaker's body for a special purpose, such as a signal resonating inside the user's ear or a signal sensed from the vibration of the vocal cords at the neck.

The first voice signal either does not respond to sounds other than the speaker's own or, even when it does, can be clearly distinguished from the signal produced by the speaker's own speech. Using the first voice signal, the user's voice alone can be distinguished even when two or more speakers talk at once, and in particular the start and end of the user's utterance can be determined accurately.

In other words, because cases in which someone other than the wearer is speaking can be identified accurately, the system is free from interference caused by the other party's voice. Moreover, as described later, the utterance section bounded by the start and end of the speaker's utterance (End Point Detection, EPD) can be accurately separated from the pause section. Only the external noise signal occurring in the pause section can therefore be extracted, and the type of noise in the extracted noise signal can be determined so that an acoustic model specialized for that noise is used.

도 3을 참조하면, 본 발명의 일 실시예에 따른 이어셋 장치(100)의 블루투스 칩셋(CSR8670, 140)은 오디오 인터페이스 모듈(141), 암 프로세서(STM32L, 142), DSP 모듈(143) 및 블루투스 모듈(144)을 포함한다.Referring to FIG. 3, the Bluetooth chipsets CSR8670 and 140 of the earset device 100 according to an embodiment of the present invention include an audio interface module 141, a female processor (STM32L, 142), a DSP module 143, and a Bluetooth device. Module 144.

오디오 인터페이스 모듈(141)은 제 1 및 제 2 마이크(120, 130)에 의해 취득된 2채널 음성신호인 제 1 및 제 2 음성신호를 수신한다. 2채널 음성신호는 오디오 인터페이스 모듈(141)을 거쳐 I2S 디지털 오디오 단자로 출력된다.The audio interface module 141 receives first and second audio signals, which are two-channel audio signals acquired by the first and second microphones 120 and 130. The two-channel audio signal is output to the I2S digital audio terminal through the audio interface module 141.

Meanwhile, all existing Bluetooth chipsets accept two microphone inputs, but after the signals pass through the DSP module 143 only one channel (a mono signal) is sent to the Bluetooth module 144 via the Hands-Free Profile (HFP) protocol.

Accordingly, the earset device 100 in an embodiment of the present invention additionally uses the ARM processor 142, which uses the Serial Port Profile (SPP) protocol for data communication instead of the conventional Hands-Free Profile, to convert the two-channel voice signal consisting of the first and second voice signals into a one-channel signal.

The one-channel signal converted by the ARM processor 142 may be arranged so that the two channels alternate byte by byte, as shown in FIG. 4. That is, the first voice signal is assigned to the '0' bytes and the second voice signal to the '1' bytes to form the one-channel signal. A voice coding scheme may be applied to the converted one-channel signal to prevent interference from nearby Bluetooth signals and to improve the transmission rate.
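The byte-wise alternation described above can be sketched as follows. This is only an illustration under stated assumptions: each channel is treated as a raw byte stream of equal length, channel 1 occupies the even byte offsets and channel 2 the odd ones, and the optional voice coding step is left out. The function name `interleave_bytes` is hypothetical and does not come from the patent.

```python
def interleave_bytes(ch1: bytes, ch2: bytes) -> bytes:
    """Build a one-channel stream whose bytes alternate between two channels.

    Assumption: even offsets (0, 2, 4, ...) carry the first voice signal and
    odd offsets (1, 3, 5, ...) carry the second voice signal.
    """
    if len(ch1) != len(ch2):
        raise ValueError("both channels must contain the same number of bytes")
    out = bytearray(2 * len(ch1))
    out[0::2] = ch1  # first voice signal (inner microphone)
    out[1::2] = ch2  # second voice signal (outer microphone)
    return bytes(out)


if __name__ == "__main__":
    frame = interleave_bytes(b"\x01\x02\x03", b"\xa1\xa2\xa3")
    print(frame.hex())  # 01a102a203a3
```

Keeping the two channels in strict alternation means a receiver needs no framing metadata to separate them, as long as it knows which offset belongs to which microphone.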

The one-channel signal formed in this way is input to the UART terminal and can be transmitted to the automatic interpretation device 200 through the Bluetooth module 144.

The detailed hardware specification of the earset device 100 described above is only an example; any combination of components with other specifications suffices as long as it can perform the functions described above.

Next, the automatic interpretation device 200 according to an embodiment of the present invention is described.

FIG. 5 is a block diagram of the automatic interpretation device 200 according to an embodiment of the present invention.

The automatic interpretation device 200 according to an embodiment of the present invention includes a communication module 210, a memory 220, and a processor 230.

The communication module 210 transmits and receives data to and from the earset device 100 and another speaker's automatic interpretation device 200. The communication module 210 may include both a wired communication module and a wireless communication module. The wired communication module may be implemented as a telephone-line communication device, cable home (MoCA), Ethernet, IEEE1294, an integrated wired home network, or an RS-485 control device. The wireless communication module may be implemented with WLAN (wireless LAN), Bluetooth, HDR WPAN, UWB, ZigBee, Impulse Radio, 60GHz WPAN, Binary-CDMA, wireless USB technology, wireless HDMI technology, and the like.

The memory 220 stores a program for generating an interpretation result using the two-channel voice signal. Here, the memory 220 collectively refers to non-volatile storage devices, which retain stored information even when power is not supplied, and volatile storage devices.

For example, the memory 220 may include NAND flash memory such as a compact flash (CF) card, a secure digital (SD) card, a memory stick, a solid-state drive (SSD), and a micro SD card; magnetic computer storage devices such as a hard disk drive (HDD); and optical disc drives such as CD-ROM and DVD-ROM.

The processor 230 executes the program stored in the memory 220. Specifically, when the processor 230 receives the one-channel signal from the earset device 100 through the communication module 210, it decodes the one-channel signal to restore the original two-channel voice signal, that is, the first and second voice signals.
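A minimal sketch of the restoration step, assuming the received one-channel stream alternates bytes between the two channels (even offsets for the first voice signal, odd offsets for the second) and that any voice coding has already been undone. The function name `deinterleave_bytes` is hypothetical.

```python
from typing import Tuple


def deinterleave_bytes(stream: bytes) -> Tuple[bytes, bytes]:
    """Split a byte-alternating one-channel stream back into two channels.

    Assumption: even offsets hold the first voice signal and odd offsets
    hold the second voice signal.
    """
    if len(stream) % 2 != 0:
        raise ValueError("stream must contain an even number of bytes")
    first = stream[0::2]   # inner-microphone channel
    second = stream[1::2]  # outer-microphone channel
    return first, second


if __name__ == "__main__":
    ch1, ch2 = deinterleave_bytes(bytes.fromhex("01a102a203a3"))
    print(ch1.hex(), ch2.hex())  # 010203 a1a2a3
```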

Then, based on the result of comparing the first and second voice signals, the processor 230 extracts the speaker's voice from both of the first and second voice signals or selectively from one of them, and performs automatic interpretation.

FIG. 6 is a diagram showing sample voice waveforms restored as a two-channel voice signal.

Of the two-channel voice signal, the first voice signal is obtained from the first microphone 120 located inside the earbud 110 of the earset device 100, and the second voice signal is obtained from the second microphone 130 located on the outside of the earset device 100.

As described above, because the second microphone 130 is located on the outside, the second voice signal contains not only the speaker's utterance but also the voices of nearby speakers and noise.

In contrast, because the first voice signal is obtained by the first microphone 120 located inside the earbud 110, its level is low, but it is relatively isolated from external noise, so only the user's own utterance is captured.

FIG. 6 shows the two-channel voice signal acquired by the earset device of 'Speaker A' and restored by the automatic interpretation device 200 of 'Speaker A' while 'Speaker A' and 'Speaker B' speak at the same time in a room where music is playing.

In the second voice signal, the voices of 'Speaker A' and 'Speaker B' and the music are all mixed, so the individual sections cannot be told apart, whereas the first voice signal can be seen to contain only the utterance of 'Speaker A'. In addition, in the first voice signal the pause section (B) between the voice sections (A, A') of 'Speaker A' can be clearly identified.

The processor 230 detects the speaker's voice sections (A, A') and pause section (B) from the first voice signal, detects the noise signal in the part of the second voice signal corresponding to the detected pause section (B), and can extract the speaker's voice based on the signal-to-noise ratio (SNR) of the detected noise signal.
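The section detection described above might look like the sketch below. The text does not say how voice and pause sections are detected, so a simple frame-energy threshold on the first (inner) signal is assumed here; `frame_len`, `threshold`, and both function names are hypothetical. The two signals are assumed to be sample-aligned lists of floats.

```python
import math
from typing import List, Sequence, Tuple


def frame_rms(samples: Sequence[float], frame_len: int) -> List[float]:
    """RMS level of each complete frame of `frame_len` samples."""
    n_frames = len(samples) // frame_len
    return [
        math.sqrt(
            sum(v * v for v in samples[i * frame_len:(i + 1) * frame_len]) / frame_len
        )
        for i in range(n_frames)
    ]


def split_outer_by_inner(
    inner: Sequence[float],
    outer: Sequence[float],
    frame_len: int,
    threshold: float,
) -> Tuple[List[float], List[float]]:
    """Use the inner signal to label frames, then split the outer signal.

    Returns (outer samples in voice sections, outer samples in pause sections).
    A frame counts as voice when the inner-signal RMS exceeds `threshold`.
    """
    if len(inner) != len(outer):
        raise ValueError("signals must be sample-aligned and of equal length")
    speech: List[float] = []
    noise: List[float] = []
    for i, level in enumerate(frame_rms(inner, frame_len)):
        segment = outer[i * frame_len:(i + 1) * frame_len]
        if level > threshold:
            speech.extend(segment)
        else:
            noise.extend(segment)
    return speech, noise
```

The outer-signal samples collected from pause frames are the noise estimate; the samples from voice frames feed the SNR calculation.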

Because of these characteristics, the first voice signal can be used to recognize when the speaker is speaking, and the part of the second voice signal corresponding to the pause section of the first voice signal makes it possible to determine the intensity and type of noise in the current surroundings.

Once the speaker's voice sections and pause sections have been separated using the first voice signal, the processor 230 uses them to calculate the signal-to-noise ratio of the noise signal detected in the second voice signal according to Equation 1 below.

[Equation 1]

Here, S_rms and N_rms denote the root mean square of the voice signal and of the noise signal, respectively.
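The body of Equation 1 did not survive in this text. A standard decibel form that is consistent with the definitions of S_rms and N_rms given here, and with the dB thresholds quoted later, would be the following; it is offered as an assumption, not as the patent's own formula.

```latex
\mathrm{SNR}\,[\mathrm{dB}] = 20 \log_{10}\!\left(\frac{S_{\mathrm{rms}}}{N_{\mathrm{rms}}}\right),
\qquad
x_{\mathrm{rms}} = \sqrt{\frac{1}{K}\sum_{k=1}^{K} x_k^{2}}
```

Here $K$ is the number of samples in the section being measured, a symbol introduced only for this illustration.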

Meanwhile, calculating the SNR from a single signal requires identifying the noise sections of the signal precisely, but in a single signal voice and noise are mixed, so accurate voice and noise sections are often hard to obtain.

In an embodiment of the present invention, however, the two-channel voice signal makes it possible to separate voice sections from noise sections accurately, so an accurate SNR can be calculated and the noise state of the signal determined.
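Once the voice and noise sections of the second voice signal are known, the SNR follows from their RMS levels. The sketch below assumes the usual decibel definition, 20·log10(S_rms / N_rms), and treats the voice-section samples as the signal; the function names are hypothetical.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root mean square of a non-empty sequence of samples."""
    if len(samples) == 0:
        raise ValueError("need at least one sample")
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def snr_db(speech: Sequence[float], noise: Sequence[float]) -> float:
    """SNR in dB from voice-section and pause-section samples.

    Assumption: SNR = 20 * log10(S_rms / N_rms).
    """
    s = rms(speech)
    n = rms(noise)
    if n == 0.0:
        return math.inf   # no measurable noise
    if s == 0.0:
        return -math.inf  # no measurable speech
    return 20.0 * math.log10(s / n)


if __name__ == "__main__":
    # A voice section ten times louder than the noise section is about 20 dB.
    print(round(snr_db([1.0] * 4, [0.1] * 4), 3))
```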

The calculated SNR value can be classified into three ambient-noise states. The signal is classified as noisy speech when the SNR falls within a preset range (for example, 5 to 15 dB), as strongly noisy speech when it is below the preset range (for example, 0 to 5 dB), and as clear speech when it exceeds the preset range (for example, 15 to 20 dB).
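The three-way classification can be written as a small function. The 5 dB and 15 dB defaults are the example values from the text; the state labels and the choice to treat the boundary values as inside the middle range are assumptions made for this sketch.

```python
def classify_noise_state(snr_db: float, low: float = 5.0, high: float = 15.0) -> str:
    """Map an SNR value to one of three ambient-noise states.

    Assumption: the preset range is the closed interval [low, high].
    """
    if snr_db < low:
        return "strong_noise"  # below the preset range
    if snr_db > high:
        return "clear"         # above the preset range
    return "noisy"             # within the preset range


if __name__ == "__main__":
    for value in (2.0, 10.0, 18.0):
        print(value, classify_noise_state(value))
```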

Depending on the classification, the processor 230 may apply the weighted-sum method of Equation 2 to the first and second voice signals and use the result as the input for final speech recognition.

[Equation 2]

Here, α is a weighting factor with a value of 0≤α≤1, s_m1 is the first voice signal, and s_m2 is the second voice signal.
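The body of Equation 2 is also missing from this text. One form consistent with the surrounding description, in which α = 0 selects the second voice signal and α = 1 selects the first, would be the convex combination below. It is an assumption, and the sample index $n$ and output symbol $s$ are introduced only for illustration.

```latex
s[n] = \alpha \, s_{m1}[n] + (1 - \alpha) \, s_{m2}[n], \qquad 0 \le \alpha \le 1
```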

When the signal-to-noise ratio of the noise signal exceeds the preset range, that is, when the signal is classified as clear speech, the processor 230 sets the weighting factor α to 0. The processor 230 can then extract the speaker's voice contained in the second voice signal and use it as the input for final speech recognition.

Likewise, when the signal-to-noise ratio of the noise signal is below the preset range, that is, when the signal is classified as strongly noisy speech, the processor 230 sets the weighting factor α to 1. The processor 230 can then extract the speaker's voice contained in the first voice signal and use it as the input for final speech recognition.

Because the first voice signal is acquired from the first microphone 120 inside the earbud 110, its sound quality is markedly lower than that of the second voice signal acquired by the external second microphone 130, and using it alone for speech recognition may degrade recognition performance.

Nevertheless, in a strongly noisy environment such as 0 to 5 dB, speech recognition using the second voice signal is nearly impossible, whereas the first voice signal still allows recognition at some cost in performance. A certain level of speech recognition performance can therefore be secured even in strongly noisy environments.

When the signal-to-noise ratio of the noise signal falls within the preset range, the processor 230 may set the weighting factor α to a value other than 0 and 1 in consideration of the SNR value and use the weighted sum of the two signals. In this case, the weighting factor α may be set in proportion to the SNR value within the preset range.
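The three cases for α can be combined into one sketch. Several points here are assumptions rather than the patent's stated method: the weighted sum is taken as α·s_m1 + (1 − α)·s_m2, and inside the preset range α is interpolated linearly so that it falls as the SNR rises. The text says only that α is set in proportion to the SNR; the falling direction is chosen here so that the in-range value approaches the two end cases (α = 1 below the range, α = 0 above it). The 5 dB and 15 dB defaults are the example values from the text, and the function names are hypothetical.

```python
from typing import List, Sequence


def weight_factor(snr_db: float, low: float = 5.0, high: float = 15.0) -> float:
    """Choose alpha from the SNR of the outer-microphone signal."""
    if snr_db > high:
        return 0.0  # clear speech: use only the second (outer) signal
    if snr_db < low:
        return 1.0  # strong noise: use only the first (inner) signal
    # Assumed linear interpolation inside the preset range.
    return (high - snr_db) / (high - low)


def weighted_sum(
    inner: Sequence[float], outer: Sequence[float], alpha: float
) -> List[float]:
    """Sample-wise alpha * inner + (1 - alpha) * outer (assumed form)."""
    if len(inner) != len(outer):
        raise ValueError("signals must be sample-aligned and of equal length")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(inner, outer)]


if __name__ == "__main__":
    alpha = weight_factor(10.0)
    print(alpha, weighted_sum([1.0, 1.0], [0.0, 2.0], alpha))
```

Because the two channels are produced sample-aligned, the sum can be taken element by element without any resynchronization.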

For a simple weighted sum of the two voice signals to be usable as the input for final speech recognition, the samples of the two signals must coincide exactly in both voice and non-voice sections. Because the two-channel voice signal in an embodiment of the present invention is generated so that the channels coincide sample by sample, the signals remain exactly aligned when they are weighted and summed.

With this feature, the processor 230 can analyze the section of the second voice signal corresponding to the pause section of the first voice signal and extract the speaker's voice from the weighted sum of the first and second voice signals.

In this case, when the noise signal contains one or more kinds of noise, the processor 230 may generate an acoustic model for each kind of noise.

In general, speech recognition performance is best when the acoustic environment of the input signal matches that of the speech recognition model. In theory, building acoustic models specialized for several acoustic environments and selecting among them according to the acoustic environment of the input signal yields a speech recognition system optimized for that environment.

However, knowing the acoustic environment of the input signal in advance first requires accurate discrimination of voice and non-voice sections, and an error in that discrimination can instead degrade speech recognition performance further.

In addition, because an acoustic model cannot be changed during recognition once it has been chosen, existing speech recognition systems use only a single acoustic model.

In contrast, the automatic interpretation device 200 according to an embodiment of the present invention analyzes the two-channel voice signal to detect the non-voice sections of the input signal more accurately, which makes it possible to identify the acoustic environment. It can therefore select an acoustic model specialized for the acoustic environment while avoiding the problems of the prior art.

For reference, the components shown in FIG. 5 according to an embodiment of the present invention may be implemented in software or in hardware such as an FPGA (Field Programmable Gate Array) or ASIC (Application Specific Integrated Circuit), and may perform predetermined roles.

However, 'components' are not limited to software or hardware; each component may be configured to reside in an addressable storage medium or configured to run on one or more processors.

Thus, as an example, components include software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables.

The components and the functionality provided within them may be combined into fewer components or further separated into additional components.

Hereinafter, an automatic interpretation method using a two-channel voice signal in the automatic interpretation device 200 according to an embodiment of the present invention is described with reference to FIGS. 7 and 8.

FIG. 7 is a flowchart of an automatic interpretation method according to an embodiment of the present invention. FIG. 8 is a flowchart illustrating how the speaker's voice is extracted by applying both of the first and second voice signals or selectively one of them.

In the automatic interpretation method according to an embodiment of the present invention, the first and second voice signals converted into a one-channel signal are first received from the earset device 100 (S110), and the one-channel signal is decoded into the two-channel voice signal, that is, the first and second voice signals (S120).

Here, the first voice signal contains the speaker's voice received through the first microphone 120 located inside the earbud 110 of the earset device 100, and the second voice signal contains a noise signal and the speaker's voice received through the second microphone 130 located outside the earbud 110.

Next, the first and second voice signals are compared (S130), and based on the comparison result the speaker's voice is extracted from both of the first and second voice signals or selectively from one of them (S140).

In the step of comparing the first and second voice signals, the speaker's voice sections and pause sections are detected from the first voice signal, and the noise signal is detected in the section of the second voice signal corresponding to the detected pause section.

Then, the signal-to-noise ratio of the detected noise signal is calculated (S141) and the speaker's voice is extracted. Specifically, it is determined whether the signal-to-noise ratio of the noise signal falls within the preset range; if it does (S142), the first and second voice signals are weighted and summed (S144), and the speaker's voice is extracted from the weighted sum (S146).

On the other hand, when the signal-to-noise ratio of the noise signal exceeds the preset range, the speaker's voice contained in the second voice signal is extracted (S143, S146), and when the signal-to-noise ratio of the noise signal is below the preset range, the speaker's voice contained in the first voice signal is extracted (S145, S146).

Next, automatic interpretation is performed on the extracted speaker's voice (S150).

In the above description, steps S110 to S150 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may be omitted as necessary, and the order of the steps may be changed. In addition, even where it is not repeated here, what has already been described for the automatic interpretation system 1 and the device 200 in FIGS. 1 to 6 also applies to the automatic interpretation method of FIGS. 7 and 8.

Although the method and system of the present invention have been described in connection with specific embodiments, some or all of their components or operations may be implemented using a computer system having a general-purpose hardware architecture.

The above description of the present invention is given for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

1: Automatic interpretation system
100: earset device
110: earbuds
120: first microphone
130: second microphone
140: Bluetooth chipset
150: battery
200: automatic interpretation device
210: communication module
220: memory
230: processor

Claims

An automatic interpretation device using a two-channel voice signal, comprising:
a communication module that transmits and receives data to and from an earset device including a speaker, a first microphone, and a second microphone, wherein the first microphone is located inside an earbud of the earset device, and the second microphone is located outside the earbud and within a preset distance from the mouth of the speaking user;
a memory storing a program for generating an interpretation result using the two-channel voice signal; and
a processor that executes the program stored in the memory,
wherein, by executing the program, the processor compares a first voice signal, which is the speaker's voice received through the first microphone, with a second voice signal, which includes a noise signal and the speaker's voice received through the second microphone, and performs automatic interpretation by extracting the speaker's voice from both of the first and second voice signals or selectively from one of them based on the comparison result, and
wherein the processor detects a voice section and a pause section of the speaker from the first voice signal, detects a noise signal in the section of the second voice signal corresponding to the detected pause section, and extracts the speaker's voice from at least one of the first voice signal and the second voice signal based on a signal-to-noise ratio of the detected noise signal.

delete

The device of claim 1,
wherein the first voice signal and the second voice signal are converted into a one-channel signal in the earset device and received through the communication module, and
the processor decodes the one-channel signal to extract the first and second voice signals, which form the two-channel voice signal.

delete

The device of claim 1,
wherein, when the signal-to-noise ratio of the noise signal falls within a preset range, the processor computes a weighted sum of the first and second voice signals and extracts the speaker's voice from the weighted sum.

The device of claim 5,
wherein, when the noise signal includes a plurality of noises, the processor generates an acoustic model corresponding to each type of noise and performs the automatic interpretation using the acoustic model.

The device of claim 1,
wherein, when the signal-to-noise ratio of the noise signal exceeds a preset range, the processor extracts the speaker's voice included in the second voice signal and performs the automatic interpretation.

The device of claim 1,
wherein, when the signal-to-noise ratio of the noise signal is below a preset range, the processor extracts the speaker's voice included in the first voice signal and performs the automatic interpretation.

An automatic interpretation method using a two-channel voice signal in an automatic interpretation device, comprising:
receiving first and second voice signals converted into a one-channel signal from an earset device;
decoding the one-channel signal into the first and second voice signals, which form the two-channel voice signal;
comparing the first and second voice signals;
extracting the speaker's voice from both of the first and second voice signals or selectively from one of them based on the comparison result; and
performing automatic interpretation on the extracted speaker's voice,
wherein the first voice signal includes the speaker's voice received through a first microphone inside an earbud of the earset device, and the second voice signal includes a noise signal and the speaker's voice received through a second microphone outside the earbud, and
wherein the extracting comprises detecting a voice section and a pause section of the speaker from the first voice signal, detecting a noise signal in the section of the second voice signal corresponding to the detected pause section, and extracting the speaker's voice from at least one of the first voice signal and the second voice signal based on a signal-to-noise ratio of the detected noise signal.

delete

The method of claim 9,
wherein extracting the speaker's voice from both of the first and second voice signals or selectively from one of them based on the comparison result comprises:
determining whether the signal-to-noise ratio of the noise signal falls within a preset range;
when the signal-to-noise ratio is determined to fall within the preset range, computing a weighted sum of the first and second voice signals; and
extracting the speaker's voice from the weighted sum of the first and second voice signals.

The method of claim 13,
further comprising, when the noise signal contains a plurality of noises, generating an acoustic model corresponding to each type of noise.

The method of claim 9,
wherein extracting the speaker's voice from both of the first and second voice signals or selectively from one of them based on the comparison result comprises:
determining whether the signal-to-noise ratio of the noise signal falls within a preset range; and
when the signal-to-noise ratio is determined to exceed the preset range, extracting the speaker's voice included in the second voice signal.

The method of claim 9,
wherein extracting the speaker's voice from both of the first and second voice signals or selectively from one of them based on the comparison result comprises:
determining whether the signal-to-noise ratio of the noise signal falls within a preset range; and
when the signal-to-noise ratio is determined to be below the preset range, extracting the speaker's voice included in the first voice signal.

An automatic interpretation system using a two-channel voice signal, comprising:
an earset device that receives a first voice signal, which is the speaker's voice, through a first microphone located inside an earbud, receives a second voice signal, which includes a noise signal and the speaker's voice, through a second microphone located outside the earbud, converts the first and second voice signals into a one-channel signal, and transmits the one-channel signal; and
an automatic interpretation device that receives the one-channel signal, extracts the first and second voice signals, which form the two-channel voice signal, and performs automatic interpretation by extracting the speaker's voice from both of the first and second voice signals or selectively from one of them based on a result of comparing the first and second voice signals,
wherein the automatic interpretation device detects a voice section and a pause section of the speaker from the first voice signal, detects a noise signal in the section of the second voice signal corresponding to the detected pause section, and extracts the speaker's voice from at least one of the first voice signal and the second voice signal based on a signal-to-noise ratio of the detected noise signal.

delete

The system of claim 17,
wherein, when the signal-to-noise ratio of the noise signal falls within a preset range, the automatic interpretation device computes a weighted sum of the first and second voice signals and extracts the speaker's voice from the weighted sum.

The system of claim 17,
wherein, when the signal-to-noise ratio of the noise signal exceeds a preset range, the automatic interpretation device extracts the speaker's voice included in the second voice signal and performs the automatic interpretation, and
when the signal-to-noise ratio of the noise signal is below the preset range, the automatic interpretation device extracts the speaker's voice included in the first voice signal and performs the automatic interpretation.