KR20230022789A

KR20230022789A - Server and method for automatic interpretation based on zero ui

Info

Publication number: KR20230022789A
Application number: KR1020220045016A
Authority: KR
Inventors: 윤승; 김상훈; 이민규; 맹준규
Original assignee: 한국전자통신연구원
Priority date: 2021-08-09
Filing date: 2022-04-12
Publication date: 2023-02-16

Abstract

Provided is a method performed by a zero UI-based automatic interpretation server that communicates with a plurality of terminal devices having a microphone function, a speaker function, a communication function, and a wearable function. The method comprises the steps of: connecting terminal devices located within a predetermined automatic interpretation zone; receiving a voice signal of a first user from a first terminal device among the terminal devices in the automatic interpretation zone; matching a plurality of users located within a speech reception range of the terminal device; and performing automatic interpretation of the voice signal and transmitting the automatic interpretation result to a second terminal device of at least one second user corresponding to the matching result. The present invention enables natural automatic interpretation conversations based on the zero UI in face-to-face situations.

Description

Zero UI based automatic interpretation server and method {SERVER AND METHOD FOR AUTOMATIC INTERPRETATION BASED ON ZERO UI}

The present invention relates to a zero UI-based automatic interpretation server and method, and more particularly to an automatic interpretation server and method based on a zero UI, which requires no user interface (UI) such as a display screen.

With advances in speech recognition, automatic translation, and speech synthesis, automatic interpretation technology has become widespread. Automatic interpretation is generally performed on a smartphone or a dedicated interpretation terminal.

The user touches the screen or clicks a button on the smartphone or dedicated terminal, then holds the device close to the mouth and speaks the sentence to be interpreted.

The smartphone or dedicated terminal then generates a translation of the user's utterance through speech recognition and automatic translation, and provides the result to the other party either by displaying the translated text on the screen or by outputting interpreted speech corresponding to the translation through speech synthesis.

A typical automatic interpretation process performed on a smartphone or dedicated terminal thus requires a touch or click on the device every time the user utters a sentence to be interpreted.

In addition, when it is uncertain which language the other party speaks, it can be unclear which language should be the interpretation target.

These factors greatly inconvenience users and hinder natural conversation.

Korean Laid-open Patent Publication No. 10-2021-0124050 (2021.10.14)

An object of the present invention is to provide an automatic interpretation server and method that automatically match and connect users located within an automatic interpretation zone, so that a user can converse naturally with the other party without performing an unnecessary action every time the user utters a sentence to be interpreted.

Another object of the present invention is to provide an automatic interpretation server and method that solve the problem of the user's and/or the other party's automatic interpretation terminal device malfunctioning when the user's voice is input to the other party's terminal device or, conversely, the other party's voice is input to the user's terminal device, and that also solve the problem of a terminal device malfunctioning when the voice of a third party who is not the interpretation counterpart is input.

However, the problems to be solved by the present invention are not limited to those described above, and other problems may exist.

According to a first aspect of the present invention for solving the above problems, a method performed by a zero UI-based automatic interpretation server that communicates with a plurality of terminal devices having a microphone function, a speaker function, a communication function, and a wearable function comprises: connecting terminal devices located within a predetermined automatic interpretation zone; receiving a voice signal of a first user from a first terminal device among the terminal devices in the automatic interpretation zone; matching a plurality of users located within the utterance reception range of the terminal device; and performing automatic interpretation of the voice signal and transmitting the automatic interpretation result to a second terminal device of at least one second user corresponding to the matching result.

A zero UI-based automatic interpretation server according to a second aspect of the present invention includes: a communication module that connects to and communicates with a plurality of terminal devices located within a predetermined automatic interpretation zone; a memory that registers and stores speaker information, including users' voice signals and languages used, received from the plurality of terminal devices, and that stores a program for providing an automatic interpretation function; and a processor that, by executing the program stored in the memory, upon receiving a voice signal from a terminal device in the automatic interpretation zone, matches a plurality of users located within the utterance reception range of the terminal device based on the voice signal and the speaker information, performs automatic interpretation of the voice signal, and transmits the automatic interpretation result through the communication module to the terminal device corresponding to the matching result.

A computer program according to another aspect of the present invention for solving the above problems is combined with a computer, which is hardware, to execute the zero UI-based automatic interpretation method, and is stored in a computer-readable recording medium.

Other specific details of the present invention are included in the detailed description and the drawings.

According to an embodiment of the present invention described above, the automatic interpretation terminal device is implemented as a wearable device, so no user interface such as a screen or button is needed to perform automatic interpretation. Automatic interpretation is therefore processed without unnecessary actions such as touching the terminal screen or clicking a button, which enables natural conversation between the user and the other party.

In addition, by supporting automatic connection between the server and the terminal devices and among the terminal devices within the automatic interpretation zone, unnecessary connection steps between users can be omitted.

Furthermore, through personalized voice detection and through speaker verification and identification of the user who actually spoke, it is possible to prevent a malfunction in which the speaker's voice is input to the other party's terminal device and that terminal device performs duplicate automatic interpretation.

Through these effects, the present invention enables natural automatic interpretation conversations based on a zero UI in face-to-face situations.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the description below.

FIG. 1 is an overall configuration diagram of a zero UI-based automatic interpretation system according to an embodiment of the present invention.
FIG. 2 is a block diagram of a terminal device according to an embodiment of the present invention.
FIG. 3 is a block diagram of an automatic interpretation server according to an embodiment of the present invention.
FIG. 4 is a flowchart of an automatic interpretation method according to an embodiment of the present invention.
FIG. 5 is a diagram for explaining a process of connecting terminal devices in an automatic interpretation zone.
FIG. 6 is a flowchart for explaining a process of matching users in an embodiment of the present invention.
FIG. 7 is an exemplary diagram for explaining a process of matching users in an embodiment of the present invention.

Advantages and features of the present invention, and methods of achieving them, will become clear with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only to make the disclosure of the present invention complete and to fully inform those of ordinary skill in the art to which the present invention belongs of the scope of the invention, and the present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, singular forms also include plural forms unless the context specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more elements other than those mentioned. Like reference numerals refer to like elements throughout the specification, and "and/or" includes each and every combination of one or more of the mentioned elements. Although "first", "second", and the like are used to describe various elements, these elements are not limited by these terms. The terms are used only to distinguish one element from another. Accordingly, a first element mentioned below may also be a second element within the technical spirit of the present invention.

Unless otherwise defined, all terms (including technical and scientific terms) used in this specification have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless explicitly and specifically defined.

FIG. 1 is an overall configuration diagram of a zero UI-based automatic interpretation system 1 according to an embodiment of the present invention. FIG. 2 is a block diagram of a terminal device according to an embodiment of the present invention. FIG. 3 is a block diagram of an automatic interpretation server 100 according to an embodiment of the present invention.

The automatic interpretation system 1 according to an embodiment of the present invention includes an automatic interpretation server 100 (hereinafter, the server), a first terminal device 200 of a first user, a second terminal device 300 of a second user, and a database 400.

Although FIG. 1 shows only the first and second terminal devices 200 and 300, the invention is not limited thereto, and there is no limit on the number of terminal devices. That is, a plurality of terminal devices may be connected to the server 100 to receive the automatic interpretation service.

The first user and the second user are users conversing through automatic interpretation. The first user speaks a first language, and the second user speaks a second language different from the first language.

In the following description, it is assumed that the first user utters speech in the first language through the first terminal device 200, and the second user receives, through the second terminal device 300, the automatic interpretation result in which the first language has been translated into the second language.

The roles of the first user and the second user are of course not fixed. As the conversation goes back and forth, each terminal device both captures utterances and receives automatic interpretation results.

The form of the first and second terminal devices 200 and 300 is not particularly limited, but each is preferably configured as a wearable device. That is, each of the first and second terminal devices 200 and 300 may be a single wearable device, or may be implemented as a user terminal (smartphone, tablet, etc.) connected to a wearable device by wire or wirelessly.

Referring to FIG. 2, the first and second terminal devices 200 and 300 may include voice collection units 210 and 310, communication units 220 and 320, and voice output units 230 and 330.

The voice collection units 210 and 310 collect each user's voice and may be devices with a high-performance microphone function. The voice collection units 210 and 310 convert the user's voice into a voice signal and pass it to the communication units 220 and 320.

The communication units 220 and 320 transmit the user's voice signal received from the voice collection units 210 and 310 to the server 100, which generates an automatic interpretation result based on the voice signal. The communication units 220 and 320 also provide the function of connecting to the server 100 or to other terminal devices.

The voice output units 230 and 330 receive the automatic interpretation result through the communication units 220 and 320 and deliver it to the user. The voice output units 230 and 330 are devices equipped with a speaker and may be implemented, for example, as earphones or a headset.

Referring to FIG. 3, the automatic interpretation server 100 according to an embodiment of the present invention includes a communication module 110, a memory 120, and a processor 130.

The communication module 110 connects to and communicates with a plurality of terminal devices 200 and 300 located within a predetermined automatic interpretation zone. That is, it receives voice signals and transmits automatic interpretation results through the communication unit of each terminal device 200 and 300.

The memory 120 registers and stores speaker information, including users' voice signals and languages used, received from the plurality of terminal devices 200 and 300, and stores a program for providing the automatic interpretation function.

By executing the program stored in the memory 120, the processor 130, upon receiving a voice signal from a terminal device in the automatic interpretation zone, matches a plurality of users located within the utterance reception range of the terminal devices 200 and 300 based on the voice signal and the speaker information. The processor 130 then performs automatic interpretation of the voice signal and transmits the automatic interpretation result through the communication module to the terminal device 200 or 300 corresponding to the matching result.

The server 100 must perform a process of registering information about each user and each terminal device 200 and 300 in the database 400. Through the terminal devices 200 and 300 or through a separate website or the like, the server 100 registers speaker embedding-based speaker information that can distinguish each user's own speaker characteristics. In doing so, the server 100 receives speaker information that includes the user's voice signal and language used.

The registered speaker information is later used for personalized voice detection, speaker verification, and speaker identification during automatic interpretation. Additional information such as the language used can also be used to connect automatic interpretation counterparts.

Hereinafter, a method performed by the automatic interpretation server 100 according to an embodiment of the present invention is described in more detail with reference to FIGS. 4 to 7.

FIG. 4 is a flowchart of an automatic interpretation method according to an embodiment of the present invention. FIG. 5 is a diagram for explaining a process of connecting terminal devices in an automatic interpretation zone.

First, the server 100 connects the terminal devices located within a predetermined automatic interpretation zone (S110).

Referring to FIG. 5, the server 100 activates a predetermined automatic interpretation zone (S111) and receives, from the terminal devices 200 and 300, the result of recognizing the activated automatic interpretation zone (S112). The server 100 then connects to the terminal devices 200 and 300 that have recognized the automatic interpretation zone (S113).

More specifically, when a user's terminal device 200 or 300 enters the automatic interpretation zone, or the automatic interpretation zone is activated, and the terminal device 200 or 300 recognizes the zone, the server 100 and the terminal device 200 or 300 can be connected to each other.
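Steps S111 to S113 can be sketched as a small state holder on the server side. This is an illustrative sketch only: the class `InterpretationZone`, its method names, and the device identifiers are hypothetical, and how a device actually recognizes a zone (WiFi, hotspot, BLE, and so on) is left out.

```python
class InterpretationZone:
    """Hypothetical server-side view of one automatic interpretation zone."""

    def __init__(self, zone_id: str) -> None:
        self.zone_id = zone_id
        self.active = False
        self.connected: set[str] = set()

    def activate(self) -> None:
        # S111: the server activates the zone.
        self.active = True

    def on_recognized(self, device_id: str) -> bool:
        # S112: a terminal device reports that it recognized the zone.
        # S113: the server connects it, but only while the zone is active.
        if not self.active:
            return False
        self.connected.add(device_id)
        return True

    def on_left(self, device_id: str) -> None:
        # Outside the zone the connection is no longer available.
        self.connected.discard(device_id)
```

A device that reports recognition before activation is simply not connected in this sketch; a real server might instead queue the report until the zone becomes active.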

The server 100 connects the terminal devices 200 and 300 of the users to be interpreted based on the language information included in the speaker information. The language information indicates which language each user speaks, and can be used to identify, for the user of the first terminal device 200, the language spoken by the counterpart, the second user of the second terminal device 300. Based on this language information, the server 100 can determine whether to connect the first terminal device 200 of the first user and the second terminal device 300 of the second user.
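One possible reading of this decision is that two devices are worth connecting only when their users speak different languages. The specification does not state the rule, so the sketch below is an assumption; the function name `plan_connections` and the device identifiers are hypothetical.

```python
from itertools import combinations


def plan_connections(languages: dict[str, str]) -> list[tuple[str, str]]:
    """Return device pairs whose registered languages differ.

    `languages` maps a device id to the language of its user.
    Assumed rule: same-language pairs need no interpretation.
    """
    pairs = []
    for a, b in combinations(sorted(languages), 2):
        if languages[a] != languages[b]:
            pairs.append((a, b))
    return pairs
```

With three devices registered as Korean, English, and Korean, only the two Korean-English pairs are planned, and the Korean-Korean pair is skipped.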

During the connection process, the server 100 may also inform the user which users have been connected or whether a connection has failed. In this case, the server 100 may provide, through each terminal device 200 and 300 or to each user terminal, a list of currently connected users or a list of users for whom the connection failed.

After checking the list, each user may, through the terminal device 200 or 300, request the server 100 to connect to another user who is not yet connected, or request release of an established connection with another user.

The automatic interpretation zone in an embodiment of the present invention is a zone in which, because the terminal devices 200 and 300 are located there, terminal devices 200 and 300 capable of interpretation can be connected automatically and automatic interpretation results for voice signals can be provided between the matched terminal devices.

That is, the automatic interpretation service is provided inside the automatic interpretation zone. Outside the zone, the connection between the terminal devices 200 and 300 and the server 100 is restricted, so the service is not provided.

In one embodiment, a zone designated by a specific coverage, for example a zone using a specific WiFi network, may be designated as the automatic interpretation zone. In another embodiment, a user may activate an automatic interpretation zone using a mobile hotspot. In yet another embodiment, the automatic interpretation zone may be activated through other communication means such as Bluetooth Low Energy (BLE). In some cases, users may be placed in the same automatic interpretation zone manually by activating a specific server address using a QR code, a smartphone message, or the like.

Next, when the server 100 receives the first user's voice signal from the first terminal device 200 among the connected terminal devices 200 and 300 in the automatic interpretation zone (S120), it matches a plurality of users located within the utterance reception range of the terminal device (S130).

FIG. 6 is a flowchart for explaining a process of matching users in an embodiment of the present invention. FIG. 7 is an exemplary diagram for explaining a process of matching users in an embodiment of the present invention.

In one embodiment, the server 100 determines whether the voice signal input through the first terminal device 200 is the first user's own voice signal (S131). The server 100 applies personalized voice detection (Personal Voice Activity Detection) to the voice signal to check whether the input voice is the first user's own voice.

This personalized voice detection process detects the speech section of the first user's voice signal in which actual speech is present, that is, it detects the start point and the end point of the actual speech.
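The start point and end point detection described above can be illustrated with a plain energy threshold. This is a deliberately simplified stand-in: it is ordinary energy-based endpoint detection, not personalized voice detection, and it does not decide whose voice is present. The function name, frame length, and threshold are assumptions chosen for the example.

```python
def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the voiced region, or None.

    A frame counts as voiced when its mean squared amplitude reaches
    `threshold`. `end` is exclusive. Trailing samples that do not fill
    a whole frame are ignored.
    """
    voiced_starts = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold:
            voiced_starts.append(i)
    if not voiced_starts:
        return None
    return voiced_starts[0], voiced_starts[-1] + frame_len
```

For a signal made of 320 silent samples, 480 samples at amplitude 0.5, and 160 silent samples, the function returns `(320, 800)`.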

If the server 100 determines that the input is noise rather than the first user's own voice, it ends the automatic interpretation process. If it determines that the input is the voice of another person (the second user) rather than the first user's own voice, it performs a speaker identification process for the second user with respect to the first terminal device 200.

If the server 100 determines that the input is the first user's own voice, it performs speaker verification with respect to the first terminal device 200, based on the first user's voice signal and the speaker information (S132).

In the speaker verification process, the server verifies that the speaker of the voice is the first user, based on the speaker embedding-based speaker information stored in the database 400. If the verification shows that the speaker is not the first user but another person (the second user) (S133-N), the server 100 classifies the voice as another person's and performs the speaker identification process described later.

Depending on the scenario policy, the server 100 may instead end its operation immediately when speaker verification fails.

By contrast, when the speaker is verified to be the first user (S133-Y), the server 100 checks whether the first terminal device 200 can be connected. If connection is not possible, it ends the process; if connection is possible, it attempts to connect. If the device is already connected, it maintains the connection (S134).
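The branching in S131 to S134 can be summarized as a small decision function. The sketch below only encodes the control flow described in the text; the enum `VoiceLabel`, the function names, and the returned step names are hypothetical, and the actual detection and verification models are outside its scope.

```python
from enum import Enum


class VoiceLabel(Enum):
    OWN = "own"      # the device owner's own voice
    OTHER = "other"  # another person's voice
    NOISE = "noise"  # no usable speech


def next_step(label, verified=False, stop_on_verify_failure=False):
    """Pick the next step after personalized voice detection (S131)."""
    if label is VoiceLabel.NOISE:
        return "end"
    if label is VoiceLabel.OTHER:
        return "identify_speaker"
    # Own voice: the outcome of speaker verification (S132, S133) decides.
    if verified:
        return "check_connection"  # S134
    # Scenario policy: stop at once, or fall back to speaker identification.
    return "end" if stop_on_verify_failure else "identify_speaker"


def connection_action(connectable, already_connected):
    """Connection handling once the speaker is confirmed (S134)."""
    if already_connected:
        return "keep"
    return "connect" if connectable else "end"
```

The `stop_on_verify_failure` flag models the scenario policy under which the server ends immediately when verification fails.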

Next, the server 100 performs automatic interpretation of the voice signal and transmits the automatic interpretation result, that is, synthesized speech in which the voice signal has been translated, to the second terminal device 300 of at least one second user corresponding to the matching result, and outputs the speech recognition result to the first user's terminal (S140).
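The routing in S140 can be sketched as follows, with the speech recognition, translation, and synthesis engines passed in as plain callables. Everything here is illustrative: `dispatch`, its parameters, and the `("text", ...)` / `("audio", ...)` payload shape are assumptions, and no particular engine or API is implied.

```python
def dispatch(audio, speaker_dev, matched_devs, languages, asr, translate, tts):
    """Build the per-device outputs for one utterance (S140).

    asr(audio, lang) -> text
    translate(text, src_lang, dst_lang) -> text
    tts(text, lang) -> synthesized audio
    """
    src = languages[speaker_dev]
    text = asr(audio, src)
    # The speaker's own terminal receives the speech recognition result.
    outputs = {speaker_dev: ("text", text)}
    # Each matched listener receives synthesized speech in their language.
    for dev in matched_devs:
        dst = languages[dev]
        outputs[dev] = ("audio", tts(translate(text, src, dst), dst))
    return outputs
```

Passing the engines as arguments keeps the routing logic testable with simple stand-in functions.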

Meanwhile, when the first user's voice signal is input to the second terminal device 300, the server 100 receives that voice signal through the second terminal device 300. In this case, before performing speaker identification, the server 100 determines whether the first user's voice signal input through the second terminal device 300 is a voice signal of someone other than the second user (S131).

If the server 100 determines that the voice signal input to the second terminal device 300 is not the second user's own voice but that of another person (the first user), it performs speaker identification with respect to the second terminal device 300, based on the first user's voice signal and the speaker information (S135). If speaker identification shows that the similarity between the received voice signal and a speaker embedding is at or above a preset threshold, the server identifies the speaker of the voice signal as the first user (S136).

The server 100 then checks whether the second terminal device 300 can be connected; if it cannot, the server terminates the process, and if it can, the server attempts the connection. If the device is already connected, the connection is maintained (S137).

In this way, the server 100 matches the first and second users corresponding to the first and second terminal devices 200 and 300 for which speaker verification and speaker identification have been completed.
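The identification and matching steps (S135, S136, and the matching that follows) amount to a 1:N search over enrolled speakers. The sketch below assumes cosine similarity and a 0.7 threshold, neither of which is stated in the description; `identify_speaker` and `match_users` are hypothetical names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify_speaker(utterance_emb, enrolled, threshold=0.7):
    """Return the enrolled user whose embedding best matches, or None.

    `enrolled` maps user id -> enrolled speaker embedding. A candidate is
    accepted only if its score is at or above the (hypothetical) threshold.
    """
    best_id, best_score = None, threshold
    for user_id, emb in enrolled.items():
        score = cosine_similarity(utterance_emb, emb)
        if score >= best_score:
            best_id, best_score = user_id, score
    return best_id


def match_users(listener_id, utterance_emb, enrolled):
    """Match the identified speaker with the owner of the listening device.

    The listener is excluded from the candidates because personalized voice
    detection has already classified the voice as another person's.
    """
    candidates = {uid: emb for uid, emb in enrolled.items() if uid != listener_id}
    speaker_id = identify_speaker(utterance_emb, candidates)
    return (speaker_id, listener_id) if speaker_id is not None else None
```

When no enrolled embedding reaches the threshold, `match_users` returns `None`, which corresponds to refusing the match for an unregistered speaker.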

Next, the server 100 performs automatic interpretation of the voice signal and transmits the automatic interpretation result, that is, synthesized speech of the translated voice signal, to the second terminal device 300 of at least one second user corresponding to the matching result (S140).

Referring to FIG. 7, user A, who has completed registration with the server 100, is automatically connected to the other users B, D, and E located in the automatic interpretation zone.

Assume that users A and B form one conversation pair and that users D and E, located far away, form another. Users A-B and users D-E are each connected to the server 100, and the terminal devices within each conversation pair are connected to each other.

However, because the pairs are outside each other's speech reception range, the voices of users A-B do not reach the terminal devices of users D-E, and likewise the voices of users D-E do not reach the terminal devices of users A-B, so the two pairs operate separately during automatic interpretation. In the example of FIG. 7, user A is described as the first user and user B as the second user.
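The separation of the two conversation pairs can be illustrated with a toy model. In the description, range is determined by whether a voice actually reaches another device's microphone; the 2-D coordinates, the 3-meter range, and the function `listeners_in_range` below are assumptions made only to make the idea concrete.

```python
import math


def listeners_in_range(speaker, positions, reception_range):
    """Return the users whose devices would pick up the speaker's voice.

    `positions` maps user id -> (x, y) in meters. This geometric model is a
    hypothetical stand-in for "the voice is input to the other device".
    """
    origin = positions[speaker]
    return sorted(
        user for user, pos in positions.items()
        if user != speaker and math.dist(origin, pos) <= reception_range
    )


# Pairs A-B and D-E stand far apart, as in the FIG. 7 scenario.
positions = {"A": (0.0, 0.0), "B": (1.0, 0.0), "D": (30.0, 0.0), "E": (31.0, 0.0)}
a_listeners = listeners_in_range("A", positions, 3.0)
d_listeners = listeners_in_range("D", positions, 3.0)
```

With these sample positions, user A's voice reaches only user B and user D's voice reaches only user E, so the two pairs are interpreted independently.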

If the conversation pair D-E is close to users A-B, so that each pair's voices are input to the other pair's terminal devices, their voices may also be treated as targets of automatic interpretation and delivered to the other conversation pair, just as speakers of the same native language at that distance would hear one another.

In this case, a user of a terminal device who notices that these voices are being delivered or connected, and who does not want to receive the automatic interpretation results because they are of little relevance, may disconnect and thereby exclude them from automatic interpretation.

With user A and user B connected in this way, when Korean-speaking user A (the first user) utters “안녕하세요” (“Hello”), the utterance is transmitted to the server 100 in real time through the terminal device.

The server 100 then performs personalized voice detection on user A's voice signal to detect, with respect to the first terminal device 200, whether it is user A's own voice, performs speaker verification on the detected voice to confirm that user A is the speaker, and then waits for a connection.

In the scenario of American user B (the second user), the server 100 attempts personalized voice detection on user A's voice and confirms that it is not user B's own voice but that of another person, user A. To determine whose voice it is, the server identifies user A as the speaker based on the speaker information registered in the database 400.

The terminal devices of user A and user B, for whom speaker verification and speaker identification have been completed, are then connected to each other; if they are already connected, the connection is maintained. The server 100 then automatically interprets user A's voice and delivers the translated synthesized speech to user B's terminal device, so that user B hears the automatic interpretation result “Hello”.
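The routing in this example, where the speaker's terminal receives the recognition result and each matched listener receives interpreted speech, can be sketched as below. Speech recognition, translation, and synthesis are outside the scope of the sketch: `TOY_TRANSLATIONS` is a one-entry lookup standing in for a real translation engine, and `User`, `translate`, and `route_interpretation` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class User:
    user_id: str
    language: str  # language registered in the speaker information


# Toy stand-in for a translation engine (assumption, not part of the source).
TOY_TRANSLATIONS = {("ko", "en", "안녕하세요"): "Hello"}


def translate(text, src, dst):
    """Look up a toy translation; unknown phrases are tagged, not translated."""
    if src == dst:
        return text
    return TOY_TRANSLATIONS.get((src, dst, text), f"[{src}->{dst}] {text}")


def route_interpretation(speaker, listeners, recognized_text):
    """Build the messages the server would send after recognizing an utterance."""
    # The speaker's own terminal receives the speech recognition result.
    outbox = {speaker.user_id: {"type": "recognition", "text": recognized_text}}
    # Each matched listener receives the utterance in their own language,
    # which a real system would deliver as synthesized speech.
    for listener in listeners:
        outbox[listener.user_id] = {
            "type": "synthesized_speech",
            "text": translate(recognized_text, speaker.language, listener.language),
        }
    return outbox


outbox = route_interpretation(User("A", "ko"), [User("B", "en")], "안녕하세요")
```

Here `outbox["B"]["text"]` holds the English rendering delivered to user B, while `outbox["A"]` holds the Korean recognition result shown to user A.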

Meanwhile, in an embodiment of the present invention, when a voice signal of a third user whose speaker information is not registered is input to the first and second terminal devices 200 and 300, the server 100 determines through the personalized voice detection process, with respect to the first and second terminal devices 200 and 300, whether the voice signal belongs to someone other than the first and second users.

The server 100 then compares the voice signal of the third user, determined to be another person, with the speaker information and performs speaker identification with respect to the first and second terminal devices 200 and 300. Because speaker identification shows that the third user is unregistered, the server 100 disallows matching of the third user with the first and second users.

An exception case in an embodiment of the present invention is described using the example of FIG. 7.

Suppose a Chinese non-user C, who is not connected to the automatic interpretation server 100, is within the speech reception range of users A-B, so that user C's voice is input to the terminal devices of users A and B. In the personalized voice detection process, the server 100 detects the voice as another person's with respect to both user A's and user B's terminal devices, and in the subsequent speaker identification process it decides, using the speaker information, that the voice is not to be automatically interpreted.

In the speaker identification process, user C's voice is compared with the speaker embeddings of the other users, and if the comparison result is below a preset threshold, the server 100 may stop automatic interpretation immediately.

As another exception case, if the speaker identification process finds a third user whose voice-signal similarity according to the speaker information is at or above a preset threshold, the server 100 determines whether the third user is located within the speech reception range of the first and second terminal devices 200 and 300. If the third user is located outside the speech reception range, the server disallows matching of the third user with the first and second users.

That is, when a third voice is input to the first and second terminal devices 200 and 300, the server 100 detects it as another person's voice through the personalized voice detection process and therefore performs the speaker identification process.

If a policy is in place that selects a user whose similarity is at or above the threshold, or that unconditionally selects one user registered in the speaker information, the speaker identification process may designate as the speaker an unrelated third user who is not a conversation partner.

To solve this problem, the server 100 determines that the third user's terminal device is located outside the speech reception range of the first and second terminal devices 200 and 300 and disallows matching with the first and second users.

Conversely, if the third user is located within the speech reception range, the first to third users are all matched and each is provided with automatic interpretation results for the others' voices.
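The two exception cases can be combined into a single decision, sketched below. How the server measures distance is not stated in the description, so `distance_m`, the 3-meter range, the 0.7 threshold, and the function name `decide_third_user` are assumptions.

```python
def decide_third_user(similarity, distance_m, threshold=0.7, reception_range_m=3.0):
    """Decide whether a third voice is matched with the first and second users.

    `similarity` is the best speaker-identification score for the third voice;
    `distance_m` is the (hypothetically measured) distance between the third
    user's device and the conversation pair.
    """
    if similarity < threshold:
        # Unregistered or unrecognized speaker: stop interpretation.
        return "reject_not_identified"
    if distance_m > reception_range_m:
        # A registered but unrelated user who merely sounds similar.
        return "reject_out_of_range"
    # Registered and within range: match all three users.
    return "match"
```

The range check is what keeps an enrolled but distant user from being selected merely because their embedding happens to score above the threshold.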

In the above description, steps S110 to S140 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may be omitted as needed, and the order of the steps may be changed. In addition, the content of FIGS. 1 to 3, even where omitted here, also applies to the methods of FIGS. 4 to 7.

The embodiment of the present invention described above may be implemented as a program (or application) to be executed in combination with a computer, which is hardware, and stored in a medium.

The program may include code written in a computer language such as C, C++, JAVA, Ruby, Python, or machine language that the processor (CPU) of the computer can read through the device interface of the computer, so that the computer reads the program and executes the methods implemented as the program. Such code may include functional code related to functions that define the functionality needed to execute the methods, and control code related to the execution procedure needed for the processor of the computer to execute those functions in a predetermined order. The code may further include memory-reference code indicating at which location (address) in the internal or external memory of the computer the additional information or media needed for the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to transmit and receive during communication.

The storage medium means a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data only briefly, such as a register, cache, or memory. Examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices. That is, the program may be stored on various recording media on various servers that the computer can access, or on various recording media on the user's computer. The medium may also be distributed over network-connected computer systems so that computer-readable code is stored in a distributed manner.

The foregoing description of the present invention is illustrative, and a person of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical idea or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

1: automatic interpretation system
100: automatic interpretation server
110: communication module
120: memory
130: processor
200: first terminal device
300: second terminal device
400: database

Claims

A zero UI-based automatic interpretation method performed by a zero UI-based automatic interpretation server communicating with a plurality of terminal devices having a microphone function, a speaker function, a communication function, and a wearable function, the method comprising:
connecting terminal devices located within a predetermined automatic interpretation zone;
receiving a voice signal of a first user from a first terminal device among the terminal devices within the automatic interpretation zone;
matching a plurality of users located within a speech reception range of the terminal device; and
performing automatic interpretation of the voice signal and transmitting the automatic interpretation result to a second terminal device of at least one second user corresponding to the matching result.

The method of claim 1,
wherein connecting the terminal devices located within the predetermined automatic interpretation zone comprises:
activating the predetermined automatic interpretation zone;
receiving a recognition result of the terminal device for the activated automatic interpretation zone; and
connecting with the terminal device that has recognized the automatic interpretation zone.

The method of claim 1,
further comprising receiving, from the plurality of terminal devices, speaker information including a user's voice signal and language used.

The method of claim 3,
wherein matching the plurality of users located within the speech reception range of the terminal device comprises:
performing speaker verification with respect to the first terminal device based on the voice signal of the first user and the speaker information;
performing speaker identification with respect to the second terminal device based on the voice signal of the first user and the speaker information; and
matching the users corresponding to the first and second terminal devices for which the speaker verification and the speaker identification have been completed.

The method of claim 4,
wherein matching the plurality of users located within the speech reception range of the terminal device further comprises,
before performing the speaker verification, determining whether the voice signal of the first user input through the first terminal device is the first user's own voice signal.

The method of claim 4,
wherein matching the plurality of users located within the speech reception range of the terminal device further comprises,
before performing the speaker identification, determining whether the voice signal of the first user input through the second terminal device is a voice signal of someone other than the second user.

The method of claim 4,
wherein matching the plurality of users located within the speech reception range of the terminal device comprises, when a voice signal of a third user whose speaker information is not registered is input to the first and second terminal devices:
determining, with respect to the first and second terminal devices, whether the voice signal belongs to someone other than the first and second users;
performing speaker identification with respect to the first and second terminal devices based on the voice signal of the third user and the speaker information; and
disallowing matching with the unregistered third user based on the speaker identification result.

The method of claim 3,
wherein matching the plurality of users located within the speech reception range of the terminal device comprises, when there is a third user whose voice-signal similarity according to the speaker information is at or above a preset threshold:
determining whether the third user is located within the speech reception range; and
disallowing matching of the third user with the first and second users when the third user is determined to be located outside the speech reception range.

A zero UI-based automatic interpretation server comprising:
a communication module that connects to and communicates with a plurality of terminal devices located within a predetermined automatic interpretation zone;
a memory that stores a program for registering and storing speaker information, including a user's voice signal and language used, received from the plurality of terminal devices, and for providing an automatic interpretation function; and
a processor that, by executing the program stored in the memory, upon receiving a voice signal from a terminal device within the automatic interpretation zone, matches a plurality of users located within a speech reception range of the terminal device based on the voice signal and the speaker information, performs automatic interpretation of the voice signal, and transmits the automatic interpretation result through the communication module to a terminal device corresponding to the matching result.

The server of claim 9,
wherein the processor activates the predetermined automatic interpretation zone and, upon receiving through the communication module a recognition result of a terminal device for the activated automatic interpretation zone, connects with the terminal device that has recognized the automatic interpretation zone.

The server of claim 9,
wherein the processor performs speaker verification with respect to a first terminal device corresponding to a first user based on a voice signal of the first user and the speaker information, performs speaker identification with respect to a second terminal device based on the voice signal of the first user and the speaker information, and then matches the users corresponding to the first and second terminal devices for which the speaker verification and the speaker identification have been completed.

The server of claim 11,
wherein the processor, before performing the speaker verification, determines whether the voice signal of the first user input through the first terminal device is the first user's own voice signal.

The server of claim 11,
wherein the processor, before performing the speaker identification, determines whether the voice signal of the first user input through the second terminal device is a voice signal of someone other than the second user.

The server of claim 11,
wherein, when a voice signal of a third user whose speaker information is not registered is input to the first and second terminal devices, the processor determines, with respect to the first and second terminal devices, whether the voice signal belongs to someone other than the first and second users, performs speaker identification with respect to the first and second terminal devices based on the voice signal of the third user and the speaker information, and disallows matching with the unregistered third user based on the speaker identification result.

The server of claim 9,
wherein, when there is a third user whose voice-signal similarity according to the speaker information is at or above a preset threshold, the processor determines whether the third user is located within the speech reception range and, if the third user is determined to be located outside the speech reception range, disallows matching of the third user with the first and second users.