KR102384165B1

KR102384165B1 - Real-time two-way communication method and system through video and audio data transmission and reception in virtual space

Info

Publication number: KR102384165B1
Application number: KR1020200129544A
Authority: KR
Inventors: 허민강
Original assignee: 주식회사 이너테인먼트
Priority date: 2019-10-08
Filing date: 2020-10-07
Publication date: 2022-04-07
Also published as: KR20210042028A

Abstract

The present invention relates to a method and system for real-time two-way communication through transmission and reception of video and audio data in a virtual space, which enable a host and users to communicate while video or content provided by the host is being played, and which allow the host to communicate with a plurality of users without missing any of their communications. The disclosed configuration comprises: a host module that outputs the video and sound of virtual reality content to the host and outputs the voice of at least one user; a control server that transmits the video and sound of the virtual reality content provided by the host to at least one user module, and receives the user's voice data and transmits it to the host module; and at least one user module that outputs the video and sound of the virtual reality content provided by the host to the user and transmits the voice input by the user to the control server.

Description

REAL-TIME TWO-WAY COMMUNICATION METHOD AND SYSTEM THROUGH VIDEO AND AUDIO DATA TRANSMISSION AND RECEPTION IN VIRTUAL SPACE

The present invention relates to a method and system for real-time two-way communication through transmission and reception of video and audio data in a virtual space. More specifically, it relates to a method and system that enable a host and users to communicate while video or content provided by the host is being played, and that allow the host to communicate with a plurality of users without missing any of them.

A virtual space created artificially by humans is also called artificial reality, cyberspace, or a virtual world. In games, the first field to which virtual reality techniques were applied, the player appears as the protagonist within a three-dimensionally constructed scene and solves problems. Virtual reality has since been introduced and actively applied in many fields: in medicine it is used for surgical and anatomical practice, and in aviation and the military it is used for flight training.

Meanwhile, a technology that mixes virtual reality (VR) with augmented reality (AR), which adds virtual information to the real world, is called mixed reality (MR). VR, AR, and MR have in common that they render a reality that does not actually exist so that a person can perceive it. They differ in that AR adds virtual information to actual reality, whereas VR presents an entirely fictional situation. MR combines AR and VR to present a mixture of real and virtual information against a real background, which requires technology capable of processing large volumes of data.

A head mounted display (HMD) is a display device that a user wears on the head for a VR experience. Recent HMDs block the user's view of the outside and show the virtual world in the user's field of view, making the user feel as if actually inside that world. An HMD is worn on the face so that the display sits in front of the eyes, and it carries a microphone, stereo speakers, and various sensors. A device that mounts a smartphone in a VR headset and uses the smartphone's panel is called a dive.

As VR technology advances, various attempts to use VR are being made. With the coronavirus outbreak ushering in an era in which contactless ("untact") technology is important, there is a need for technology that provides contactless services such as seminars, shopping, education, and meetings using VR.

Accordingly, the present invention is intended to solve the problems described above, and aims to provide a method and system that enable two-way communication between a host and users while content provided by the host is shared in a virtual space.

The present invention also aims to provide a method and system in which, during two-way communication between a host and users in a virtual space, the host can communicate without missing the voices produced by a plurality of users.

To solve the above problems, a system for real-time two-way communication through transmission and reception of video and audio data in a virtual space according to an embodiment of the present invention may include: a host module that outputs the video and sound of virtual reality content to a host and outputs the voice of at least one user; a control server that transmits the video and sound of the virtual reality content provided by the host to at least one user module, and receives the user's voice data and transmits it to the host module; and at least one user module that outputs the video and sound of the virtual reality content provided by the host to the user and transmits the voice input by the user to the control server.

According to an embodiment of the present invention, the host module may output the user's avatar in the virtual reality content and, when the user's voice is output, output a visual effect on the avatar of the user who input the voice.

According to an embodiment of the present invention, the control server may individually receive the users' voice data from a plurality of the user modules, convert each user's voice into text using that voice data, and transmit the text to at least one of the host module and the user module; the host module and the user module may output the text on the video.

According to an embodiment of the present invention, the host module may delete, from among the plurality of output texts, the text selected by the host.

According to an embodiment of the present invention, the user module may modify and delete the text according to the user's input, output the result of the modification and deletion on the content video, and transmit the result to the control server; the control server may transmit the result to the host module; and the host module may reflect the result in what it outputs to the host.

To solve the above problems, a method for real-time two-way communication through transmission and reception of video and audio data in a virtual space according to an embodiment of the present invention may include: inputting virtual reality content to be provided by a host; outputting the video and sound of the virtual reality content to the host and at least one user; and outputting the voice input by the user to the host.

According to an embodiment of the present invention, the step of outputting the video and sound of the virtual reality content may output the user's avatar in the virtual reality content, and the step of outputting the voice input by the user to the host may, when the user's voice is output, output a visual effect on the avatar of the user who input the voice.

According to an embodiment of the present invention, the step of outputting the voice input by the user to the host may individually receive the voice data of a plurality of the users, convert each user's voice into text, and output the text on the video.

According to an embodiment of the present invention, the method may further include modifying and deleting the text according to the user's input and outputting the result of the modification and deletion on the content video.

According to an embodiment of the present invention, the method may further include deleting, from among the plurality of output texts, the text selected by the host.

According to the present invention, two-way communication between a host and users is possible while content provided by the host is shared in a virtual space.

In addition, the invention can provide a communication method and system with which the host, while communicating bidirectionally with a plurality of users, can meet every user's request without missing any.

The effects of the present invention are not limited to those mentioned above, and may include various effects within the scope apparent to a person skilled in the art from the description below.

FIG. 1 is a conceptual diagram illustrating a system for real-time two-way communication through transmission and reception of video and audio data in a virtual space.
FIG. 2 is a block diagram of a system for real-time two-way communication through transmission and reception of video and audio data in a virtual space according to an embodiment of the present invention.
FIG. 3 is a flowchart of a method for real-time two-way communication through transmission and reception of video and audio data in a virtual space according to an embodiment of the present invention.
FIG. 4 is an example of an STT artificial intelligence learning network of the control server according to an embodiment of the present invention.

Hereinafter, the "method and system for real-time two-way communication through transmission and reception of video and audio data in a virtual space" according to the present invention will be described in detail with reference to the accompanying drawings. The embodiments described are provided so that those skilled in the art can readily understand the technical idea of the invention, and the invention is not limited by them. In addition, the accompanying drawings are schematic, drawn to make the embodiments easy to explain, and may differ from the forms actually implemented.

The components described below are merely examples for implementing the present invention. Accordingly, other implementations of the invention may use other components without departing from its spirit and scope.

Each component may be implemented purely in hardware or purely in software, or as a combination of various hardware and software components that perform the same function. Two or more components may also be implemented together by a single piece of hardware or software.

The expression "including" certain components is open-ended: it merely indicates that those components are present and should not be understood as excluding additional components.

FIG. 1 is a conceptual diagram illustrating a system for real-time two-way communication through transmission and reception of video and audio data in a virtual space, FIG. 2 is a block diagram of such a system according to an embodiment of the present invention, FIG. 3 is a flowchart of a method for real-time two-way communication through transmission and reception of video and audio data in a virtual space according to an embodiment of the present invention, and FIG. 4 is an example of an STT artificial intelligence learning network of the control server according to an embodiment of the present invention.

Referring to FIGS. 1 to 4, a system 1 for real-time two-way communication through transmission and reception of video and audio data in a virtual space according to an embodiment of the present invention may include a host module 100, a control server 200, and a user module 300. The host module 100 or the user module 300 may be included as a plurality of separate units. The host module 100 or the user module 300 may allow a plurality of hosts 10 or a plurality of users 20 to communicate bidirectionally in real time within the same virtual reality (VR) content.

The system 1 may provide content to the host 10 and the user 20. The content may include video and sound, and may include virtual reality (VR) content and augmented reality (AR) content. The content may be selected or designated by the host 10 and provided to the host 10 and the user 20 through the host module 100 and the user module 300. Within the virtual reality (VR), the host 10 and the user 20 may communicate with each other bidirectionally in real time. The host module 100, the control server 200, and the user module 300 of the system 1 may exchange information over a network.

Here, a network means a connection structure that allows information to be exchanged among nodes such as a plurality of terminals and servers. Examples of such a network include, but are not limited to, RF, a 3GPP (3rd Generation Partnership Project) network, an LTE (Long Term Evolution) network, a 5GPP (5th Generation Partnership Project) network, a WiMAX (Worldwide Interoperability for Microwave Access) network, the Internet, a LAN (Local Area Network), a wireless LAN (Wireless Local Area Network), a WAN (Wide Area Network), a PAN (Personal Area Network), a Bluetooth network, an NFC network, a satellite broadcasting network, an analog broadcasting network, and a DMB (Digital Multimedia Broadcasting) network.

The host module 100 may output the video and sound of the virtual reality content to the host 10 and output the voice of at least one user 20. The voice of the user 20 may be input at the user module 300 and transmitted via the control server 200.

The host module 100 may include a camera and a microphone, and may include a screen and a speaker. The host module 100 may output the video and sound of the content to the host 10. The host module 100 may capture video of the host 10 and transmit that video to the control server 200. The host module 100 may include one of a PC, a mobile device, and an IoT device of the host 10. The host module 100 may be a head mounted display (HMD) worn by the host 10. The host module 100 may output the avatar of the user 20 on the virtual reality content video.

When the voice of the user 20 is output, the host module 100 may output a visual effect on the avatar of the user who input the voice. For example, when any one of the plurality of users 20 in the virtual reality content inputs a voice to the user module 300, the user module 300 transmits the voice data to the control server 200, the control server 200 transmits the voice data to the host module 100, and the host module 100 outputs the voice data as sound. The host module 100 may display, on the virtual reality content, as many avatars as there are user modules 300 currently connected through the control server 200. That is, the avatars may correspond one-to-one to the user modules 300. The avatar may be generated by the control server 200, which transmits its data to the host module 100, or may be generated by the host module 100. The avatar may also be output at the user module 300. When the voice of the user 20 is output, the host module 100 may output a visual effect on the avatar corresponding to the user module 300 that transmitted the voice. The visual effect may include a change in the avatar's brightness or lightness, an icon, a caption, blinking, and the like, but is not limited to these and may include any visual effect whose change a person can perceive. The host module 100 may output text.
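The one-to-one correspondence between avatars and connected user modules, and the highlighting of the speaking user's avatar, can be illustrated with a minimal Python sketch. The class and method names (`AvatarBoard`, `voice_started`, and so on) are hypothetical and do not appear in the specification; the rendering of the visual effect itself is left out.

```python
class AvatarBoard:
    """Tracks one avatar per connected user module (hypothetical helper)."""

    def __init__(self):
        # user_id -> True while that user's voice is being output
        self._speaking = {}

    def connect(self, user_id):
        # A newly connected user module gets exactly one avatar.
        self._speaking[user_id] = False

    def disconnect(self, user_id):
        self._speaking.pop(user_id, None)

    def voice_started(self, user_id):
        # Mark the avatar that should receive the visual effect.
        if user_id in self._speaking:
            self._speaking[user_id] = True

    def voice_stopped(self, user_id):
        if user_id in self._speaking:
            self._speaking[user_id] = False

    def avatar_count(self):
        return len(self._speaking)

    def highlighted(self):
        # Avatars on which the host module would currently draw the effect.
        return sorted(uid for uid, on in self._speaking.items() if on)
```

In this sketch the host module would call `voice_started` when it begins playing a user's voice data and redraw the avatars returned by `highlighted()` with whatever effect it uses (brightness change, icon, blinking).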

The control server 200 may relay the data exchanged between the host module 100 and the user module 300. The control server 200 may connect to an external server, fetch external data, and transmit it to the host module 100 and the user module 300. When the host 10 transmits selected content through the host module 100, the control server 200 may transmit that content to the user module 300. When a link is transmitted from the host module 100, the control server 200 may receive the data included in the link and transmit it to the host module 100 and the user module 300.

The control server 200 may transmit the video and sound of the virtual reality content provided by the host 10 to at least one user module 300, and may receive the voice data of the user 20 and transmit it to the host module 100.
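The relay role described here can be sketched in Python as follows. This is only an in-memory illustration under assumed names (`HostModule`, `UserModule`, `ControlServer`, `broadcast_content`, `relay_voice`); the specification does not define a transport, message format, or API, and a real deployment would carry these messages over one of the networks listed above.

```python
from dataclasses import dataclass, field


@dataclass
class HostModule:
    """Stands in for host module 100; records the voice it would play."""
    played_voice: list = field(default_factory=list)

    def output_voice(self, user_id, voice):
        self.played_voice.append((user_id, voice))


@dataclass
class UserModule:
    """Stands in for user module 300; records the content it would show."""
    user_id: str
    received_content: list = field(default_factory=list)

    def output_content(self, frame):
        self.received_content.append(frame)


class ControlServer:
    """Stands in for control server 200: content goes out, voice comes back."""

    def __init__(self, host):
        self.host = host
        self.users = {}  # user_id -> UserModule

    def connect(self, user):
        self.users[user.user_id] = user

    def broadcast_content(self, frame):
        # Host-provided video/sound is sent to every connected user module.
        for user in self.users.values():
            user.output_content(frame)

    def relay_voice(self, user_id, voice):
        # Voice data is received per user and forwarded to the host module.
        if user_id not in self.users:
            raise KeyError(f"unknown user module: {user_id}")
        self.host.output_voice(user_id, voice)
```

Keeping the user identifier alongside the voice data is what lets the host side attribute each utterance to a specific user module.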

The control server 200 may individually receive the voice data of the users 20 from the plurality of user modules 300. The control server 200 may convert the voice of the user 20 into text using the voice data of the user 20, and transmit the text to at least one of the host module 100 and the user module 300. The host module 100 and the user module 300 may output the text.

The user module 300 may output the video and sound of the virtual reality content provided by the host 10 to the user 20, and transmit the voice input by the user 20 to the control server 200.

The user module 300 may modify and delete the text converted from the voice of the user 20, according to the input of the user 20. The user module 300 may output the result of the modification and deletion on the content video. The user module 300 may transmit the result of the modification and deletion to the control server 200. The control server 200 may transmit the result to the host module 100. The host module 100 may reflect the result in what it outputs to the host.

The host module 100 may delete, from among the plurality of output texts, the text selected by the host.

According to an embodiment of the present invention, the host module 100 or the user module 300 may output the text converted from the voice of the user 20. Because the text is output, the host 10 can respond to the inquiries of a plurality of users without missing any. The user 20 may check the output text and select and delete text to which the host 10 does not need to respond. The user 20 may also check the output text and, if an error occurred when the voice of the user 20 was converted into text, correct it. The host 10 may check the output text and respond to the request of the user 20 (for example, a question or an order). The host 10 may select and delete text to which the host has finished responding.
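The text workflow in this paragraph (users correct or withdraw their own transcribed text, the host removes items already handled) can be sketched as a small in-memory board. The names `Transcript`, `TranscriptBoard`, `user_edit`, `user_delete`, and `host_delete` are hypothetical, and the rule that a user may only change text converted from that user's own voice is an assumption read from the description rather than a stated requirement.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Transcript:
    entry_id: int
    user_id: str
    text: str


class TranscriptBoard:
    """Holds the texts shown to the host (hypothetical helper)."""

    def __init__(self):
        self._entries = {}   # entry_id -> Transcript, in display order
        self._ids = count(1)

    def add(self, user_id, text):
        entry = Transcript(next(self._ids), user_id, text)
        self._entries[entry.entry_id] = entry
        return entry.entry_id

    def _owned(self, user_id, entry_id):
        entry = self._entries[entry_id]
        if entry.user_id != user_id:
            # Assumption: users may only change their own text.
            raise PermissionError("text belongs to another user")
        return entry

    def user_edit(self, user_id, entry_id, new_text):
        # Fix a speech-to-text error in the user's own entry.
        self._owned(user_id, entry_id).text = new_text

    def user_delete(self, user_id, entry_id):
        # Withdraw text the host does not need to respond to.
        self._owned(user_id, entry_id)
        del self._entries[entry_id]

    def host_delete(self, entry_id):
        # The host clears an item after responding to it.
        self._entries.pop(entry_id, None)

    def visible(self):
        return [(e.user_id, e.text) for e in self._entries.values()]
```

In the system described, each change made at a user module would be sent to the control server and forwarded to the host module, which would then redraw from the equivalent of `visible()`.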

The control server 200 may use speech-to-text (STT) technology to extract text (character) data through the voice interface of the user 20.

From an acoustic perspective, the control server 200 may use environmental data such as the speaker, the space, and noise, and from a linguistic perspective it may use language data for modeling vocabulary, context, and grammar, to convert the voice of the user 20 into text. The control server 200 may perform the conversion through an offline training stage, which generates a recognition network model from speech and language data, and an online search stage, which recognizes the speech uttered by the user. The control server 200 may convert the voice of the user 20 into text using speech and language data it already holds. In the decoding stage, the control server 200 may use the acoustic model, language model, and pronunciation lexicon resulting from the training stage to compare the input feature vectors against the models, score them, and finally determine the word sequence.
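The scoring step of the decoding stage can be illustrated with a minimal Python sketch. It assumes that each candidate word sequence already carries an acoustic-model log-probability and a language-model log-probability, and that the two are combined as a weighted sum; this log-linear combination is a common convention in speech recognition and is an assumption here, not something the specification states. The function name and the `lm_weight` parameter are hypothetical.

```python
def best_word_sequence(candidates, lm_weight=1.0):
    """Pick the candidate with the highest combined score.

    candidates: list of (words, acoustic_logprob, language_logprob) tuples.
    lm_weight:  hypothetical weight applied to the language-model score.
    """
    if not candidates:
        raise ValueError("no candidate word sequences to score")
    best = max(candidates, key=lambda c: c[1] + lm_weight * c[2])
    return best[0]
```

For example, a candidate that matches the audio slightly worse can still win if the language model finds it much more plausible as a sentence.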

The control server 200 may perform acoustic modeling by generating representative patterns, as a probabilistic model, of the acoustic characteristics of pronunciation in each phonological environment of the language, and may perform language modeling by statistically learning the grammar system with respect to usage in that language, such as word choice and sentence-level syntactic structure. To build the pronunciation lexicon, the control server 200 may implement grapheme-to-phoneme conversion, which converts text into the way it is pronounced. Because pronunciation conversion rules that target standard pronunciation alone may fail to reflect the varied patterns of dialects and of users' speech habits and tones, the control server 200 may build a separate dictionary.
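The combination of rule-based grapheme-to-phoneme conversion with a separate dictionary for non-standard pronunciations can be sketched as below. The rule table, the phoneme symbols, and the function name `to_phonemes` are toy assumptions for illustration; real conversion rules (for Korean in particular) depend on context and are far richer than a longest-match substitution.

```python
def to_phonemes(word, exception_lexicon, rules):
    """Convert a word to phonemes.

    exception_lexicon: word -> phoneme list, consulted first (the separate
                       dictionary for dialects or speaker habits).
    rules:             grapheme string -> phoneme, applied longest-match first.
    """
    word = word.lower()
    if word in exception_lexicon:
        return list(exception_lexicon[word])

    graphemes = sorted(rules, key=len, reverse=True)
    phonemes, i = [], 0
    while i < len(word):
        for g in graphemes:
            if word.startswith(g, i):
                phonemes.append(rules[g])
                i += len(g)
                break
        else:
            # No rule covers the character at position i.
            raise ValueError(f"no conversion rule for {word[i]!r}")
    return phonemes
```

The exception lexicon is checked before the rules so that an entry added for a particular dialect or speaker overrides the standard conversion.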

The control server 200 may be based on adaptive training of an acoustic model enhanced by deep learning. The control server 200 may convert the voice of the user 20 into text based on a fully connected deep neural network (DNN) and a convolutional neural network (CNN).

The control server 200 may analyze the voice data of the user 20 with a CNN to extract phonetic features. Using the extracted phonetic features, the control server 200 may divide the voice of the user 20 into word-level segments. The control server 200 may hold data trained on the phonetic features of each word or morpheme, and may update that data.

The control server 200 may estimate the words segmented from the voice data of the user 20 based on the phonetic features of words or morphemes. For each word segmented from the voice data of the user 20, the control server 200 may provisionally determine as the primary word the word with the highest probability according to those phonetic features.
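Selecting the primary word for each segment amounts to taking the most probable candidate per segment. A minimal sketch, assuming the phonetic model has already produced a probability for each candidate word in each segment (the input format and the function name `primary_words` are hypothetical):

```python
def primary_words(segment_scores):
    """Return the highest-probability word for each segment.

    segment_scores: list with one dict per segment, mapping each candidate
                    word to its probability from the phonetic model.
    """
    # max() raises ValueError on an empty dict, i.e. a segment with no candidates.
    return [max(scores, key=scores.get) for scores in segment_scores]
```

The result is the sequence of primary words, chosen on phonetic evidence alone and before any sentence-level check.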

The control server 200 may hold pronunciation-similarity group data, which contains, for a given word, the words with similar pronunciation together with their degrees of similarity. That is, for a specific word, the pronunciation-similarity group data may list the words that sound similar to it in order of similarity. The control server 200 may also hold sentence-association data, trained on the probability that words appear before or after one another in a sentence.

The control server 200 may arrange the primary words in order. Based on the sentence association data, the control server 200 may analyze the probability that each primary word is used in a sentence together with the words before and after it. The control server 200 may then determine any primary word that has a low association probability with its preceding and following words as a word to be corrected.

The control server 200 may replace the word to be corrected with words from its pronunciation similarity group whose similarity is greater than or equal to a specific probability. According to an embodiment of the present invention, the specific probability may be 80%. Among the replacement words, the control server 200 may determine a word whose association probability is greater than or equal to a threshold value as the secondary word. When a plurality of replacement words have an association probability greater than or equal to the threshold value, the control server 200 may determine the one with the highest association probability as the secondary word. When no replacement word has an association probability greater than or equal to the threshold value, the control server 200 may determine the primary word as the secondary word. According to an embodiment of the present invention, the threshold value may be 50%.
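The selection of the secondary word described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `choose_secondary_word`, the dictionary layout of the similarity group, and the `association` callback are all assumptions introduced here. Only the 80% and 50% values come from the embodiment.

```python
from typing import Callable, Dict, List, Tuple

SIMILARITY_MIN = 0.80   # "specific probability" named in the embodiment
ASSOCIATION_MIN = 0.50  # association threshold named in the embodiment


def choose_secondary_word(
    primary: str,
    context: List[str],
    similar: Dict[str, List[Tuple[str, float]]],
    association: Callable[[str, List[str]], float],
) -> str:
    """Pick the secondary word for one word that was flagged for correction.

    `similar` maps a word to (candidate, similarity) pairs and stands in for
    the pronunciation similarity group data (hypothetical layout).
    `association` returns the association probability of a candidate with
    the surrounding words (hypothetical callback).
    """
    # Candidates: similarly pronounced words with high enough similarity.
    candidates = [w for w, sim in similar.get(primary, []) if sim >= SIMILARITY_MIN]
    # Keep only candidates whose association probability reaches the threshold.
    passing = [(association(w, context), w) for w in candidates]
    passing = [(p, w) for p, w in passing if p >= ASSOCIATION_MIN]
    if not passing:
        # No suitable replacement: the primary word becomes the secondary word.
        return primary
    # Several candidates pass: take the one with the highest association.
    return max(passing, key=lambda pw: pw[0])[1]
```

Passing the association function in as a parameter keeps the sketch independent of how the association probability is actually computed.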

The control server 200 may calculate the association probability according to Equation 1 below.

[Equation 1]

Here, H is a probability function, S is a word, and W_m is the m-th word before or after S.

The word substituted for S may be either a primary word or a replacement word.

According to an embodiment of the present invention, after the probability or the threshold value is set, the control server 200 may set n in Equation 1 to 5 or more. When n was 4 or less, there was no significant difference in accuracy from existing STT engines; when n was 5 or more, the error rate shown by existing STT engines was confirmed to be reduced by 50% or more.

When the user 20 corrects the text, the control server 200 may use the corrected result to re-learn and update the phonetic features of the word or morpheme. When the user 20 corrects the text, the control server 200 may also re-learn and update the sentence association data.
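One way to picture the update of the sentence association data is as counts that grow each time a corrected sentence arrives. The sketch below is an assumption made for illustration: it uses adjacent word-pair frequencies as a simple stand-in and does not reproduce Equation 1. The class and method names are hypothetical.

```python
from collections import Counter
from typing import List


class SentenceAssociation:
    """Toy stand-in for the sentence association data (adjacent-pair counts)."""

    def __init__(self) -> None:
        self.pair_counts: Counter = Counter()  # (previous word, next word) -> count
        self.prev_counts: Counter = Counter()  # word seen with a successor -> count

    def learn(self, words: List[str]) -> None:
        # Called with a sentence, e.g. the text after the user corrected it.
        self.prev_counts.update(words[:-1])
        self.pair_counts.update(zip(words, words[1:]))

    def follows(self, prev: str, nxt: str) -> float:
        # Relative frequency of `nxt` directly after `prev`; 0.0 if `prev` is unseen.
        total = self.prev_counts[prev]
        return self.pair_counts[(prev, nxt)] / total if total else 0.0
```

Each corrected sentence shifts the estimated frequencies, so a word pair the user keeps restoring gradually scores higher than the pair the recognizer first guessed.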

A real-time two-way communication method through video and audio data transmission and reception in a virtual space according to an embodiment of the present invention may include a step (S110) of inputting virtual reality content to be provided by the host.

In step S110, the real-time two-way communication system 1 through video and audio data transmission and reception in a virtual space may provide content to the host 10 and the user 20. The content may include video and sound. The content may include virtual reality (VR) content and augmented reality (AR) content. The content may be selected or designated by the host 10 and provided to the host 10 and the user 20 through the host module 100 and the user module 300. Within the virtual reality (VR), the host 10 and the user 20 may communicate with each other in real time and in both directions. The host module 100, the control server 200, and the user module 300 of the system 1 may exchange information over a network.

The real-time two-way communication method through video and audio data transmission and reception in a virtual space according to an embodiment of the present invention may include a step (S120) of outputting the video and sound of the virtual reality content to the host and to at least one user.

In step S120, the host module 100 may include a camera and a microphone, as well as a screen and a speaker. The host module 100 may output the video and sound of the content to the host 10. The host module 100 may capture the host 10 on camera and transmit video of the host 10 to the control server 200. The host module 100 may include one of a PC, a mobile device, and an IoT device of the host 10. The host module 100 may be a head mounted display (HMD) worn by the host 10. The host module 100 may output the avatar of the user 20 on the video of the virtual reality content.

In step S120, the control server 200 may relay data exchanged between the host module 100 and the user module 300. The control server 200 may be connected to an external server, retrieve external data, and transmit it to the host module 100 and the user module 300. When content selected by the host 10 is transmitted through the host module 100, the control server 200 may transmit the content to the user module 300. When a link is transmitted from the host module 100, the control server 200 may receive the data included in the link and transmit it to the host module 100 and the user module 300.
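The relay role of the control server can be sketched as a small in-memory model. This is only an illustration under simplifying assumptions: the class `ControlServer`, its inbox lists, and the method names are hypothetical, and a real system would use network connections rather than Python lists.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ControlServer:
    """In-memory model of the relay between one host module and many user modules."""

    host_inbox: List[Tuple[str, bytes]] = field(default_factory=list)
    user_inboxes: Dict[str, List[bytes]] = field(default_factory=dict)

    def connect_user(self, user_id: str) -> None:
        # Register a user module so it starts receiving content.
        self.user_inboxes.setdefault(user_id, [])

    def relay_content(self, frame: bytes) -> None:
        # Content chosen by the host is forwarded to every connected user module.
        for inbox in self.user_inboxes.values():
            inbox.append(frame)

    def relay_voice(self, user_id: str, audio: bytes) -> None:
        # Voice data is kept per sender so the host module knows who spoke.
        self.host_inbox.append((user_id, audio))
```

Tagging each voice packet with its sender is what later allows the host module to tie a voice to a particular avatar.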

The real-time two-way communication method through video and audio data transmission and reception in a virtual space according to an embodiment of the present invention may include a step (S130) of outputting the voice input by the user to the host.

In step S130, the host module 100 may output the video and sound of the virtual reality content to the host 10, and may output the voice of at least one user 20. The voice of the user 20 may be input at the user module 300 and transmitted through the control server 200.

In step S130, when the voice of the user 20 is output, the host module 100 may output a visual effect on the avatar of the user who input the voice. For example, when any one of a plurality of users 20 in the virtual reality content inputs a voice to the user module 300, the user module 300 may transmit the voice data to the control server 200, the control server 200 may transmit the voice data to the host module 100, and the host module 100 may output the voice data as sound. The host module 100 may display, on the virtual reality content, as many avatars as there are user modules 300 currently connected through the control server 200. That is, the avatars may correspond one-to-one to the user modules 300. The avatar may be generated by the control server 200, which transmits its data to the host module 100, or may be generated by the host module 100. The avatar may also be output by the user module 300. When the voice of the user 20 is output, the host module 100 may output a visual effect on the avatar corresponding to the user module 300 that transmitted the voice. The visual effect may include a change in the brightness or lightness of the avatar, an icon, a caption, blinking, and the like, but is not limited thereto, and may include any visual effect that lets a person perceive a change. The host module 100 may output text.
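The one-to-one mapping between avatars and user modules, and the effect applied to the speaker's avatar, can be sketched as below. The names `Avatar`, `HostScene`, `sync`, and `on_voice` are hypothetical, and a boolean `highlighted` flag stands in for whatever visual effect (brightness change, icon, caption, blinking) is actually rendered.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable


@dataclass
class Avatar:
    user_module_id: str
    highlighted: bool = False  # stand-in for any visible speaking effect


@dataclass
class HostScene:
    avatars: Dict[str, Avatar] = field(default_factory=dict)

    def sync(self, connected_ids: Iterable[str]) -> None:
        # Keep exactly one avatar per currently connected user module.
        self.avatars = {
            uid: self.avatars.get(uid, Avatar(uid)) for uid in connected_ids
        }

    def on_voice(self, speaker_id: str) -> None:
        # Highlight only the avatar whose user module sent the voice.
        for uid, avatar in self.avatars.items():
            avatar.highlighted = uid == speaker_id
```

A host module built this way would call `sync` whenever the set of connected user modules changes and `on_voice` whenever a voice packet is played.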

In step S130, the control server 200 may transmit the video and sound of the virtual reality content provided by the host 10 to at least one user module 300, and may receive the voice data of the user 20 and transmit it to the host module 100.

In step S130, the control server 200 may receive the voice data of the users 20 individually from the plurality of user modules 300. The control server 200 may use the voice data of the user 20 to convert the voice of the user 20 into text, and may transmit the text to at least one of the host module 100 and the user module 300. The host module 100 and the user module 300 may output the text.

In step S130, the user module 300 may output the video and sound of the virtual reality content provided by the host 10 to the user 20, and may transmit the voice input by the user 20 to the control server 200.

In step S130, according to an embodiment of the present invention, the host module 100 or the user module 300 may output the text converted from the voice of the user 20. Because the text is output, the host 10 can respond to the inquiries and other requests of a plurality of users without missing any.

In step S130, the control server 200 may use speech-to-text (STT) technology to extract text (character) data through the voice interface of the user 20.

In step S130, the control server 200 may convert the voice of the user 20 into text by using, from an acoustic point of view, environmental data such as the speaking user, the space, and noise, and, from a linguistic point of view, language data for modeling vocabulary, context, grammar, and the like. The control server 200 may convert the voice of the user 20 into text through an offline learning step, which generates a recognition network model from voice and language data, and an online search step, which recognizes the voice uttered by the user. The control server 200 may convert the voice of the user 20 into text using voice and language data it already holds. In the decoding step, the control server 200 may use the acoustic model, the language model, and the pronunciation lexicon produced by the learning step to compare the input feature vector against the models, score it, and finally determine the word sequence.
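The scoring in the decoding step can be illustrated with a common practice in speech recognition: adding the acoustic log-probability of a candidate word sequence to a weighted language-model log-probability and keeping the best total. The text does not specify this formula, so the sketch is an assumption; `Hypothesis`, `pick_word_sequence`, and `lm_weight` are hypothetical names, and the scores are supplied directly rather than computed from real models.

```python
from typing import List, NamedTuple


class Hypothesis(NamedTuple):
    words: List[str]
    acoustic_logp: float  # log-probability from the acoustic model (assumed given)
    language_logp: float  # log-probability from the language model (assumed given)


def pick_word_sequence(hyps: List[Hypothesis], lm_weight: float = 1.0) -> List[str]:
    """Return the word sequence with the highest combined score."""
    if not hyps:
        raise ValueError("no hypotheses to score")
    best = max(hyps, key=lambda h: h.acoustic_logp + lm_weight * h.language_logp)
    return best.words
```

With `lm_weight` set to zero the choice depends on the acoustic score alone, which shows how the language model can overturn an acoustically plausible but unlikely word sequence.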

In step S130, the control server 200 may perform acoustic modeling by generating representative patterns, as a probabilistic model, of the acoustic characteristics of pronunciation in each phonological environment of the language, and may perform language modeling by statistically learning the grammar of the language with respect to usage such as vocabulary choice and sentence-level syntactic structure. To build the pronunciation dictionary, the control server 200 may implement grapheme-to-phoneme conversion, which converts text into the way it is pronounced. Because pronunciation conversion rules aimed at standard pronunciation alone may not capture the various patterns arising from dialects or a user's speech habits and tone, the control server 200 may build a separate dictionary.
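The idea of a separate dictionary that sits alongside rule-based grapheme-to-phoneme conversion can be sketched as a lookup that takes priority over the rules. Everything here is a toy assumption: the function `to_phonemes`, the per-character rule table, and the sample entries are hypothetical and far simpler than real pronunciation rules.

```python
from typing import Dict, List


def to_phonemes(
    word: str,
    rules: Dict[str, str],
    exceptions: Dict[str, List[str]],
) -> List[str]:
    """Convert a word to phoneme symbols, preferring the exception dictionary."""
    # Dialect or habit-specific pronunciations override the standard rules.
    if word in exceptions:
        return list(exceptions[word])
    # Otherwise apply the per-character rule, leaving unknown characters as-is.
    return [rules.get(ch, ch) for ch in word]
```

Keeping the exceptions in their own table means new dialect entries can be added without touching the standard rules.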

In step S130, the control server 200 may be based on adaptive training of an acoustic model enhanced by deep learning. The control server 200 may convert the voice of the user 20 into text based on a fully connected deep neural network (DNN) and a convolutional neural network (CNN).

In step S130, the control server 200 may analyze the voice data of the user 20 with a CNN to extract phonetic features. Using the extracted phonetic features, the control server 200 may segment the voice of the user 20 into word-level sections. The control server 200 may hold data learned from the phonetic features of each word or morpheme, and may update that data.

In step S130, based on the phonetic features of words or morphemes, the control server 200 may estimate each word segmented from the voice data of the user 20. For each segmented word, the control server 200 may select the word with the highest probability according to those phonetic features and determine it as the primary word.

In step S130, the control server 200 may hold pronunciation similarity group data, which contains, for a specific word, the words with similar pronunciation together with their degrees of similarity. That is, for a given word, the pronunciation similarity group data may be a list of similarly pronounced words ordered by degree of similarity. The control server 200 may also hold sentence association data, learned from the probability that words appear before or after one another in a sentence.

In step S130, the control server 200 may arrange the primary words in order. Based on the sentence association data, the control server 200 may analyze the probability that each primary word is used in a sentence together with the words before and after it. The control server 200 may then determine any primary word that has a low association probability with its preceding and following words as a word to be corrected.

In step S130, the control server 200 may replace the word to be corrected with words from its pronunciation similarity group whose similarity is greater than or equal to a specific probability. According to an embodiment of the present invention, the specific probability may be 80%. Among the replacement words, the control server 200 may determine a word whose association probability is greater than or equal to a threshold value as the secondary word. When a plurality of replacement words have an association probability greater than or equal to the threshold value, the control server 200 may determine the one with the highest association probability as the secondary word. When no replacement word has an association probability greater than or equal to the threshold value, the control server 200 may determine the primary word as the secondary word. According to an embodiment of the present invention, the threshold value may be 50%.

In step S130, the control server 200 may calculate the association probability according to Equation 1 below.

[Equation 1]

In step S130, according to an embodiment of the present invention, after the probability or the threshold value is set, the control server 200 may set n in Equation 1 to 5 or more. When n was 4 or less, there was no significant difference in accuracy from existing STT engines; when n was 5 or more, the error rate shown by existing STT engines was confirmed to be reduced by 50% or more.

In step S130, when the user 20 corrects the text, the control server 200 may re-learn and update the sentence association data.

The real-time two-way communication method through video and audio data transmission and reception in a virtual space according to an embodiment of the present invention may include a step (S140) of correcting and deleting the text according to the user's input and outputting the result of the correction and deletion on the content video.

In step S140, the user module 300 may correct and delete the text converted from the voice of the user 20 according to the input of the user 20. The user module 300 may output the result of the correction and deletion on the content video. The user module 300 may transmit the result of the correction and deletion to the control server 200. The control server 200 may transmit the result of the correction and deletion to the host module 100. The host module 100 may reflect the result of the correction and deletion in what it outputs to the host.

In step S140, according to an embodiment of the present invention, the user 20 may check the output text and select and delete text that the host 10 does not need to respond to. The user 20 may also check the output text and correct it if an error occurred while the voice of the user 20 was being converted into text.
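The way a correction or deletion made at the user module reaches the host module can be sketched as a single edit record applied to both views. The names `TextEdit`, `apply_edit`, and `propagate`, and the use of dictionaries keyed by a text ID, are assumptions for illustration; in the described system the control server sits between the two modules and forwards the result.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TextEdit:
    text_id: int
    new_text: Optional[str]  # None means the text is deleted


def apply_edit(texts: Dict[int, str], edit: TextEdit) -> None:
    """Apply one correction or deletion to a view of the displayed texts."""
    if edit.new_text is None:
        texts.pop(edit.text_id, None)       # deletion
    elif edit.text_id in texts:
        texts[edit.text_id] = edit.new_text  # correction of an existing text


def propagate(edit: TextEdit, user_view: Dict[int, str], host_view: Dict[int, str]) -> None:
    # The user module updates its own display first; the same result is then
    # forwarded (through the control server in the real system) to the host module.
    apply_edit(user_view, edit)
    apply_edit(host_view, edit)
```

Sending the same edit record to both sides keeps the host's display consistent with what the user sees.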

The real-time two-way communication method through video and audio data transmission and reception in a virtual space according to an embodiment of the present invention may include a step (S150) of deleting, from among the plurality of output texts, the text selected by the host.

In step S150, the host module 100 may delete, from among the plurality of output texts, the text selected by the host.

In step S150, according to an embodiment of the present invention, the host 10 may check the output text and respond to the request of the user 20 (for example, a question or an order). The host 10 may select and delete the text it has finished responding to.
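The host-side deletion of handled texts amounts to filtering the list of pending texts by the IDs the host selected. The function name `delete_handled` and the ID-keyed dictionary are hypothetical choices made for this sketch.

```python
from typing import Dict, Iterable


def delete_handled(pending: Dict[int, str], selected_ids: Iterable[int]) -> Dict[int, str]:
    """Return the texts still awaiting a response after the host's selection is removed."""
    selected = set(selected_ids)
    return {tid: text for tid, text in pending.items() if tid not in selected}
```

Returning a new dictionary instead of mutating the input makes it easy to keep the previous state, for instance if the host module later needs to undo a deletion.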

The present invention has been described above with a focus on its preferred embodiments. Those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be implemented in modified forms without departing from its essential characteristics. The disclosed embodiments should therefore be considered in an illustrative rather than a restrictive sense. The scope of the present invention is defined by the claims rather than by the foregoing description, and all differences within a scope equivalent to the claims should be construed as being included in the present invention.

Claims

a host module for outputting images and sounds of virtual reality content to a host and outputting at least one user's voice;
a control server that transmits the video and sound of the virtual reality content provided by the host to at least one user module, receives the user's voice data, and transmits it to the host module; and
At least one user module for outputting the image and sound of the virtual reality content provided by the host to the user and transmitting the voice input by the user to the control server;
The host module is
outputting the avatar of the user to the virtual reality content, and outputting a visual effect to the avatar of the user who has inputted the voice when the voice of the user is output;
The control server,
Receive the user's voice data individually from the plurality of user modules,
converts the user's voice into text using the user's voice data, and transmits it to at least one of the host module and the user module;
The host module and the user module output the text to the image,
The control server extracts phonetic features from the user's voice data through a deep learning model, and determines a word with the highest probability as a primary word according to the phonetic feature,
The control server holds pronunciation similarity group data, which includes the degree of similarity of words having a pronunciation similar to a specific word, and sentence association data, learned from the probability that words are used before and after one another in a sentence,
The control server determines, based on the sentence association data, a primary word having a low association probability with the words before and after it as a word to be corrected, and replaces the primary word with words having a similarity greater than or equal to a certain probability in the pronunciation similarity group of the word to be corrected; a real-time two-way communication system through video and audio data transmission and reception in a virtual space.

delete

The system of claim 1,
The host module is
A real-time two-way communication system through video and audio data transmission/reception in a virtual space, characterized in that the text selected by the host is deleted from among the plurality of outputted texts.

5. The system of claim 4,
The user module is
Correcting and deleting the text according to the user's input, outputting the correction and deletion results to the image of the content, and transmitting the correction and deletion results to the control server,
The control server,
Transmitting the modification and deletion results to the host module,
The host module is
A real-time two-way communication system through video and audio data transmission/reception in a virtual space, characterized in that the result of the correction and deletion is reflected and output to the host.

selecting virtual reality content to be provided to a host module and a user module through the host module;
outputting the image and sound of the virtual reality content by using the host module in a host, and outputting the image and sound of the virtual reality content to at least one or more users by using the user module; and
outputting the voice input by the user through the user module to the host through the host module;
The control server transmits the image and sound of the virtual reality content provided by the host to the at least one user module, receives the user's voice data, and transmits it to the host module,
The step of outputting the image and sound of the virtual reality content,
outputting the user's avatar to the virtual reality content;
The step of outputting the voice input by the user to the host,
When the user's voice is output, a visual effect is output to the avatar of the user who has input the voice,
The step of outputting the voice input by the user to the host,
the control server individually receives the plurality of user's voice data, converts the user's voice into text, and outputs the text to an image;
The control server extracts phonetic features from the user's voice data through a deep learning model, and determines a word with the highest probability as a primary word according to the phonetic feature,
The control server holds pronunciation similarity group data, which includes the degree of similarity of words having a pronunciation similar to a specific word, and sentence association data, learned from the probability that words are used before and after one another in a sentence,
The control server determines, based on the sentence association data, a primary word having a low association probability with the words before and after it as a word to be corrected, and replaces the primary word with words having a similarity greater than or equal to a certain probability in the pronunciation similarity group of the word to be corrected; a real-time two-way communication method through video and audio data transmission and reception in a virtual space.

delete

7. The method of claim 6,
further comprising: correcting and deleting the text according to the user's input, and outputting the result of the correction and deletion to the image of the content.

10. The method of claim 9,
The real-time two-way communication method through video and audio data transmission/reception in a virtual space, further comprising: deleting the text selected by the host from among the plurality of outputted texts.