KR100521310B1

KR100521310B1 - Method and apparatus for supplying RIP-sync Avata Service by Multimodal

Info

Publication number: KR100521310B1
Application number: KR10-2003-0069169A
Authority: KR
Inventors: 나동원; 엄봉수; 김경민
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2003-10-06
Filing date: 2003-10-06
Publication date: 2005-10-14
Also published as: KR20050033200A

Abstract

A method and apparatus for providing a lip-sync avatar using a multimodal interface are disclosed. The method comprises: inputting a user's voice into a mobile communication terminal; receiving, at a multimodal server, the user's voice in analog form, converting it to digital data and filtering it, and, in conjunction with a page-special grammar database of a web server, outputting voice data in which the current URL is combined with a special grammar; receiving the voice data at a voice recognition server, extracting features of the voice signal, and selecting an avatar corresponding to the voice signal; and reading the selected avatar from an avatar server and transmitting it to the mobile communication terminal via the multimodal server. According to the invention, the avatar's lip-syncing overcomes the monotony and lack of realism of a static avatar, and the display unit can serve its proper function during a call made with an earphone.

Description

Method and apparatus for providing a lip-sync avatar using a multimodal interface {Method and apparatus for supplying RIP-sync Avata Service by Multimodal}

The present invention relates to a method and apparatus for providing a lip-sync avatar using a multimodal interface, and more particularly to a method and apparatus that display an avatar by means of the multimodal interface and have the avatar lip-sync according to a voice message.

An avatar is an animated character that stands in for a person connected to a network. Avatars are used in e-mail and chat, and their use is expanding to cyber shopping malls, virtual education, and virtual offices. An avatar links the real world to virtual space and sits somewhere between anonymity and real-name identity. Because avatars combine the comfort of cyberspace anonymity with the advantage of having a character of one's own, a variety of avatar-based services are offered.

Avatars are generally provided either as a combination of several characters or in already completed form. As graphics technology has improved, however, users can also build avatars with distinct personalities from ready-made parts prepared by the service provider, creating "my own avatar".

However, when a user checks an e-mail or chats with such an avatar attached, only text and a flat, fixed image of the avatar are displayed, which is monotonous and lacks realism. Requiring the sender to have video-capture equipment such as a camera would not be a sensible way to overcome this monotony and lack of realism.

Meanwhile, with the recent rapid progress of information and communication technology, multimedia has advanced, high-speed information networks have been built, and mail-order sales, customer management, logistics processing, and product promotion over multimedia communication have increased markedly. As personal computers and mobile phones spread ever faster, research on dialogue between humans and machines is becoming more important.

The World Wide Web (WWW) is based on HTML, the standard language for presenting information on a personal computer (PC) screen, and on HTTP, the standard protocol for delivering that information. As the Web has spread rapidly, anyone with a web browser can easily access the vast amount of information on the Internet. A web browser, however, requires a system with a relatively large screen, a keyboard, and a mouse, such as a PC, so accessing information on the Internet has not been easy from terminals such as mobile phones or PDAs (Personal Digital Assistants) that have only a small screen and a limited keypad. To solve this problem, new standards such as WAP (Wireless Application Protocol) and WML (Wireless Markup Language) were proposed with the characteristics of portable terminals in mind. Even with WML, however, the small screen of a portable device cannot show much information at once, and button input is difficult when the hands are occupied, for example while driving.

The VoiceXML standard was therefore proposed so that content on the Internet can be used by voice alone, without a screen-based user interface such as a display, keyboard, or mouse. VoiceXML standardizes the voice input/output format so that it can be used easily in a distributed environment such as the Internet. By using speech recognition to recognize one word out of many at a time, it makes it possible, for example, to select a name from a personal phone book or a company name in a stock-trading service simply by saying the name.

Meanwhile, when voice message services for mobile phones first began, the service consisted simply of indicating that a voice message had arrived and letting the user listen to the recording. Gradually, personal avatars came to accompany voice messages, and users now express their individuality through these avatars.

However, when an earphone is used during a call, and especially during a long call, the displayed avatar remains still, so the display unit of the mobile phone loses its proper function. To make up for such shortcomings of voice-only or video-only interfaces, research is under way on multimodal interfaces, that is, interfaces that use one or more interface methods simultaneously.

Accordingly, to remedy the shortcomings described above, an object of the present invention is to provide a method and apparatus for providing a lip-sync avatar using a multimodal interface, in which an avatar is displayed by means of the multimodal interface and lip-syncs according to a voice message.

In other words, the present invention aims to deliver voice over a data channel using the multimodal interface and, on the accompanying screen, to provide a kind of IMS (Instant Messaging Service) in which a user's own avatar can be shown to both the user and the other party.

To achieve the above object, a method for providing a lip-sync avatar using a multimodal interface comprises: a first step of inputting a user's voice into a mobile communication terminal; a second step in which a multimodal server receives the user's voice in analog form, converts it to digital data and filters it, and, in conjunction with a page-special grammar database of a web server, outputs voice data in which the current URL is combined with the special grammar; a third step in which a voice recognition server receives the voice data, extracts features of the voice signal, and selects an avatar corresponding to the voice signal; and a fourth step of reading the selected avatar from an avatar server and transmitting it to the mobile communication terminal via the multimodal server.
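The four steps can be pictured as a chain of three server functions. The following is only a minimal sketch under stated assumptions: the function names, the `GRAMMAR_DB` and `AVATAR_DB` dictionaries, the 8-bit quantization, the noise-gate filter, and the amplitude-based selection rule are all hypothetical stand-ins, since the patent does not specify them (its preferred feature extraction is a wavelet transform).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

# Hypothetical stand-ins for the page-special grammar database and avatar store.
GRAMMAR_DB: Dict[str, List[str]] = {"http://example.com/menu": ["news", "stocks"]}
AVATAR_DB: Dict[str, str] = {
    "mouth_open": "frame:mouth_open",
    "mouth_closed": "frame:mouth_closed",
}


@dataclass
class VoiceData:
    """Result of the second step: digitized voice plus the URL's grammar."""
    url: str
    grammar: List[str]
    samples: List[int]


def multimodal_server(analog: Sequence[float], url: str) -> VoiceData:
    # Convert analog samples in [-1.0, 1.0] to integers in [-127, 127].
    digital = [max(-127, min(127, round(x * 127))) for x in analog]
    # Assumed filter: a noise gate that zeroes very quiet samples.
    filtered = [s if abs(s) >= 4 else 0 for s in digital]
    # Combine the current URL with the grammar registered for that page.
    return VoiceData(url, GRAMMAR_DB.get(url, []), filtered)


def voice_recognition_server(voice: VoiceData) -> str:
    # Assumed feature: mean absolute amplitude of the filtered samples.
    energy = sum(abs(s) for s in voice.samples) / max(1, len(voice.samples))
    return "mouth_open" if energy > 20 else "mouth_closed"


def avatar_server(avatar_id: str) -> str:
    # Read the selected avatar out of the avatar store.
    return AVATAR_DB[avatar_id]


def provide_lip_sync_avatar(analog: Sequence[float], url: str) -> str:
    voice = multimodal_server(analog, url)        # second step
    avatar_id = voice_recognition_server(voice)   # third step
    return avatar_server(avatar_id)               # fourth step
```

Keeping each server as a separate function mirrors the division of roles in the text: the multimodal server only prepares data, the voice recognition server only selects, and the avatar server only reads out.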

Preferably, the features of the voice signal are extracted using a wavelet transform. Before the first step, the method further comprises: receiving a URL from the user and connecting, through a WAP gateway, to a web server that provides avatars; transmitting the avatar requested by the user to the mobile communication terminal through the WAP gateway; and configuring the transmitted avatar, including selecting the user's own avatar, choosing whether the avatar is to run, and setting an avatar for each party.

Meanwhile, an apparatus for providing a lip-sync avatar using a multimodal interface to achieve the object of the invention comprises: a mobile communication terminal; a multimodal server that receives voice, converts it into data, and outputs it in order to provide the lip-sync avatar; a voice recognition server that applies a voice recognizer to the converted voice data and outputs the corresponding avatar; and an avatar server that provides the corresponding avatar.

For configuring the avatar, the apparatus further comprises a web server that handles avatar settings and performs web searches corresponding to the voice, and a WAP gateway that converts between the web server and the mobile communication terminal.

Hereinafter, a preferred embodiment of the present invention is described with reference to the accompanying drawings.

FIG. 1 schematically shows a system for providing a lip-sync avatar using a multimodal interface according to an embodiment of the present invention. As shown in FIG. 1, the system comprises a mobile communication terminal (1); a multimodal server (2) that receives voice, converts it into data, and outputs it in order to provide the lip-sync avatar; a voice recognition server (3) that applies a voice recognizer to the converted voice data and outputs the corresponding avatar; and an avatar server (4) that provides that avatar to the voice recognition server (3). The system further includes a web server (5) that handles avatar settings and performs web searches corresponding to the voice, and a WAP gateway (6) that converts between the web server (5) and the mobile communication terminal (1). The web server (5) is additionally provided with a page-special grammar database (7).

The process by which the avatar runs in the system configured as above is briefly described.

When a call is established between the user and the other party, the avatar preset by the user, or the avatar assigned to the other party, is displayed on the screen. The voice that the user of the mobile communication terminal (1) inputs during the call is then sent to the multimodal server (2) (S1), which converts the voice into data (S2). The voice data is passed to the voice recognition server (3) (S3), where a voice recognizer in the server selects the corresponding avatar (S4). The avatar server (4) then reads out that avatar and transmits it to the user's mobile communication terminal (1), which displays the avatar corresponding to the voice (S5 to S7). As this process repeats continuously, the avatar is displayed as if it were lip-syncing.
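The lip-sync effect comes from repeating the select-and-display cycle on successive pieces of the voice. As a rough sketch of that repetition, the code below splits digitized voice into fixed-size frames and picks one avatar image per frame. The frame size, the peak-amplitude rule, and the names `split_frames`, `mouth_for_frame`, and `lip_sync_sequence` are assumptions made for illustration; the patent does not state how the voice is segmented.

```python
from typing import Callable, Iterator, List, Sequence


def split_frames(samples: Sequence[int], frame_size: int) -> Iterator[Sequence[int]]:
    """Yield consecutive chunks of the digitized voice (the last may be shorter)."""
    for start in range(0, len(samples), frame_size):
        yield samples[start:start + frame_size]


def mouth_for_frame(frame: Sequence[int], threshold: int = 20) -> str:
    # Assumed rule: a loud frame shows an open mouth, a quiet one a closed mouth.
    peak = max((abs(s) for s in frame), default=0)
    return "open" if peak >= threshold else "closed"


def lip_sync_sequence(
    samples: Sequence[int],
    frame_size: int,
    select: Callable[[Sequence[int]], str] = mouth_for_frame,
) -> List[str]:
    # One avatar image per frame; shown in order, the images animate the mouth.
    return [select(frame) for frame in split_frames(samples, frame_size)]
```

For example, `lip_sync_sequence([0, 0, 90, -80, 0, 1], 2)` yields one image name for each of the three two-sample frames, and a different selection rule can be passed in through `select`.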

Meanwhile, the user can change the settings of a designated avatar by connecting to the web server (5) via the WAP gateway (6) (S8 to S11).

If the user speaks with the intention of searching while the avatar is lip-syncing, the multimodal server (2) works with the web server (5) to output the corresponding WAP menu to the user's mobile communication terminal (1). This can be done as follows: the multimodal server (2), linked with the web server (5), passes the search utterance to the voice recognition server (3); the recognition result is used to search the web server (5) via the WAP gateway (6); and the result is converted into a WAP menu and displayed on the user's mobile communication terminal (1) (S12).

FIG. 2 is a flowchart of a method for providing a lip-sync avatar using a multimodal interface according to an embodiment of the present invention. Referring to FIG. 2, the process of creating one's own avatar over the Internet is described first. The user connects to the web server (5) using the mobile communication terminal (1). This embodiment assumes the WAP approach, which passes through the WAP gateway (6), so the mobile communication terminal (1) carries a browser that supports wireless Internet according to the standard chosen by the mobile carrier. That is, a URL (Uniform Resource Locator) is received from the user (S20), and the terminal connects through the WAP gateway (6) to the web server (5), which provides documents written in HDML (Handheld Device Markup Language) (S22). The web server (5) delivers the requested HDML document to the WAP gateway (6), so that the mobile communication terminal (1), equipped with an HDML browser, can receive the content provided by the web server (5) (S24). Through the mobile communication terminal (1), the user then selects an avatar of their own and sends information such as whether the avatar is to run and the per-party avatar settings to the web server (5). The web server (5) stores and retains the information received from the user (S26).
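The settings stored in step S26 (the user's own avatar, whether the avatar runs, and an avatar per party) could be held in a small record such as the one below. This is an illustrative sketch only: the class names, field names, and fallback behaviour are assumptions, and the patent does not describe the storage format.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class AvatarSettings:
    """Hypothetical record of what the user sends to the web server."""
    own_avatar: str                      # the user's own avatar
    enabled: bool = True                 # whether the avatar is to run
    per_party: Dict[str, str] = field(default_factory=dict)  # avatar per party

    def avatar_for(self, party: str) -> Optional[str]:
        # Nothing is shown when the avatar is switched off.
        if not self.enabled:
            return None
        # Assumed fallback: use the user's own avatar if the party has none.
        return self.per_party.get(party, self.own_avatar)


class WebServerStore:
    """In-memory stand-in for the web server keeping settings per user."""

    def __init__(self) -> None:
        self._by_user: Dict[str, AvatarSettings] = {}

    def save(self, user: str, settings: AvatarSettings) -> None:
        self._by_user[user] = settings

    def load(self, user: str) -> Optional[AvatarSettings]:
        return self._by_user.get(user)
```

When a call starts, looking up `avatar_for(party)` then covers both cases the description mentions: showing the avatar preset by the user or the one assigned to the other party.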

Afterwards, when a call is established between the user and the other party, the avatar preset by the user is displayed. The voice input by the user is then encoded with EVRC (Enhanced Variable Rate Speech Codec) and transmitted to the multimodal server (2) together with the current URL (S28). EVRC is software that can markedly improve the sound quality of digital mobile phones, which have a transmission rate of 8 kbps under the 8K QCELP scheme currently used in mobile communication systems. It reduces noise arising during a call and automatically boosts the speaker's voice when noise is severe, reproducing the human voice close to the original sound. The multimodal server (2) receives the voice signal, converts the analog voice signal into digital voice data, and filters it. Various kinds of analog-to-digital (A/D) converters can be used for this. The voice, reproduced close to the original sound, is thus converted into data in the multimodal server (2). Then, in conjunction with the page-special grammar database (7) of the web server (5), the voice data in which the current URL is combined with the special grammar is transmitted to the voice recognition server (3) (S30). The voice recognition server (3) analyzes the input voice data. Feature parameter extraction methods for voice signals in common use to date include feature vectors based on the cepstrum, LSF, filter banks (FB), and wavelets. Of these, the wavelet transform is used to extract the feature parameters of the voice data, and the avatar corresponding to them is looked up (S32). The avatar server (4) then reads out that avatar and transmits it to the user's mobile communication terminal (1), which displays the avatar corresponding to the voice (S34 to S38).
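To make the wavelet-based step concrete, the sketch below computes a small feature vector with a Haar wavelet decomposition and picks the avatar whose stored template is nearest. The patent names the wavelet transform but not the wavelet family, the number of levels, the feature definition, or the matching rule, so the Haar basis, the sub-band energies, the Euclidean distance, and the `TEMPLATES` table are all assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def haar_step(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One level of the orthonormal Haar transform (an odd trailing sample is dropped)."""
    scale = 1.0 / math.sqrt(2.0)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * scale for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * scale for i in pairs]
    return approx, detail


def wavelet_features(signal: Sequence[float], levels: int = 3) -> List[float]:
    """Energy of each detail band, finest first, then the final approximation energy."""
    features: List[float] = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        features.append(sum(d * d for d in detail))
    features.append(sum(a * a for a in approx))
    return features


# Hypothetical templates: one feature vector per avatar image.
TEMPLATES: Dict[str, List[float]] = {
    "closed": [0.0, 0.0, 0.0, 0.0],
    "open_wide": [8.0, 0.0, 0.0, 0.0],
    "rounded": [0.0, 0.0, 0.0, 8.0],
}


def select_avatar(features: Sequence[float],
                  templates: Dict[str, List[float]] = TEMPLATES) -> str:
    # Choose the template with the smallest Euclidean distance to the features.
    return min(templates, key=lambda name: math.dist(features, templates[name]))
```

Because the Haar transform used here is orthonormal, the band energies add up to the energy of the input frame when its length is a multiple of 2 to the power `levels`, which makes the features easy to sanity-check.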

Meanwhile, if the user speaks with the intention of searching while the avatar is lip-syncing (S40), the multimodal server (2), working with the page-special grammar database (7) of the web server (5), passes the WAP menu corresponding to the input voice to the voice recognition server (3) (S42 to S44). The voice recognition server (3) then outputs the URL corresponding to the recognition result to the user's mobile communication terminal (1) via the multimodal server (2) (S46). In this way, the multimodal server (2), linked with the web server (5), passes the search utterance to the voice recognition server (3); the recognition result is used to search the web server (5) via the WAP gateway (6); and the result is converted into a WAP menu and displayed on the user's mobile communication terminal (1).
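One way to read the page-special grammar database is as a per-page table of the words that may be spoken on that page, each mapped to the URL it leads to. The sketch below follows that reading; the table layout, the example URLs, and the name `resolve_voice_command` are assumptions, since the patent does not define the database schema.

```python
from typing import Dict, Optional

# Hypothetical contents: for each page URL, the spoken words valid on that page
# and the URL each word leads to.
PAGE_GRAMMAR_DB: Dict[str, Dict[str, str]] = {
    "http://example.com/menu": {
        "news": "http://example.com/news",
        "stocks": "http://example.com/stocks",
    },
}


def resolve_voice_command(
    current_url: str,
    recognized: str,
    db: Dict[str, Dict[str, str]] = PAGE_GRAMMAR_DB,
) -> Optional[str]:
    """Return the URL for a recognized word, or None if it is not in the page's grammar."""
    grammar = db.get(current_url, {})
    return grammar.get(recognized.strip().lower())
```

Restricting recognition to the words registered for the current URL keeps the vocabulary small, which is the usual reason for attaching a grammar to a page.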

The present invention is not limited to the embodiment described above, and it will be apparent that those of ordinary skill in the art can make many modifications within the technical spirit of the invention.

As described above, the method and apparatus for providing a lip-sync avatar using a multimodal interface according to the present invention overcome the monotony and lack of realism of a static avatar through the avatar's lip-syncing, and allow the display unit to serve its proper function during a call made with an earphone. In addition, with a simple system configuration, users can effectively convey their appearance and emotions through the lip-sync avatar.

FIG. 1 schematically shows a system for providing a lip-sync avatar using a multimodal interface according to an embodiment of the present invention.

FIG. 2 is a flowchart of a method for providing a lip-sync avatar using a multimodal interface according to an embodiment of the present invention.

* Explanation of reference numerals for the main parts of the drawings *

1: mobile communication terminal 2: multimodal server

3: voice recognition server 4: avatar server

5: web server 6: WAP gateway

7: page-special grammar database

Claims

1. A method for providing a lip-sync avatar using a multimodal interface, comprising:

a first step of inputting a user's voice into a mobile communication terminal;

a second step in which a multimodal server receives the user's voice in analog form, converts it to digital data and filters it, and, in conjunction with a page-special grammar database of a web server, outputs voice data in which the current URL is combined with a special grammar;

a third step in which a voice recognition server receives the voice data, extracts features of the voice signal, and selects an avatar corresponding to the voice signal; and

a fourth step of reading the selected avatar from an avatar server and transmitting it to the mobile communication terminal via the multimodal server.

2. The method of claim 1, wherein the features of the voice signal are extracted using a wavelet transform.

3. The method of claim 1, further comprising, before the first step:

receiving a URL from the user and connecting, through a WAP gateway, to a web server that provides avatars;

transmitting the avatar requested by the user to the mobile communication terminal through the WAP gateway; and

configuring the transmitted avatar, including selecting the avatar, choosing whether the avatar is to run, and setting an avatar for each party.

4. An apparatus for providing a lip-sync avatar using a multimodal interface, comprising:

a mobile communication terminal;

a multimodal server that, to provide a lip-sync avatar, receives voice data and outputs converted voice data through a conversion process that extracts and parameterizes features of the voice data;

an avatar server configured to store users' converted voice data together with the corresponding avatars; and

a voice recognition server that uses a voice recognizer to compare the converted voice data with the users' converted voice data stored in the avatar server and outputs the avatar of the most similar user.

5. The apparatus of claim 4, further comprising a web server that handles avatar settings and performs web searches corresponding to the voice, and a WAP gateway that converts between the web server and the mobile communication terminal.