KR100826778B1

KR100826778B1 - Wireless mobile for multimodal based on browser, system for generating function of multimodal based on mobil wap browser and method thereof

Info

Publication number: KR100826778B1
Application number: KR20060053390A
Authority: KR
Inventors: 천희진; 김민석; 엄봉수
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2006-06-14
Filing date: 2006-06-14
Publication date: 2008-04-30
Also published as: KR20070119153A

Abstract

Disclosed are a browser-based wireless terminal for multimodal interaction, a browser-based multimodal server and system for the wireless terminal, and a method of operating the same.

The system according to the present invention provides multimodal functions for voice and text processing on the basis of a terminal WAP browser. It comprises: a wireless terminal that requests a multimodal plug-in on the WAP browser, transmits voice or keypad information over the wireless Internet by means of the multimodal plug-in, and receives the corresponding content; a multimodal platform that, after recognizing the multimodal plug-in request, performs data conversion on the voice and keypad information, generates an information recognition result based on the data conversion, processes the visual data of the content corresponding to the information recognition result into voice and keypad information, and transmits it to the wireless terminal; and a content server that extracts the content for the information recognition result and provides it to the multimodal platform. The present invention thus makes a mobile device more convenient to operate through two input methods (key input and voice input) and allows a voice recognition search service to be provided.

WAP, Browser, Multimodal, Plug-in, Voice Recording, Voice Recognition, Voice Output

Description

WIRELESS MOBILE FOR MULTIMODAL BASED ON BROWSER, SYSTEM FOR GENERATING FUNCTION OF MULTIMODAL BASED ON MOBIL WAP BROWSER AND METHOD THEREOF

FIG. 1 is a block diagram illustrating a conventional multimodal system.

FIG. 2 is a block diagram outlining the multimodal system according to the present invention.

FIG. 3 shows the WAP browser of a wireless terminal, illustrating voice recording among the multimodal functions according to the present invention.

FIG. 4 is a block diagram illustrating the voice recording plug-in according to the present invention.

FIG. 5 shows the WAP browser of a wireless terminal, illustrating voice recognition among the multimodal functions according to the present invention.

FIG. 6 is a block diagram illustrating the voice recognition plug-in according to the present invention.

FIG. 7 shows the WAP browser of a wireless terminal, illustrating voice output among the multimodal functions according to the present invention.

FIG. 8 is a block diagram illustrating the voice output plug-in according to the present invention.

<Explanation of reference numerals for the main parts of the drawings>

201: wireless terminal 203: multimodal platform

205: content server 601: voice recognition server (ASR)

801: voice conversion server (TTS)

The present invention relates to a multimodal interface for wireless Internet services. More particularly, it relates to a browser-based wireless terminal for multimodal interaction in which part of the code written as a terminal program (WIPI) can coexist, by means of a plug-in method, with the Wireless Markup Language (WML) content of a WAP browser, to a browser-based multimodal server and system for the wireless terminal, and to a method of operating the same.

Keyboards and mice are currently the main interfaces between humans and computers, but a more natural method for humans is the speech they already use with one another. This has already been attempted. Unlike conversation between people, however, the computer does not understand the content of the input speech: it simply converts what was spoken into text and responds as if that text had been typed on a keyboard, so speech understanding technology is not yet in practical use. The reason is that speech understanding, like artificial intelligence, requires machine intelligence. At the current level of technology it is not feasible in general domains and is possible only in extremely restricted domains such as travel planning.

Recently, however, various technologies combining a wireless terminal with a voice recognition module have been offered, and among them multimodal technology has emerged to provide a variety of interfaces between the user and the wireless terminal. That is, whereas existing wireless Internet services offered a single input/output mode consisting of screen output and key input, multimodal technology adds voice input/output and synchronizes it with screen input/output to provide one integrated interface.

The multimodal technology currently disclosed is described below with reference to the accompanying drawings. As shown in FIG. 1, it consists of a multimodal Internet server 110 and a multimodal Internet client 120 connected to the server over the Internet.

The multimodal Internet server 110 includes a task-specific template library 111, an MXML editor 112 that lets a user author MXML documents using the task-specific template library 111, and an MXML document server that stores the authored MXML documents and serves them at the user's request.

The multimodal Internet client 120 includes: an MXML browser 121, which interprets an MXML document, displays the HTML content on the screen, interprets the voice XML content, synthesizes the messages to be spoken with a speech synthesis engine and plays them through a speaker or telephone interface, prepares the speech recognition engine using a language model for speech recognition, recognizes what the user says, and performs the operations specified in the voice XML; a speech recognition/synthesis engine 122, which performs speech recognition or speech synthesis at the request of the MXML browser 121; a TAPI (Telephony Application Programming Interface) 125 and an MTAPI (Multimedia Telephony Application Programming Interface) 126, which provide an interface when the multimodal Internet client is accessed by telephone 130; and an I/O interface 123, which provides the interface between the MXML browser 121 and the I/O (input/output) devices 124 such as a keyboard, mouse, monitor, microphone, and speaker, the TAPI 125, and the MTAPI 126.

The multimodal Internet server 110 plays the role of a conventional web server, serving MXML documents on request. The MXML documents stored on the multimodal Internet server 110 are authored with the MXML editor 112 and the task-specific template library 111. Unlike ordinary HTML, MXML requires a language model for speech recognition to be written, which demands specialized knowledge of language processing and is therefore difficult for an ordinary HTML author.

The multimodal Internet client 120 is a terminal equipped with a CPU, such as a conventional PC/WS (personal computer/workstation). It has the MXML browser 121, which interprets MXML and displays it on the screen; the speech recognition/synthesis engine 122; the I/O devices 124 (keyboard, mouse, monitor, microphone, speaker, and so on); and an interface module, namely the I/O interface 123, between the TAPI and MTAPI and the MXML browser 121.

A multimodal function configured in this way makes Internet browsing more convenient by adding voice, which is convenient for people, as an interface alongside the keyboard, mouse, and monitor. Moreover, rather than accepting only fixed expressions, it can adopt the free-form dialogue used in everyday life, implemented through a meta-grammar function.

However, adding a multimodal function to a terminal application such as WIPI, as described above, relies on a server platform that processes voice input/output and a terminal application that can interwork with it. The terminal application must therefore be developed separately, and it has been pointed out that this makes adoption very slow.

The present invention was devised to solve these problems. An object of the present invention is to provide a browser-based wireless terminal for multimodal interaction, a browser-based multimodal server and system for the wireless terminal, and a method of operating the same, which add a multimodal function to the WAP browser and can thereby offer convenient input/output services to users of wireless Internet services regardless of terminal type.

Another object of the present invention is to provide a browser-based wireless terminal for multimodal interaction, a browser-based multimodal server and system for the wireless terminal, and a method of operating the same, in which part of the code written in WIPI coexists with the WML content of the WAP browser by means of a plug-in method, thereby speeding up adoption compared with terminal-application-based services and making content development more convenient.

Yet another object of the present invention is to provide a browser-based wireless terminal for multimodal interaction, a browser-based multimodal server and system for the wireless terminal, and a method of operating the same, which subdivide the plug-in method into voice recording, voice recognition, and voice output components to provide terminal-browser-based multimodal functions, so that the multimodal functions can be readily applied to both existing and new terminals.

To achieve the above objects, a browser-based multimodal server for a wireless terminal according to a first aspect of the present invention is a service server for providing multimodal functions for voice and text processing on the basis of a terminal WAP browser, characterized by comprising a multimodal server that, after recognizing the multimodal plug-in connection state of the wireless terminal, performs data conversion on voice or keypad information received over the wireless Internet, generates an information recognition result based on the data conversion, receives the content corresponding to the information recognition result, processes the visual data of that content into voice and keypad information, and transmits it to the wireless terminal.

According to a preferred embodiment of the present invention, the multimodal plug-in consists of one or more of a plug-in for voice recording, a plug-in for voice recognition, and a plug-in for voice output.

Further, for activation of the voice recognition plug-in, the multimodal server responds to a voice recognition command requested from the wireless terminal by performing voice recognition processing according to a predetermined voice recognition algorithm.

Further, for activation of the voice output plug-in, the multimodal server plays back information that was recorded from the wireless terminal or rendered as text.

Further, the multimodal platform interworks with a TTS server in order to play back the text information.

To achieve the above objects, a browser-based wireless terminal for multimodal interaction according to a second aspect of the present invention is a wireless terminal for providing multimodal functions for voice and text processing on the basis of a terminal WAP browser, characterized by comprising a plug-in operation mode for requesting a multimodal plug-in on the WAP browser, transmitting voice or keypad information over the wireless Internet by means of the multimodal plug-in, and then receiving the content corresponding to the voice or keypad information.

According to a preferred embodiment of the present invention, the plug-in request is made using an OBJECT tag on the WAP browser of the wireless terminal.

Further, the object tag is included in the WML document when the wireless terminal sends a message to the multimodal platform.

To achieve the above objects, a system for providing browser-based multimodal functions for a wireless terminal according to a third aspect of the present invention is a system for providing multimodal functions for voice and text processing on the basis of a terminal WAP browser, characterized by comprising: a wireless terminal that requests a multimodal plug-in on the WAP browser, transmits voice or keypad information over the wireless Internet by means of the multimodal plug-in, and receives the corresponding content; a multimodal platform that, after recognizing the multimodal plug-in request, performs data conversion on the voice and keypad information, generates an information recognition result based on the data conversion, processes the visual data of the content corresponding to the information recognition result into voice and keypad information, and transmits it to the wireless terminal; and a content server that extracts the content for the information recognition result and provides it to the multimodal platform.

Meanwhile, to achieve the above objects, a voice recording method among the browser-based multimodal functions for a wireless terminal according to a fourth aspect of the present invention is characterized by comprising the steps of: a) after the voice recording plug-in is activated, when the button for voice recording is pressed on the wireless terminal, transmitting the mobile phone number (MDN) and the voice recording parameters to the multimodal platform; b) delivering the user's voice input to the wireless terminal to the multimodal platform by EVRC streaming; c) the multimodal platform converting the user's voice to a pre-specified audio type, mapping it to the mobile phone number (MDN), and storing it; and d) after an 'upload recorded message' signal from the wireless terminal is recognized, transferring the user voice information stored on the multimodal platform to the content server, using the mobile phone number (MDN) as the key.
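Steps c) and d) of the recording method amount to a store keyed by MDN followed by a keyed hand-over. The following is a minimal sketch of that idea only, not the patented implementation: `RecordingStore`, `convert`, and the use of plain dictionaries for the platform and the content server are all hypothetical, and the transcoding step is a stub.

```python
from dataclasses import dataclass, field


def convert(evrc_payload: bytes, audio_type: str) -> bytes:
    # Placeholder for transcoding: a real platform would decode EVRC and
    # re-encode it; this stub only tags the payload with the target type.
    return audio_type.encode("ascii") + b":" + evrc_payload


@dataclass
class RecordingStore:
    """Hypothetical stand-in for the multimodal platform's temporary storage."""
    recordings: dict = field(default_factory=dict)  # MDN -> (audio_type, bytes)

    def store(self, mdn: str, audio_type: str, evrc_frames: list) -> None:
        # Step c): convert the streamed EVRC frames to the requested audio
        # type and map the result to the MDN.
        converted = convert(b"".join(evrc_frames), audio_type)
        self.recordings[mdn] = (audio_type, converted)

    def upload(self, mdn: str, content_server: dict) -> bool:
        # Step d): on the upload signal, use the MDN as the key to move the
        # stored recording to the content server.
        entry = self.recordings.pop(mdn, None)
        if entry is None:
            return False
        content_server[mdn] = entry
        return True
```

Keying on the MDN lets the later page request identify the recording without the terminal having to send the audio a second time.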

Alternatively, a voice recognition method among the browser-based multimodal functions for a wireless terminal according to the same aspect of the present invention is characterized by comprising the steps of: a) after the voice recognition plug-in is activated and the wireless terminal opens communication with the multimodal platform, passing the voice recognition parameters, including the mobile phone number (MDN), to the multimodal platform; b) the multimodal platform automatically detecting the end point of the speech using server-side EPD (End Point Detection) and transmitting it to the content server; c) the multimodal platform converting the user's voice file, transmitted in EVRC format, to PCM format and then passing the voice recognition parameters and the recorded voice file to a voice recognition server (ASR) for voice recognition processing; d) the voice recognition server (ASR) providing the recognition result to the multimodal platform, and the multimodal platform creating and storing a temporary mapping entry between the mobile phone number (MDN) and the recognition result; e) the multimodal platform presenting the recognition result as text in the terminal browser of the wireless terminal; and f) on the basis of the user's approval of the recognition result, the content server using the mobile phone number (MDN) as the key on the multimodal platform to query the most recent result for the voice recognition.
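The recognition steps can be read as a small pipeline: end-point detection, format conversion, an ASR call, and a temporary MDN-to-result mapping that the content server later queries. The sketch below illustrates that flow under loud assumptions: the function names are hypothetical, the end-point detector simply stops at the first empty frame, the EVRC-to-PCM conversion is reduced to concatenation, and the ASR server is any callable passed in by the caller.

```python
from typing import Callable, Dict, List, Optional


def detect_end_point(frames: List[bytes]) -> List[bytes]:
    """Toy server-side EPD: keep frames up to the first empty (silent) frame."""
    speech = []
    for frame in frames:
        if not frame:
            break
        speech.append(frame)
    return speech


def recognize(mdn: str, frames: List[bytes],
              asr: Callable[[bytes], str],
              results: Dict[str, str]) -> str:
    speech = detect_end_point(frames)   # step b): find the end of the utterance
    pcm = b"".join(speech)              # step c): EVRC -> PCM, stubbed as a join
    text = asr(pcm)                     # steps c)-d): hand the audio to the ASR
    results[mdn] = text                 # step d): temporary MDN -> result mapping
    return text                         # step e): text shown in the browser


def latest_result(mdn: str, results: Dict[str, str]) -> Optional[str]:
    # Step f): the content server looks up the latest result by MDN.
    return results.get(mdn)
```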

Alternatively, a voice output method among the browser-based multimodal functions for a wireless terminal according to the same aspect of the present invention is characterized by comprising the steps of: a) after the voice output plug-in is activated, the wireless terminal opening communication with the multimodal platform and passing the mobile phone number (MDN) and the voice output parameters to the multimodal platform; b) the multimodal platform reading the text file or voice file stored on the content server from the URL information included in the voice output parameters; c) if the URL information refers to a text file, the multimodal platform passing the text to a TTS server for conversion into speech; d) the TTS server providing the converted speech to the multimodal platform, and the multimodal platform converting it to EVRC format, a form the wireless terminal can play; and e) the multimodal platform streaming the EVRC-converted speech to the plug-in in the WAP browser of the wireless terminal.
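The output steps branch on whether the URL resolves to text or to audio and then stream the result in pieces. As a rough sketch of that branching, assuming hypothetical callables `fetch`, `tts`, and `to_evrc` supplied by the caller (none of them come from the source) and an arbitrary chunk size:

```python
from typing import Callable, Iterator, List, Tuple, Union

Resource = Tuple[str, Union[str, bytes]]  # ("text", str) or ("audio", bytes)


def stream_chunks(payload: bytes, size: int) -> Iterator[bytes]:
    # Step e): split the encoded audio into fixed-size chunks for streaming.
    for start in range(0, len(payload), size):
        yield payload[start:start + size]


def voice_output(url: str,
                 fetch: Callable[[str], Resource],
                 tts: Callable[[str], bytes],
                 to_evrc: Callable[[bytes], bytes],
                 chunk_size: int = 160) -> List[bytes]:
    kind, data = fetch(url)                        # step b): read from the content server
    audio = tts(data) if kind == "text" else data  # step c): only text goes to the TTS
    return list(stream_chunks(to_evrc(audio), chunk_size))  # steps d)-e)
```

For example, with a stub `tts` that returns `b"hihihi"` and an identity `to_evrc`, a chunk size of 4 yields the chunks `b"hihi"` and `b"hi"`.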

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

First, the present invention supplies its functions to the terminal browser through a plug-in method. That is, even if the multimodal function is not built in when the terminal is released, it can be installed afterwards, and the user can install the plug-in easily. Unlike a terminal application, a plug-in behaves much like an ActiveX control, so it can be installed without any additional work by the user. The present invention thus provides multimodal functions in the terminal browser through this plug-in method.

The multimodal plug-in provided by the present invention comes in three forms: a voice recording plug-in, a voice recognition plug-in, and a voice output plug-in. When the terminal requests one of these, the multimodal platform activates the corresponding plug-in.

FIG. 2 is a block diagram of a configuration for providing terminal-browser-based multimodal functions according to an embodiment of the present invention. As shown, the configuration consists of: a wireless terminal 201 that requests a multimodal plug-in on the WAP browser, transmits voice or keypad information over the wireless Internet by means of the multimodal plug-in, and receives the corresponding content; a multimodal platform 203 that, after recognizing the multimodal plug-in request, performs data conversion on the voice and keypad information, generates an information recognition result based on the data conversion, processes the visual data of the content corresponding to the information recognition result into voice and keypad information, and transmits it to the wireless terminal 201; and a content server 205 that extracts the content for the information recognition result and provides it to the multimodal platform 203.

The multimodal plug-in consists of a voice recording plug-in, a voice recognition plug-in, and a voice output plug-in, any one of which is requested on the WAP browser. For the voice recording plug-in, the multimodal platform 203 converts the voice input from the wireless terminal 201 from EVRC format to WAV format, stores it temporarily, and transmits it to the content server 205.

For the voice recognition plug-in, the multimodal platform 203 responds to a voice recognition command requested from the wireless terminal 201 by performing voice recognition processing according to a predetermined voice recognition algorithm, and provides the result to the content server 205. For the voice output plug-in, the multimodal platform 203 can play back information recorded from the wireless terminal 201 or pre-recorded as part of the content and, at the user's option, converts text information into voice information through the TTS server and provides it to the wireless terminal 201.

The operation of the present invention is described below. First, in order to use the plug-in, the wireless terminal 201 transmits an OBJECT tag to the multimodal platform 203 from the WAP browser. The object tag is a set of control commands for checking whether the plug-in is installed or for performing a version-dependent update. The object tag is included in the WML document when the wireless terminal 201 sends a message to the multimodal platform 203, and may be implemented in a form such as the following:

[object tag example not reproduced in the source text]

When the WAP browser of the wireless terminal 201 loads the document, the object tag checks whether the plug-in is installed; if it is not installed, or only an older version is installed, the plug-in is installed immediately after user confirmation. Once installed, the plug-in runs while occupying part of the WAP browser screen. The user thus performs multimodal functions through the wireless terminal 201. As described above, the multimodal functions are voice recording, recognition, and output, for which the user activates the voice recording plug-in, the voice recognition plug-in, and the voice output plug-in respectively.
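The install-or-update decision described above reduces to a comparison between the installed version and the version the page asks for. The source does not specify a version format, so the sketch below assumes dotted numeric versions; `parse_version` and `plugin_action` are hypothetical names, and the user-confirmation prompt is left out.

```python
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    # Assumed format: dot-separated integers such as "1.2" or "1.0.9".
    return tuple(int(part) for part in version.split("."))


def plugin_action(installed: Optional[str], required: str) -> str:
    """Decide what the browser should do before running the plug-in."""
    if installed is None:
        return "install"   # plug-in missing: install after user confirmation
    if parse_version(installed) < parse_version(required):
        return "update"    # older version present: update after confirmation
    return "run"           # same or newer version: start the plug-in
```

Comparing integer tuples rather than raw strings avoids treating "1.10" as older than "1.2".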

FIG. 3 shows the WAP browser of the wireless terminal 201, illustrating the voice recording plug-in function among the multimodal plug-ins according to an embodiment of the present invention. As shown, the voice recording plug-in records voice in the browser and delivers it to the multimodal platform. The multimodal plug-in consists of a "Record" button 301, a "Listen" button 303, and a status display field 305.

When the user presses the "Record" button 301, recording starts and the status display field shows the elapsed time. At this point the "Record" button 301 changes into a "Stop" button 307. When recording is finished, the user presses the "Stop" button 307 to stop recording. Pressing the "Listen" button 303 plays back the recorded voice. In addition, operating the corresponding key button in response to the 'Upload recorded message' prompt shown on the WAP screen of the wireless terminal 201 delivers the recorded voice to the content server 205.

Meanwhile, the multimodal platform 203 records the voice message supplied by the wireless terminal 201 and converts it, using one of several audio types, to the type required by the content server 205. Different content servers may require different message formats, and the multimodal platform 203 supports any one of the following formats: 'evrc: Enhanced Variable Rate Codec Format', 'alaw: A-Law Format', 'mulaw: Mu-Law Format', 'pcm: Intel PCM Format', 'alaw-wav: A-Law Wav Format', 'mulaw-wav: Mu-Law Wav Format', and 'pcm-wav: PCM Wav Format'. To select the format, the multimodal platform 203 sets the voice recording plug-in parameters.
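The seven audio type identifiers listed above form a closed set, so a platform could validate an incoming Audio Type value against an enumeration. This is only an illustrative sketch: the identifiers are taken from the text, while the `AudioType` class, the `parse_audio_type` helper, and the case-insensitive matching are assumptions.

```python
from enum import Enum


class AudioType(Enum):
    EVRC = "evrc"            # Enhanced Variable Rate Codec Format
    ALAW = "alaw"            # A-Law Format
    MULAW = "mulaw"          # Mu-Law Format
    PCM = "pcm"              # Intel PCM Format
    ALAW_WAV = "alaw-wav"    # A-Law Wav Format
    MULAW_WAV = "mulaw-wav"  # Mu-Law Wav Format
    PCM_WAV = "pcm-wav"      # PCM Wav Format


def parse_audio_type(value: str) -> AudioType:
    """Map a parameter string to a supported audio type, or raise ValueError."""
    try:
        # Assumption: surrounding whitespace and letter case are not significant.
        return AudioType(value.strip().lower())
    except ValueError:
        raise ValueError(f"unsupported audio type: {value!r}") from None
```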

The voice recording plug-in parameters include a 'MaxLength' parameter, which sets the maximum length of the voice recording input in seconds, and a 'UseIntro' parameter, which sets whether a prompt message such as "Please record your message" is output to the wireless terminal 201. The default of 'MaxLength' is 60 seconds, and the default of 'UseIntro' is '1', which indicates that the message is output. These parameters are specified by the user on the WAP browser of the wireless terminal 201, and the specified parameters are provided to the multimodal platform 203. FIG. 4 is a diagram illustrating this voice recording procedure.
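The voice recording parameters described above can be pictured as a small validated record. The following is a minimal sketch, not the patented implementation: the class name `RecordingParams`, its field names, and the reading of 'UseIntro' = 0 as "no prompt" are assumptions made for illustration, while the format list and the defaults come from the description.

```python
from dataclasses import dataclass

# Audio type identifiers listed in the description.
SUPPORTED_AUDIO_TYPES = (
    "evrc", "alaw", "mulaw", "pcm", "alaw-wav", "mulaw-wav", "pcm-wav",
)


@dataclass(frozen=True)
class RecordingParams:
    """Hypothetical container for the voice recording plug-in parameters."""

    audio_type: str       # Audio Type: format the content server requires
    max_length: int = 60  # MaxLength: default of 60 seconds per the description
    use_intro: int = 1    # UseIntro: 1 = output the prompt (0 = off is assumed)

    def __post_init__(self):
        if self.audio_type not in SUPPORTED_AUDIO_TYPES:
            raise ValueError(f"unsupported Audio Type: {self.audio_type!r}")
        if self.max_length <= 0:
            raise ValueError("MaxLength must be a positive number of seconds")
        if self.use_intro not in (0, 1):
            raise ValueError("UseIntro must be 0 or 1")
```

Constructing `RecordingParams(audio_type="pcm-wav")` yields the documented defaults, and an unlisted format is rejected before anything is sent to the platform.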

First, in step S401, when the user presses the 'Record' button 301, the mobile phone number (MDN) and the parameters described above, namely the Audio Type and MaxLength parameters, are transmitted to the multimodal platform 203. The wireless terminal 201 delivers the user's voice to the multimodal platform 203 by EVRC streaming. The Audio Type parameter follows the format required by the content, and the MaxLength parameter is set according to the size of the user's voice to be stored.

In step S403, the multimodal platform 203 converts the voice into the format specified by Audio Type, maps it to the mobile phone number (MDN), and stores it temporarily. Then, in step S405, the user selects the 'Upload recorded message' button provided on the WAP browser. This is a request for the next page, and the content server 205 receives the page request. At this point, the content server 205 does not have the voice file itself.

In step S407, the content server 205 passes the mobile phone number (MDN) to the multimodal platform 203 and requests the previously stored voice in order to retrieve the recording. The content server 205 receives the voice, stores it locally, and uses it. Then, in step S409, the content server 205 moves to the next page.
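Steps S401 to S409 amount to a hand-off keyed by the MDN: the platform stores the recording and the content server later fetches it. The sketch below models this with in-memory objects only. The class and method names are hypothetical, the transcoding step is a placeholder, and deleting the temporary entry once it has been fetched is an assumption rather than something the description states.

```python
class MultimodalPlatform:
    """Toy stand-in for the multimodal platform 203 (in-memory only)."""

    def __init__(self):
        self._recordings = {}  # MDN -> (audio_type, audio bytes)

    def store_recording(self, mdn, evrc_audio, audio_type):
        # S403: convert to the requested Audio Type, then map it to the MDN.
        converted = self._convert(evrc_audio, audio_type)
        self._recordings[mdn] = (audio_type, converted)

    def fetch_recording(self, mdn):
        # S407: hand the temporarily stored voice to the content server.
        # Removing the entry afterwards is an assumption of this sketch.
        return self._recordings.pop(mdn)

    @staticmethod
    def _convert(evrc_audio, audio_type):
        # Placeholder: a real platform would transcode the EVRC stream here.
        return evrc_audio


class ContentServer:
    """Toy stand-in for the content server 205."""

    def __init__(self, platform):
        self.platform = platform
        self.local_files = {}  # MDN -> (audio_type, audio bytes)

    def handle_upload_request(self, mdn):
        # S405/S407: the page request carries no audio, so fetch it by MDN.
        self.local_files[mdn] = self.platform.fetch_recording(mdn)
        return "next-page"  # S409: move on to the next page
```

The point of the design is that the WAP page request never carries the audio; the MDN is the only value the two servers need to share.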

FIG. 5 shows the WAP browser of the wireless terminal 201 for explaining the voice recognition plug-in function.

The voice recognition plug-in consists of a "Record" button and an Edit field formed as an input box. When the user presses the "Record" button, a prompt message such as "Please say your search term" is output, and the plug-in then switches to voice input mode. For example, when the user utters the search term 'Lee Hyori', the voice is recorded and streamed to the multimodal platform 203. The voice delivered to the multimodal platform 203 is converted into text by a speech recognizer and delivered to the multimodal plug-in. The Edit field of the plug-in then displays the text on the screen.

Alternatively, the user can move the focus to the Edit field and enter text with the keypad. Because a voice recognition result may be ambiguous, several candidates are sometimes selected; in that case, the Edit field changes into a combo box that lists the candidates from the voice recognition result. The user can then press the "GO" button provided on the WAP browser of the wireless terminal 201 to view the search results.
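The switch between the Edit field and the combo box can be expressed as a simple rule on the number of recognition candidates. This is an illustrative sketch only; the function name `edit_field_view` and the tuple it returns are hypothetical and are not part of the plug-in described here.

```python
def edit_field_view(candidates):
    """Return ("edit", text) for zero or one candidate, else ("combo", list)."""
    if len(candidates) <= 1:
        # Zero or one candidate: keep the plain input box.
        return ("edit", candidates[0] if candidates else "")
    # Several candidates: the Edit field becomes a combo box listing them all.
    return ("combo", list(candidates))
```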

As described, voice recognition is performed on the multimodal platform 203 by a voice recognition algorithm, and the wireless terminal 201 provides the parameters for voice recognition to the multimodal platform 203. The parameter for voice recognition is 'GrammarURI', which specifies the grammar used for recognition. The grammar may be an HTTP URL or may refer to a static grammar.

In the case of an HTTP URL, grammars in the JGSF, ABNF, and GRXML formats are supported, and the content server 205 must provide the grammar file dynamically (dynamic grammar). An example is 'http://cp.nate.com/music_gr.jgsf'. A static grammar is one that is already registered with the speech recognizer on the multimodal server. Because this approach is more efficient, it is typically used for large-vocabulary grammars. A static grammar takes a form such as 'static::grammar1'.
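The two GrammarURI forms can be told apart by their prefix. The following sketch assumes that a static grammar always starts with the literal `static::` shown in the example and that a dynamic grammar is any `http` URL with a host; the function name and the returned labels are hypothetical.

```python
from urllib.parse import urlparse

STATIC_PREFIX = "static::"


def classify_grammar_uri(grammar_uri):
    """Return ("static", name) or ("dynamic", url) for a GrammarURI value."""
    if grammar_uri.startswith(STATIC_PREFIX):
        name = grammar_uri[len(STATIC_PREFIX):]
        if not name:
            raise ValueError("static grammar name is empty")
        return ("static", name)
    parsed = urlparse(grammar_uri)
    if parsed.scheme == "http" and parsed.netloc:
        # Dynamic grammar: the content server serves the file at this URL.
        return ("dynamic", grammar_uri)
    raise ValueError(f"unrecognized GrammarURI: {grammar_uri!r}")
```

With the two examples from the description, `static::grammar1` is classified as static with the name `grammar1`, and `http://cp.nate.com/music_gr.jgsf` is classified as dynamic.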

In addition, the voice recognition plug-in uses a 'UseNbest' parameter, which determines whether N-best is used; a 'UseIntro' parameter, which controls output of the prompt message; and a 'MaxLength' parameter, which gives the maximum length of the voice recognition input in seconds. The default of 'UseNbest' is '0', meaning that N-best is not used; the default of 'UseIntro' is '1', meaning that the prompt message is output; and the default of 'MaxLength' is, for example, 10 seconds.
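A request to the platform can be assembled by overlaying caller-supplied values on these defaults. This is a minimal sketch under the assumption that the parameters travel as a simple name/value mapping; the helper name `build_recognition_params` is hypothetical.

```python
# Defaults as stated in the description.
RECOGNITION_DEFAULTS = {"UseNbest": 0, "UseIntro": 1, "MaxLength": 10}


def build_recognition_params(grammar_uri, **overrides):
    """Merge caller overrides into the documented defaults."""
    unknown = set(overrides) - set(RECOGNITION_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {"GrammarURI": grammar_uri, **RECOGNITION_DEFAULTS, **overrides}
```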

FIG. 6 is a diagram illustrating the operation of the voice recognition plug-in according to an embodiment of the present invention. The ASR server, which has not been described above, is the voice recognition server 601; it works with the multimodal platform 203 to process recognition of the user's voice.

First, in step S601, the voice recognition plug-in establishes communication with the multimodal platform 203. The wireless terminal 201 transmits the voice recognition parameters, such as GrammarURI, UseNbest, and MaxLength, together with the mobile phone number (MDN), to the multimodal platform 203. The user's utterance, 'Lee Hyori', is then delivered by streaming, and the multimodal platform 203 automatically detects the end point of the voice using server-side EPD (End Point Detection) and transmits it to the content server 205.

In step S603, the multimodal platform 203 converts the user's voice file, which was transmitted in EVRC format, into PCM format and then delivers the GrammarURI parameter and the recorded voice file to the voice recognition server 601. The voice recognition server 601 returns the recognition result to the multimodal platform 203. Then, in step S605, the multimodal platform 203 creates and stores a temporary mapping entry between the mobile phone number (MDN) and the recognition result.

In step S607, the multimodal platform 203 delivers the voice recognition result, 'Lee Hyori', as text to the plug-in on the terminal browser, and the plug-in displays it in the Edit field (input box). In step S609, the user presses the 'GO' button on the basis of the text result to request a search. The search request signal is provided to the content server 205. Because the voice recognition plug-in cannot pass the input result through ECMAScript in step S609, the content server 205 does not know what the search term is. The content server 205 therefore looks up the most recent search result on the multimodal platform 203, using the mobile phone number (MDN) as the key. Then, in step S611, the content server 205 generates the next search result page from this result and transmits it to the WAP browser of the wireless terminal 201.
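The workaround in steps S605 to S611 is a lookup by MDN: the platform remembers the latest recognition result for each terminal, and the content server asks for it when the search request arrives without the term. The sketch below models only that lookup; the class `RecognitionStore`, the function `build_search_page`, and the page strings are hypothetical.

```python
class RecognitionStore:
    """Toy model of the temporary MDN -> recognition result mapping."""

    def __init__(self):
        self._latest = {}

    def save_result(self, mdn, text):
        # S605: keep the most recent recognition result for this MDN.
        self._latest[mdn] = text

    def latest_result(self, mdn):
        # S609: lookup performed by the content server, keyed by MDN.
        return self._latest.get(mdn)


def build_search_page(store, mdn):
    """S611: build the next page from the looked-up term (illustrative)."""
    query = store.latest_result(mdn)
    if query is None:
        return "no recognition result for this terminal"
    return f"search results for: {query}"
```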

FIG. 7 shows the WAP browser of a wireless terminal for explaining the voice output plug-in function according to an embodiment of the present invention. Voice output can play back a recorded voice, or it can convert text to speech through a TTS server. The WAP browser of the wireless terminal 201 provides a speaker-shaped icon to indicate the state of the voice output plug-in.

As illustrated, voice output converts text information, such as text news or an e-book, into speech. This conversion is performed by the TTS server, and for this purpose the wireless terminal 201 transmits content-dependent parameters to the multimodal platform 203. The parameters are: an 'AutoPlay' parameter, which determines whether audio is output automatically as soon as the plug-in runs; a 'UseTTS' parameter, which determines whether the TTS server is used to convert text to speech; an 'AudioType' parameter, which, when 'UseTTS' is set to use a voice file (that is, not to use the TTS server), selects one of the evrc, alaw, mulaw, pcm, alaw-wav, mulaw-wav, and pcm-wav formats; and an 'AudioURI' parameter, which gives the location of the voice file to be converted.

When 'UseTTS' is 1, the 'AudioType' parameter must be text/plain or text/xml+ssml, where text/plain denotes plain text and text/xml+ssml denotes SSML (Speech Synthesis Markup Language). When 'UseTTS' is '1' (use the TTS server), the 'AudioURI' parameter specifies either the plain text or the HTTP URL of an SSML file. When 'UseTTS' is '0' (do not use the TTS server), 'AudioURI' specifies the HTTP URL of a voice file.
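The dependency between 'UseTTS', 'AudioType', and 'AudioURI' can be checked before a request is sent. The sketch below is one reading of the rules above, not a definitive one: it assumes that text/plain pairs with inline text, that text/xml+ssml pairs with an HTTP URL, and that an "HTTP URL" is a string starting with `http://`. The function name is hypothetical.

```python
TTS_AUDIO_TYPES = ("text/plain", "text/xml+ssml")
FILE_AUDIO_TYPES = (
    "evrc", "alaw", "mulaw", "pcm", "alaw-wav", "mulaw-wav", "pcm-wav",
)


def check_output_params(use_tts, audio_type, audio_uri):
    """Raise ValueError if the voice output parameters are inconsistent."""
    is_http_url = audio_uri.startswith("http://")
    if use_tts == 1:
        if audio_type not in TTS_AUDIO_TYPES:
            raise ValueError("with UseTTS=1, AudioType must be text/plain or text/xml+ssml")
        if audio_type == "text/xml+ssml" and not is_http_url:
            # Assumed pairing: SSML content is referenced by URL.
            raise ValueError("an SSML AudioURI must be an HTTP URL")
    elif use_tts == 0:
        if audio_type not in FILE_AUDIO_TYPES:
            raise ValueError(f"unsupported voice file type: {audio_type!r}")
        if not is_http_url:
            raise ValueError("a voice file AudioURI must be an HTTP URL")
    else:
        raise ValueError("UseTTS must be 0 or 1")
```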

FIG. 8 is a diagram illustrating the voice output plug-in procedure according to an embodiment of the present invention. This embodiment includes a TTS server 801, which converts text to speech and works in conjunction with the multimodal platform 203.

As illustrated, in step S801 the voice output plug-in establishes communication with the multimodal platform 203 and transmits the mobile phone number (MDN) and parameters such as AutoPlay, UseTTS, AudioType, and AudioURI to the multimodal platform 203. In step S803, the multimodal platform 203 uses the URL information contained in the AudioURI parameter to read the text file or voice file stored on the content server 205. Then, in step S805, if the URL points to a text file, the multimodal platform 203 passes the text to the TTS server 801 for conversion to speech.

The TTS server 801 returns the synthesized speech to the multimodal platform 203, and in step S807 the multimodal platform 203 converts it into EVRC format, which the wireless terminal 201 can play. In step S809, the multimodal platform 203 streams the EVRC-converted voice to the plug-in on the WAP browser of the wireless terminal 201. The user thus hears the text information through the wireless terminal 201 as EVRC-encoded speech.
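Steps S803 to S807 form a short pipeline: fetch the content, synthesize it if it is text, and convert the audio to EVRC. The sketch below wires these steps together with caller-supplied functions standing in for the content server, the TTS server 801, and the transcoder. All names are hypothetical, and using the 'UseTTS' flag to decide whether the fetched content is text is an assumption of this sketch.

```python
def run_voice_output(params, fetch, synthesize, to_evrc):
    """Return EVRC audio ready to be streamed to the plug-in (S809)."""
    # S803: read the text or voice file named by AudioURI.
    content = fetch(params["AudioURI"])
    if params["UseTTS"] == 1:
        # S805: text content goes to the TTS server for synthesis.
        audio = synthesize(content)
    else:
        # Voice file: already audio, so no synthesis step is needed.
        audio = content
    # S807: convert to the format the wireless terminal can play.
    return to_evrc(audio)
```

Passing the three stages in as functions keeps the sketch testable without any network access; in practice each would be a call to a separate server.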

As described above, the browser-based wireless terminal for multimodal operation, the browser-based multimodal server and system for a wireless terminal, and the method of operating them according to the present invention take advantage of the fact that a plug-in can be installed at any time after a terminal is released, even if the function was not included at release, and that plug-ins are easy for users to install. By launching the voice recording, voice recognition, and voice output plug-in functions individually to realize multimodal operation, the invention enables voice input of search terms on the wireless Internet. The user can therefore operate the mobile device conveniently through two input methods (key input + voice input) and can be provided with a voice recognition search service.

The present invention also provides a multimodal messaging service in which a voice can be recorded on the WAP browser and played back by the recipient on the WAP browser, which has the effect of promoting community services on the WAP browser, such as mobile Cyworld.

In addition, the present invention provides services in which a large amount of information is presented as text while the key content is read aloud, which has the effect of improving readability on wireless terminals with limited LCD size.

Although the present invention has been shown and described with reference to specific preferred embodiments, it is not limited to those embodiments, and a person of ordinary skill in the art to which the invention pertains may make various modifications without departing from the gist of the invention as claimed.

Claims

A service server for providing a multimodal function for voice and text processing based on a terminal WAP browser,

the service server comprising a multimodal server that, after recognizing a multimodal plug-in connection state of a wireless terminal, performs data conversion on voice or keyboard information of a user, which is formed through the WAP-browser-based multimodal function and received from the wireless terminal, generates an information recognition result based on the data conversion, receives content corresponding to the information recognition result, processes visual data of that content for the voice and keyboard information, and transmits it to the wireless terminal.

The server of claim 1, wherein the multimodal plug-in

is any one or more of a plug-in for voice recording, a plug-in for voice recognition, and a plug-in for voice output.

The server of claim 1 or 2,

wherein the multimodal server, in response to a voice recognition command requested by the wireless terminal upon activation of the voice recognition plug-in, performs voice recognition processing according to a predetermined voice recognition algorithm.

The server of claim 1 or 2,

wherein the multimodal server plays back recorded or text information from the wireless terminal upon activation of the voice output plug-in.

The server of claim 4, wherein the multimodal platform

interworks with a TTS server to play back the text information.

A wireless terminal for providing a multimodal function for voice and text processing based on a terminal WAP browser,

the wireless terminal comprising a plug-in operation mode in which the terminal requests a multimodal plug-in on the WAP browser, forms and transmits voice or keyboard information of a user through the WAP-browser-based multimodal function while connected via the multimodal plug-in, and then receives content corresponding to the voice or keyboard information.

The wireless terminal of claim 6, wherein the multimodal plug-in

is any one of a plug-in for voice recording, a plug-in for voice recognition, and a plug-in for voice output.

The wireless terminal of claim 6 or 7, wherein the plug-in

is requested using an object (OBJECT) tag on the WAP browser of the wireless terminal.

The wireless terminal of claim 8, wherein the object tag

is included in a WML document when a message is transmitted from the wireless terminal to the multimodal platform.

A system for providing a multimodal function for voice and text processing based on a terminal WAP browser, the system comprising:

a wireless terminal that requests a multimodal plug-in on the WAP browser, forms and transmits voice or keyboard information of a user through the WAP-browser-based multimodal function while connected via the multimodal plug-in, and receives content corresponding to the voice or keyboard information;

a multimodal platform that, after recognizing the multimodal plug-in request, performs data conversion on the voice and keyboard information, generates an information recognition result based on the data conversion, processes visual data of the content corresponding to the information recognition result for the voice and keyboard information, and transmits it to the wireless terminal; and

a content server that extracts the content for the information recognition result and provides it to the multimodal platform.

The system of claim 10, wherein the multimodal plug-in

is any one or more of a plug-in for voice recording, a plug-in for voice recognition, and a plug-in for voice output.

The system of claim 10 or 11,

wherein, for the voice recording plug-in, the multimodal platform converts the EVRC format of the voice input from the wireless terminal into a WAV format for transmission to the content server.

The system of claim 10 or 11,

wherein the multimodal platform, in response to a voice recognition command requested by the wireless terminal upon activation of the voice recognition plug-in, performs voice recognition processing according to a predetermined voice recognition algorithm and provides the result to the content server.

The system of claim 10 or 11,

wherein the multimodal platform outputs recorded or text information from the wireless terminal upon activation of the voice output plug-in.

The system of claim 14, wherein the multimodal platform

interworks with a TTS server to play back the text information.

The system of claim 10, wherein the plug-in

is run by transmitting an object (OBJECT) tag to the multimodal platform from the WAP browser of the wireless terminal.

The system of claim 16, wherein the object tag

is included in a WML document when a message is transmitted from the wireless terminal to the multimodal platform.

A method of providing a browser-based multimodal function for a wireless terminal, the method comprising:

a) transmitting a mobile phone number (MDN) and voice recording parameters to a multimodal platform when a voice recording button is pressed on the wireless terminal after a voice recording plug-in is activated;

b) delivering the user's voice input at the wireless terminal to the multimodal platform by EVRC streaming;

c) converting, by the multimodal platform, the user's voice into a predetermined audio type, mapping it to the mobile phone number (MDN), and storing it; and

d) recognizing a signal from the wireless terminal for uploading the recorded message, and delivering the user's voice stored on the multimodal platform to a content server using the mobile phone number (MDN) as a key.

The method of claim 18, wherein the voice recording parameters

include an Audio Type parameter and a MaxLength parameter, the Audio Type parameter following the format required by the content and the MaxLength parameter being set according to the size of the user's voice to be stored.

The method of claim 19, wherein the voice recording parameters

further include a 'UseIntro' parameter for setting output of a prompt message.

The method of claim 18, wherein the conversion of the user's voice

is a conversion into any one of the 'evrc: Enhanced Variable Rate Codec Format', 'alaw: A-Law Format', 'mulaw: Mu-Law Format', 'pcm: Intel PCM Format', 'alaw-wav: A-Law Wav Format', 'mulaw-wav: Mu-Law Wav Format', and 'pcm-wav: PCM Wav Format' formats, performed selectively in response to the message format required by the content server.

A method of providing a browser-based multimodal function for a wireless terminal, the method comprising:

a) establishing, when a voice recognition plug-in is activated, communication with a multimodal platform, and transmitting, by the wireless terminal, voice recognition parameters together with a mobile phone number (MDN) to the multimodal platform;

b) automatically detecting, by the multimodal platform, the end point of the voice using server-side End Point Detection (EPD) and transmitting it to a content server;

c) converting, by the multimodal platform, the user's voice file transmitted in EVRC format into PCM format, and delivering the voice recognition parameters and the recorded voice file to a voice recognition (ASR) server for voice recognition processing;

d) providing, by the voice recognition (ASR) server, a voice recognition result to the multimodal platform, and creating and storing, by the multimodal platform, a temporary mapping entry between the mobile phone number (MDN) and the recognition result;

e) providing, by the multimodal platform, the voice recognition result as text to the terminal browser of the wireless terminal; and

f) upon the user's approval of the voice recognition result, querying, by the content server, the multimodal platform for the most recent search result for the voice recognition result, using the mobile phone number (MDN) as a key.

The method of claim 22, further comprising:

g) generating, by the content server, a next search result page based on the search result, and transmitting the generated page to the WAP browser of the wireless terminal.

The method of claim 22, wherein the voice recognition parameters

include a 'GrammarURI' parameter that specifies the grammar for voice recognition, the grammar being an HTTP URL or a static grammar.

The method of claim 24, wherein the HTTP URL

supports grammars in the JGSF, ABNF, and GRXML formats, and the content server provides the grammar file dynamically.

The method of claim 24, wherein the static grammar

is registered with a speech recognizer mounted on the multimodal platform and is applied to large-vocabulary grammars.

The method of claim 24, wherein the voice recognition parameters

further include a 'UseNbest' parameter for determining whether N-best is used, a 'UseIntro' parameter for output of a prompt message, and a 'MaxLength' parameter giving the maximum length of the voice recognition input in seconds.

A method of providing a browser-based multimodal function for a wireless terminal, the method comprising:

a) establishing, by the wireless terminal, communication with a multimodal platform after a voice output plug-in is activated, and transmitting a mobile phone number (MDN) and voice output parameters to the multimodal platform;

b) reading, by the multimodal platform, a text file or voice file stored on a content server using the URL information included in the voice output parameters;

c) delivering the text to a TTS server for conversion into speech when the URL information points to a text file;

d) providing the speech converted by the TTS server to the multimodal platform, and converting it, by the multimodal platform, into EVRC format, which the wireless terminal can play; and

e) streaming, by the multimodal platform, the EVRC-converted voice to the plug-in on the WAP browser of the wireless terminal.

The method of claim 28, wherein the voice output parameters

include an 'AutoPlay' parameter for determining whether audio is output automatically as soon as the plug-in runs, a 'UseTTS' parameter for determining whether the TTS server is used to convert text to speech, an 'AudioType' parameter for the voice file setting when the 'UseTTS' parameter selects use of a voice file, and an 'AudioURI' parameter indicating the location of the voice file to be converted.

The method of claim 29,

wherein, when the TTS server is not used, the 'AudioType' parameter selects any one of the evrc, alaw, mulaw, pcm, alaw-wav, mulaw-wav, and pcm-wav formats.

The method of claim 29,

wherein, when the 'UseTTS' parameter is set to use the TTS server, the 'AudioType' parameter is plain text or Speech Synthesis Markup Language (SSML) and the 'AudioURI' parameter specifies the plain text or the HTTP URL of an SSML file, and when the 'UseTTS' parameter is set not to use the TTS server, the 'AudioURI' parameter specifies the HTTP URL of a voice file.