KR20060009151A

KR20060009151A - Incantation speech recognition wireless site navigation system & service

Info

Publication number: KR20060009151A
Application number: KR1020040056630A
Authority: KR
Inventors: 최중인
Original assignee: 최중인
Priority date: 2004-07-20
Filing date: 2004-07-20
Publication date: 2006-01-31

Abstract

Searching for wireless Internet sites / content on a mobile phone is very inconvenient, because the wireless Internet / content address (URL) must be typed into the wireless search box with the keypad buttons, or reached through several steps of text-based wireless navigation. This invention proposes a voice navigation system and service, built on a speech recognition multimodal platform, that goes directly to the destination through a voice command. The key idea of the service is that each voice command is a unique incantation representing one site / content. The incantation acts as a brand for that wireless site / content and makes the service fun to use, and, unlike ordinary words, the uniqueness and distinctiveness of its syllables can also raise the speech recognition rate. For example, to access a given wireless site / content, the user says "Alohomora" (the spell that opens doors in Harry Potter and the Sorcerer's Stone) into the mobile phone.

To implement this, a distributed speech recognition multimodal platform such as the one shown in the representative drawing is used. When the mobile phone application records the voice command (incantation) and sends the data to the speech recognition multimodal platform, the platform recognizes it and connects the phone to the corresponding wireless site.

Spell, incantation, mobile phone applications, speech recognition, voice navigation, distributed speech recognition (DSR), automatic speech recognition (ASR), text-to-speech (TTS)

Description

Incantation speech recognition wireless site navigation system & service

Figure 1. Configuration of the entire system

Figure 2. Type of DSR

Figure 3. ASR server

Figure 4. WA server

Figure 5. DSR API

The present invention applies distributed speech recognition technology to mobile phones so that, instead of pressing keypad buttons to find a desired wireless site / content, the user goes there directly by speaking the corresponding voice command into the phone. It covers both this technology and a service that uses it. Conventional keypad-based navigation is inconvenient because the user must pass through several steps or type the input with the buttons. Conventional navigation based on speech recognition is single-mode and must use the voice channel, so it was limited to voice-based content and unsuitable for searching wireless sites / content delivered over the data channel. To compensate, techniques were proposed that embed a speech engine in the terminal, or that record on the terminal and send the recording to a speech server, but they were not commercialized because of poor recognition performance. To solve these problems, the invention first proposes a platform that applies distributed speech recognition on a multimodal base supporting voice and data at the same time. On that basis it proposes a service model for voice navigation that improves ease of use, raises speech recognition performance, and makes the service more fun for the user.

The system is first based on a multimodal platform with voice input and data output, so that data-based wireless sites / content can be navigated. As shown in Figure 1, when the mobile phone terminal records, compresses, and transmits the voice, the distributed speech recognition engine preprocesses it and forwards it to an ASR server equipped with a high-performance speech recognition algorithm. Through this process voice and data can be presented together on the phone screen, and speech recognition performance can be improved. On this platform the invention proposes a new type of voice navigation service in which the voice commands are not plain generic words or names (e.g. news, stocks, 'Yahoo', 'Pmang Mobile Go-Stop') but fun incantation-type commands. For example, to access the 'Pmang Mobile Go-Stop' service the user says "Alohomora" (the spell used to open doors in Harry Potter and the Sorcerer's Stone) and is connected immediately to the Pmang Mobile Go-Stop service site registered under that incantation. Each service provider creates its own incantation and registers it in the Grammar of the multimodal speech recognition platform; when that voice command is given, the phone is connected directly to the URL of the site registered for the command.
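The registration and lookup step described above can be illustrated with a minimal sketch. The class name `IncantationGrammar`, its methods, the whitespace and case normalization, the service title, and the `example.com` URL are all hypothetical; the patent does not specify the Grammar format or any API.

```python
from typing import Dict, Optional, Tuple


class IncantationGrammar:
    """Hypothetical registry mapping an incantation to (site title, URL)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, str]] = {}

    @staticmethod
    def _normalize(phrase: str) -> str:
        # Assumption: ignore spacing and letter case when matching text
        # returned by the recognizer against registered incantations.
        return "".join(phrase.split()).casefold()

    def register(self, incantation: str, title: str, url: str) -> None:
        """Called by a service provider to claim an incantation."""
        key = self._normalize(incantation)
        if not key:
            raise ValueError("incantation must not be empty")
        if key in self._entries:
            # Each incantation must stay unique to one site / content.
            raise ValueError(f"incantation already registered: {incantation!r}")
        self._entries[key] = (title, url)

    def resolve(self, recognized_text: str) -> Optional[Tuple[str, str]]:
        """Return (title, url) for a recognized incantation, else None."""
        return self._entries.get(self._normalize(recognized_text))


grammar = IncantationGrammar()
# Placeholder title and URL, for illustration only.
grammar.register("Alohomora", "Mobile Go-Stop", "http://example.com/gostop")
```

Rejecting duplicate registrations is one way to keep each incantation tied to a single site, which the branding idea in the text relies on.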

The configuration of the entire system of the present invention is as follows (see Figure 1).

The terminal has a DSR API that communicates with the DSR server on the back end. The DSR server interworks with the ASR server, the TTS server, and the VXML interpreter server to support the services hosted on the Contents server.

A. System configuration

1. DSR

Figure 2 shows the form of DSR used in this system. Voice output and screen output are performed simultaneously through interworking between VoiceXML and the terminal application. Because voice input and key input can also be performed simultaneously on the terminal over the data channel, the system can be said to support a DSR-based multimodal UI. Close interworking with the terminal is also possible: the speech engine can be made high-performance and, by using the voice-call codec (EVRC), tuning is easy without changing the terminal hardware.

2. ASR server

The ASR server implements and runs the service of a speech recognition engine that is supplied as a library. It is implemented as a server over TCP; an application that needs speech recognition acts as a client and uses the ASR service.

The ASR server uses one or more TCP ports to serve clients, and accepts multiple TCP socket connections so that several clients can use the service at the same time.

The ASR server provides the speech recognition service through virtual speech recognition channels. A channel is a unit of use of the ASR server's service, separate from the TCP socket, and each channel is identified by its own unique ID. The ASR server supports multiple channels over a single TCP socket, and the maximum number of usable channels is limited by the ASR server.
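As a rough illustration of the channel model described above, the sketch below keeps recognition channels separate from socket connections, gives each channel a unique ID, and enforces a server-wide limit. The names `AsrChannelPool`, `ChannelLimitError`, and the string connection IDs are hypothetical; the patent does not describe the server's data structures or wire protocol.

```python
import itertools
from typing import Dict, List


class ChannelLimitError(RuntimeError):
    """Raised when the server-wide channel limit is reached."""


class AsrChannelPool:
    """Hypothetical bookkeeping for virtual recognition channels."""

    def __init__(self, max_channels: int) -> None:
        self.max_channels = max_channels
        self._ids = itertools.count(1)      # source of unique channel IDs
        self._owner: Dict[int, str] = {}    # channel ID -> connection ID

    def open_channel(self, connection_id: str) -> int:
        # The limit applies to the whole server, not to one socket.
        if len(self._owner) >= self.max_channels:
            raise ChannelLimitError("no free recognition channel")
        channel_id = next(self._ids)
        self._owner[channel_id] = connection_id
        return channel_id

    def close_channel(self, channel_id: int) -> None:
        self._owner.pop(channel_id, None)

    def close_connection(self, connection_id: str) -> None:
        # Release every channel that was opened over this socket.
        for cid in [c for c, o in self._owner.items() if o == connection_id]:
            del self._owner[cid]

    def channels_of(self, connection_id: str) -> List[int]:
        return sorted(c for c, o in self._owner.items() if o == connection_id)
```

Keying the table by channel ID rather than by socket is what lets one TCP connection carry several channels while the server still caps the total.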

After finishing a requested recognition task, and before sending the result to the client, the ASR server stores the result data on its own server. This data is used for continuous tuning of the speech recognition engine.

The ASR server is launched from a console window, and under normal conditions the program keeps running without terminating. The console window shows the current channel status, I/O, and speech recognition results (see Figure 3).

3. DSR API

The DSR API performs the following functions.

(1) Voice recording

On receiving a recording request signal, the DSR API uses WIPI's recording API to record the audio coming in from the microphone. The recorded audio is encoded to EVRC using the terminal's DSP and stored.

(2) Detection of the voice start point and end point

The user presses the button once when starting to speak, and recording continues even after the button is released. If the user does not speak for a certain period, recording ends automatically. The start point and end point of the captured audio are then detected on the basis of the zero-crossing rate and the level in dB.
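A minimal sketch of endpoint detection from the two cues named above, zero-crossing rate and level in dB, is given below. The frame length, both dB thresholds, the ZCR threshold, the rule for combining the two cues, and the assumption that samples are floats in [-1, 1] are all illustrative choices; the patent gives no concrete values or decision rule.

```python
import math
from typing import Optional, Sequence, Tuple


def frame_level_db(frame: Sequence[float]) -> float:
    """RMS level of a frame in dB relative to full scale (1.0)."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return 20.0 * math.log10(max(rms, 1e-10))  # floor avoids log10(0)


def zero_crossing_rate(frame: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def detect_endpoints(
    samples: Sequence[float],
    frame_len: int = 160,          # assumed: 20 ms at 8 kHz
    db_threshold: float = -30.0,   # assumed: level that counts as speech
    zcr_threshold: float = 0.3,    # assumed: ZCR typical of unvoiced sounds
    db_floor: float = -45.0,       # assumed: minimum level for the ZCR cue
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the utterance, or None."""
    if frame_len < 2:
        raise ValueError("frame_len must be at least 2")
    active = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        level = frame_level_db(frame)
        # A frame counts as speech if it is loud enough, or if it is
        # quieter but noisy in the way unvoiced consonants tend to be.
        if level >= db_threshold or (
            zero_crossing_rate(frame) >= zcr_threshold and level >= db_floor
        ):
            active.append(start)
    if not active:
        return None
    return active[0], active[-1] + frame_len
```

Using the zero-crossing rate as a second cue is a common way to avoid clipping quiet unvoiced sounds at the edges of an utterance, which a level threshold alone tends to miss.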

(3) Voice level display

While voice recording is in progress, a small icon indicating the volume of the voice being recorded is displayed at a specific position on the terminal screen.

B. Interface

As one example of the incantation-type voice navigation service provided on this platform, the terminal interface works as follows.

When speech recognition is available, a lip-shaped icon is displayed. When the user presses the voice input button (the "Call" button), the volume of the voice being recorded is shown as an icon at the bottom of the screen. The user then speaks the voice command for the desired wireless site / content as if reciting a spell. For example, once the user has said the incantation "Alohomora" and recording has finished, the audio is handed to the DSR; while it is being recognized, an icon indicating that recognition is in progress is placed at the bottom of the screen so that the user can follow the progress.

Voice recording has its own recognition hotkey: when the key is pressed so that it goes from the off state to the on state, recording starts with a confirmation tone. A progress bar showing the remaining recording time is then displayed for about 2 to 3 seconds, and endpoint detection proceeds as well.

If recognition succeeds, the title of the site / content, for example "Pmang Mobile Go-Stop", appears on the screen as feedback on the recognition result, and if the result is wrong the user can try again before the connection is made. If the user does nothing, the phone connects directly to that site / content.

The greatest advantage of the present invention is that the user reaches a wireless site / content in a single step through a voice command, and that the voice commands take the form of fun spells and incantations in order to raise speech recognition performance and make the service commercially viable. This adds the entertainment value associated with magic and sorcery, so many users will enjoy memorizing the commands and uttering them as if casting a spell. The length of the incantation and the uniqueness and distinctiveness of its syllables can also raise the speech recognition rate considerably.

An incantation created and spread in this way can brand the wireless site / content and be used in major advertising and marketing. For example, when the "Pmang Mobile Go-Stop" service is launched, the advertising copy might read: "Press the voice recognition call button on your mobile phone and say 'Alohomora'. A new service will appear before you."

Claims

Wireless Internet Voice Navigation Technology Using Multi-modal Voice Recognition Platform

Wireless internet voice navigation service through incantation type voice command

Divisional voice command registration service for wireless Internet site / contents