KR100585711B1 - Method for audio and voice synthesizing

Info

Publication number: KR100585711B1
Application number: KR1020030010113A
Authority: KR
Inventors: 김숙향
Original assignee: 엘지전자 주식회사
Priority date: 2003-02-18
Filing date: 2003-02-18
Publication date: 2006-06-07
Also published as: KR20040074727A

Abstract

The present invention relates to an audio and voice synthesis method that makes it possible to record a user's voice while playing various background audio and to synthesize the two.

In synthesizing audio and voice on a VXML server using a VXML document, the invention is achieved by a method comprising: detecting, in the VXML document, an element (synthesizer) that instructs the synthesis of audio and voice, and playing the file (file1) designated by the audio playback file element (source) of the VXML document; receiving the user's voice and storing it in the file (file2) designated by the voice storage file element (dest) of the VXML document; and synthesizing the audio file (file1) and the voice file (file2) and storing the result in the file (file3) designated by the synthesis file element (synthesis_file) of the VXML document.

Description

METHOD FOR AUDIO AND VOICE SYNTHESIZING

FIG. 1 is a block diagram showing the configuration of a typical VXML system.

FIG. 2 is a diagram showing the functions and meanings of the 47 tags defined in conventional VXML 1.0.

FIG. 3 is a diagram showing the configuration of the synthesis element (<synthesizer>) that, according to the present invention, records a voice while background audio is being served.

FIG. 4 is a diagram showing VXML content that uses the <synthesizer> element of the present invention to record a voice while music plays.

The present invention relates to an element for audio and voice synthesis, and more particularly to an audio and voice synthesis method that enables a VXML server to record a voice while playing various background audio and to synthesize the two.

As speech recognition technology has matured to a practical level, speech recognition applications such as voice portals have emerged as a new area of interest. Speech recognition technology has also been an important theme in the preparation of the IMT-2000 project. Against this background, the VoiceXML (Voice eXtensible Markup Language, hereinafter VXML) document format has recently drawn attention as the most promising way to realize voice services that combine speech synthesis and speech recognition.

For reference, VXML is a document format defined using XML syntax. A VXML document written in this format serves as a kind of scenario that specifies how the dialogue proceeds in a voice application.

Putting VXML to practical use requires three components: a voice platform, VXML content, and a VXML interpreter.

FIG. 1 is a block diagram showing the configuration of a typical VXML system. The document server (110) is a software component that transmits VXML documents at the request of the VXML interpreter. A web server already used to host a website can fill this role as it is. A web server delivers the documents that clients request, and although it mostly delivers HTML documents, it places no particular restriction on document format, so it can deliver VXML documents without modification.

Using a web server has the advantage that a VXML voice service can easily be connected to Internet content. In particular, the CGI function of the web server allows voice input to be reflected in the generation of a new document. If a database is linked in through ASP, PHP, or the like, real-time information such as stock quotes, weather, and sports coverage can be served.

Unfortunately, the HTML documents that make up most existing Internet content differ greatly from VXML documents in structure and content and cannot be reused as they are. In other words, the web server itself can be used unchanged, but the content must be built anew.

Next, the voice platform (Implementation Platform, 130) acts as the voice terminal. It consists of hardware that supports voice input/output and communication and software that supports speech recognition and speech synthesis. Following the instructions of the VXML interpreter, it performs voice input/output operations such as speech synthesis, speech recognition, audio file playback, voice recording, and DTMF (telephone keypad) input, and reports the results. It also detects and reports events that can occur during a voice service, such as no response from the user, expiry of various timeouts, and user disconnection.

Next, the VXML interpreter context (120) provides the auxiliary functions needed to run the interpreter, such as interpreter initialization, user profile processing, interpreter monitoring, and user connection management, so that the same interpreter module can easily be ported to different execution environments.

The most important component of the overall system is the VXML interpreter module. The VXML interpreter analyzes the structure of a VXML document with its built-in XML parser, interprets the instructions in the document, and acts on them. It executes control structures, issues voice input/output instructions to the voice platform, handles events raised by the voice platform, handles requests from the VXML interpreter context, and switches to a new working document through the document server. By directing all of these tasks, it acts as the control tower that makes the VXML voice service possible.

The markup languages generally used for the voice Internet are VoiceXML 1.0 and VoiceXML 2.0.

In VoiceXML 1.0, the service presents speech to the user either by synthesizing speech through the <prompt> element or by playing a pre-recorded audio file through the <audio> element. For reference, FIG. 2 shows the functions and meanings of the 47 tags defined in VXML 1.0.

In VoiceXML 2.0, the service presents speech to the user either by synthesizing speech through the <prompt> element together with the elements defined in the Speech Synthesis Markup Language Specification, or by playing a pre-recorded audio file through the <audio> element.

However, the elements available in VoiceXML 1.0 or 2.0 can deliver only stilted synthesized speech or a single audio stream. As a result, a voice Internet service cannot offer richer features in which the user speaks while listening to audio and the two audio files are then synthesized, as when singing along to an accompaniment and recording it in a karaoke room.

Accordingly, to address this problem, an object of the present invention is to provide an audio and voice synthesis method that uses VoiceXML, the markup language used for the voice Internet, to record a voice while playing various background audio and to synthesize the two.

To achieve this object, the present invention, in synthesizing audio and voice on a VXML server using a VXML document, is characterized by comprising: detecting, in the VXML document, an element (synthesizer) that instructs the synthesis of audio and voice, and playing the file (file1) designated by the audio playback file element (source) of the VXML document; receiving the user's voice and storing it in the file (file2) designated by the voice storage file element (dest) of the VXML document; and synthesizing the audio file (file1) and the voice file (file2) and storing the result in the file (file3) designated by the synthesis file element (synthesis_file) of the VXML document.

Hereinafter, a preferred embodiment of the present invention is described with reference to the accompanying drawings.

FIG. 3 shows the configuration of the synthesis element (<synthesizer>) that, according to the present invention, records a voice while background audio is being served. In the DTD (Document Type Definition), the <synthesizer> element is positioned as a child element of the <form> element. According to its own DTD entry, the <synthesizer> element may in turn contain <prompt>, <audio>, and the elements belonging to '%event.handler'.

Referring to FIG. 3, the attributes that characterize the <synthesizer> element are 'id', 'type', 'beep', 'maxtime', 'source', 'dest', and 'synthesis_file'. Their meanings are as follows.

First, 'id' is the name of the <synthesizer>, and 'type' is the MIME format of the audio file to be synthesized with the voice. The same value also applies to the MIME format of the voice file the user records and to that of the synthesized output file.

Next, when 'beep' is 'true', a beep tone is sounded before recording begins. The default value is 'false'.

Next, 'maxtime' is the maximum recording time, 'dest' is the URI (Uniform Resource Identifier) of the voice file recorded by the user, and 'source' is the URI of the audio file to be synthesized with the voice.

Next, 'synthesis_file' is the URI of the file in which the result of synthesizing the pre-designated audio file with the user's recorded voice is stored. Of the attributes above, 'type', 'source', 'dest', and 'synthesis_file' are required.
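The attribute rules described above can be sketched in Python. This is only an illustration under stated assumptions: the names SynthesizerAttrs and parse_synthesizer_attrs are hypothetical, attribute values are taken as plain strings, and the format of 'maxtime' is left unparsed because the text does not define it.

```python
from dataclasses import dataclass
from typing import Mapping, Optional

# Attributes the description marks as required.
REQUIRED = ("type", "source", "dest", "synthesis_file")


@dataclass(frozen=True)
class SynthesizerAttrs:
    """Hypothetical container for the <synthesizer> attributes."""
    type: str                      # MIME format shared by all three files
    source: str                    # URI of the background audio file
    dest: str                      # URI of the recorded voice file
    synthesis_file: str            # URI of the synthesized output file
    id: Optional[str] = None       # name of the <synthesizer>
    beep: bool = False             # default is 'false' per the description
    maxtime: Optional[str] = None  # maximum recording time, kept as raw text


def parse_synthesizer_attrs(attrs: Mapping[str, str]) -> SynthesizerAttrs:
    """Validate the required attributes and apply the documented default."""
    missing = [name for name in REQUIRED if name not in attrs]
    if missing:
        raise ValueError("missing required attribute(s): " + ", ".join(missing))
    return SynthesizerAttrs(
        type=attrs["type"],
        source=attrs["source"],
        dest=attrs["dest"],
        synthesis_file=attrs["synthesis_file"],
        id=attrs.get("id"),
        beep=attrs.get("beep", "false") == "true",
        maxtime=attrs.get("maxtime"),
    )
```

Rejecting a document that lacks a required attribute at parse time mirrors what a validating parser would do with a DTD that declares those attributes as required.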

The <prompt> and <audio> elements that appear as children of the <synthesizer> element are used to play speech or audio to the user before recording begins, and '%event.handler' is used to handle errors that may occur while the voice is being recorded.

In other words, when the VXML interpreter (120) encounters a <synthesizer> element while interpreting a VXML document, it requests the voice platform (Implementation Platform, 130), a component of the VXML server, to record the voice under the name given by the 'dest' attribute and then synthesize it with the audio file given by the 'source' attribute. The voice platform stores the result under the name given by the 'synthesis_file' attribute.
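The text does not say how the voice platform combines the two files. As a minimal sketch, assuming both files have already been decoded to 16-bit PCM samples at the same sample rate, the synthesis step could be plain additive mixing with clipping. The function name mix_pcm16 and the list-of-integers representation are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence

INT16_MIN, INT16_MAX = -32768, 32767


def mix_pcm16(background: Sequence[int], voice: Sequence[int]) -> List[int]:
    """Mix two 16-bit PCM sample sequences by addition with clipping.

    Assumption: both inputs share one sample rate and channel layout.
    The shorter input is treated as silence once it runs out.
    """
    length = max(len(background), len(voice))
    mixed = []
    for i in range(length):
        b = background[i] if i < len(background) else 0
        v = voice[i] if i < len(voice) else 0
        # Clamp the sum so it stays inside the 16-bit sample range.
        mixed.append(max(INT16_MIN, min(INT16_MAX, b + v)))
    return mixed
```

In this sketch the 'source' samples play the role of `background`, the 'dest' samples the role of `voice`, and the returned list is what would be written to 'synthesis_file'.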

FIG. 4 shows example VXML content that uses the <synthesizer> element of the present invention to record a voice while music plays. The user hears, for example, the sentence "At the tone, please say your greeting" and can begin recording once the beep sounds. While recording, the user hears the 'sample2.wav' audio designated by the 'source' attribute.

The recorded voice is stored in the 'sample1.wav' file designated by the 'dest' attribute, and the VXML server (not shown) synthesizes 'sample1.wav' and 'sample2.wav' to produce the final audio file 'sample3.wav'. If the user does not say anything, the sentence 'I didn't hear anything, please try again.' is played as speech.
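FIG. 4 itself is not reproduced in this text, so the following Python sketch rebuilds a comparable document from the description and derives the order of operations from it. The file names and the two sentences come from the description. The 'id', 'type', and 'maxtime' values, the use of <noinput> as the event handler, and the function name plan_actions are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical reconstruction of content like FIG. 4.
VXML_DOC = """
<vxml version="1.0">
  <form>
    <synthesizer id="greeting" type="audio/wav" beep="true" maxtime="10s"
                 source="sample2.wav" dest="sample1.wav"
                 synthesis_file="sample3.wav">
      <prompt>At the tone, please say your greeting</prompt>
      <noinput>I didn't hear anything, please try again.</noinput>
    </synthesizer>
  </form>
</vxml>
"""


def plan_actions(doc: str) -> list:
    """Return the ordered (action, argument) steps for a <synthesizer>."""
    root = ET.fromstring(doc)
    synth = root.find("./form/synthesizer")
    if synth is None:
        return []
    actions = []
    prompt = synth.find("prompt")
    if prompt is not None and prompt.text:
        actions.append(("say", prompt.text.strip()))   # spoken before recording
    if synth.get("beep", "false") == "true":
        actions.append(("beep", ""))                   # tone before recording
    actions.append(("play", synth.get("source")))      # background audio
    actions.append(("record", synth.get("dest")))      # runs while audio plays
    actions.append(("mix", synth.get("synthesis_file")))
    return actions


for step in plan_actions(VXML_DOC):
    print(step)
```

The "play" and "record" steps are listed one after the other only because a list is ordered. In the described method the user records while the background audio is playing.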

As described above, the present invention defines a tag, not previously defined, that performs music playback and voice synthesis together. With the <synthesizer> element, a user can easily obtain a karaoke-like effect using VXML syntax.

As described in detail above, the present invention has the effect of allowing a voice Internet service to go beyond stilted speech or a single audio stream, recording the user's voice while playing various background audio and synthesizing the two.

Claims

An audio and voice synthesis method for synthesizing audio and voice on a VXML server using a VXML document, the method comprising:

detecting, in the VXML document, an element (synthesizer) that instructs the synthesis of audio and voice, and playing the file (file1) designated by the audio playback file element (source) of the VXML document;

receiving a user's voice and storing it in the file (file2) designated by the voice storage file element (dest) of the VXML document; and

synthesizing the audio file (file1) and the voice file (file2) and storing the result in the file (file3) designated by the synthesis file element (synthesis_file) of the VXML document.

The method of claim 1, wherein the VXML server generates a beep when a beep element is present in the element of the VXML document for audio synthesis, and records for up to the time specified by a maxtime element.

The audio and speech synthesis method of claim 1, wherein each of the elements 'dest', 'source', and 'synthesis_file' is expressed in a 'URI' format.

The method of claim 1, wherein the VXML document comprises: elements (<prompt>, <audio>) for presenting voice or audio to the user; and an element (%event.handler) for handling errors that may occur during voice recording.

The method of claim 1, wherein the VXML document comprises: an id element representing the name of the synthesizer that synthesizes the audio and voice; and a type element indicating the MIME format of the audio file to be synthesized with the voice.