KR100612780B1

KR100612780B1 - Speech and music reproduction apparatus

Info

Publication number: KR100612780B1
Application number: KR1020040038415A
Authority: KR
Inventors: 가와시마다까히로
Original assignee: 야마하 가부시키가이샤
Priority date: 2003-05-29
Filing date: 2004-05-28
Publication date: 2006-08-17
Also published as: HK1069433A1; CN1310209C; TW200427297A; KR20040103433A; CN1573921A; TWI265718B

Abstract

A speech and music reproduction apparatus according to the present invention comprises middleware that works with script data (11), a user tone color parameter (12), and user phrase synthesis dictionary data (13), and that reproduces a synthesized speech signal or the like based on a default tone color parameter (18) and default synthesis dictionary data (19); and a sound source (20) and a speaker that reproduce the desired speech or music based on the synthesized speech signal. When an HV script describing various events is used as the script data, the desired waveform data, music phrase data containing note information, and formant frame data are suitably combined and reproduced according to the event type.

HV script, formant frame data, middleware, character string, speech conversion apparatus

Description

Voice and music playback device {SPEECH AND MUSIC REPRODUCTION APPARATUS}

Fig. 1 is a block diagram showing the configuration of a speech reproduction apparatus according to a first embodiment of the present invention.

Fig. 2 is a diagram showing the assignment between utterance units and phrase IDs.

Fig. 3 is a diagram showing an example of the contents of phrase synthesis dictionary data.

Fig. 4 is a diagram showing an example of the format of an SMAF file.

Fig. 5 is a functional block diagram showing an example of an HV authoring tool.

Fig. 6 is a block diagram showing the configuration of a portable communication terminal to which the speech reproduction apparatus is applied.

Fig. 7 is a flowchart showing the process of creating user phrase synthesis dictionary data.

Fig. 8 is a flowchart showing the process of reproducing user phrase synthesis dictionary data.

Fig. 9 is a flowchart showing the process of creating an SMAF file.

Fig. 10 is a flowchart showing the process of reproducing an SMAF file.

Fig. 11 is a block diagram showing the configuration of a speech and music reproduction apparatus according to a second embodiment of the present invention.

Fig. 12 is a diagram showing an example of the assignment between each event and the waveform data and music phrase data.

Fig. 13 is a flowchart showing the speech and music reproduction process according to the second embodiment.

Fig. 14 is a block diagram showing the configuration of a mobile telephone equipped with the speech and music reproduction apparatus according to the second embodiment.

Fig. 15 is a block diagram showing the configuration of a speech and music reproduction apparatus according to a third embodiment of the present invention.

Fig. 16 is a flowchart showing the operation of the speech and music reproduction apparatus shown in Fig. 15.

<Explanation of reference numerals for the main parts of the drawings>

1: speech reproduction apparatus

11: script data

12: user tone color parameter

13: user phrase synthesis dictionary data

14: application software

15: middleware API

16: converter

17: driver

18: default tone color parameter

19: default synthesis dictionary data

20: sound source

The present invention relates to a speech and music reproduction apparatus, and more particularly to one that reproduces a specific language by speech synthesis and converts character information into speech or music for reproduction.

Conventionally, character-string-to-speech conversion apparatuses have been devised that convert character string information, such as electronic mail, into speech and output it. Japanese Unexamined Patent Publication No. 2001-7937 shows an example of such a conventional apparatus, in which character string information is divided into clause units, output as speech, and simultaneously shown on a display.

Also known are a method of reproducing waveform data (or sampling data) created by sampling a music phrase, a voice phrase, or the like, and a method of composing one music phrase from note information such as SMF (standard MIDI file) or SMAF (synthetic music mobile application format) and reproducing that music phrase. For example, Japanese Unexamined Patent Publication No. 2001-51688 discloses an electronic mail reading apparatus that separates the character string information and the musical tone information in an electronic mail and can sound each of them.

However, because the conventional character-string-to-speech conversion apparatus divides character string information into clause (or phrase) units for speech output, the output is a collection of sounds in utterance units (or character units), and the joints between those utterance units sound unnatural to the listener compared with ordinary spoken language. In other words, the conventional apparatus cannot output an entire clause as high-quality speech while varying its tone color; that is, it cannot output natural speech close to human spoken language.

One conceivable way to solve this problem is to sample the speech for each clause (hereinafter called a "phrase") in advance, store it as speech data, and output it as the corresponding speech waveform at reproduction time. With this method, however, the sampling frequency must be raised to improve the quality of the speech output, so a large volume of speech data has to be stored. This was technically difficult in devices with relatively limited storage capacity, such as portable telephones (cellular phones).

Furthermore, in the conventional method of reproducing waveform data created by sampling music or speech, and in the conventional method of building one piece of music data from note information such as SMF or SMAF and reproducing it, the reproduction timing of the music or speech is not described in a text file format. It was therefore difficult to combine speech reproduction based on character string information with waveform data reproduction or music data reproduction as the user intended.

The present invention has been made to solve the above problems, and an object thereof is to provide a speech reproduction apparatus capable of reproducing and outputting a desired clause (or phrase) composed of character string information or the like as high-quality speech while varying its tone color.

Another object of the present invention is to provide a speech and music reproduction apparatus that lets the user easily combine speech reproduction, waveform data reproduction, and music data reproduction, and thereby reproduce speech and music faithful to the user's intention.

The speech reproduction apparatus according to the present invention stores in advance, as synthesis dictionary (composition dictionary) data, a database of formant frame data corresponding to predetermined utterance units. When given information on a character string composed of a sequence of utterance units, it performs speech synthesis using the synthesis dictionary data. Here, formant frame data is replaced with arbitrary user phrase data, and when character string information is given, speech synthesis is performed using the synthesis dictionary data in which that replacement has been made. A tone color parameter for processing the formant frame data is also attached to the user phrase data. In addition, a predetermined data exchange format containing the user phrase data is used for speech synthesis. This data exchange format is, for example, the SMAF file format, which can contain various chunks and music reproduction information as well as user phrase data.

Specifically, the speech reproduction apparatus comprises default synthesis dictionary data that stores in advance formant frame data corresponding to predetermined utterance units, a middleware application program interface (API) for replacing that formant frame data with user phrase data, a converter, a driver, and a sound source. This allows a desired phrase composed of character string information to be reproduced as high-quality speech, with its tone color varied as appropriate.

The speech and music reproduction apparatus according to the present invention stores script data (that is, an HV script; HV-Script) describing instructions to pronounce characters or to reproduce previously stored sound-generation data. Based on this script data, it generates a speech signal corresponding to the characters to produce the desired speech, and generates a sound signal corresponding to the sound-generation data to produce the desired speech or musical tone. The sound-generation data is composed of, for example, waveform data generated by sampling speech or music, and a synthesized sound signal is generated from that waveform data. When the sound-generation data is music data containing note information, a musical tone signal corresponding to the note information is generated from that music data. When formant control parameters (or formant frame data) characterizing the pronunciation of the characters are stored, the speech signal is generated based on those formant control parameters. The script data may be freely created by the user; in that case, it takes a predetermined file format created by text input.

Specifically, the various events described in the HV script are interpreted. When the event type indicates waveform data, that waveform data is read and reproduced. When the event type indicates music phrase data, reproduction processing of that music phrase data is performed, in which the note information is read and reproduced based on the time information in the music phrase data. For other events, the input character string is converted into a formant frame sequence using the synthesis dictionary data, and speech synthesis is performed. In this way, the user can easily combine speech reproduction, waveform data reproduction, and music data reproduction.
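The event-type dispatch described in this paragraph can be illustrated with a minimal sketch. The `Event` class, the kind labels `"waveform"` and `"phrase"`, and the action names are hypothetical names chosen for illustration; the patent text does not define them, and a real implementation would drive a sound source rather than return tuples.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str     # hypothetical labels: "waveform", "phrase", or anything else
    payload: str  # identifier of the data, or the character string to speak


def dispatch(events):
    """Return the playback action chosen for each event, in script order."""
    actions = []
    for ev in events:
        if ev.kind == "waveform":
            # Event type indicates waveform data: read and reproduce it.
            actions.append(("play_waveform", ev.payload))
        elif ev.kind == "phrase":
            # Event type indicates music phrase data: reproduce its notes.
            actions.append(("play_music_phrase", ev.payload))
        else:
            # Any other event: convert the string to formant frames and speak.
            actions.append(("synthesize_speech", ev.payload))
    return actions
```

Keeping the dispatch as a pure function that returns actions makes the ordering easy to inspect before any audio is produced.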

<Embodiments>

Embodiments of the present invention will be described in detail with reference to the accompanying drawings.

Fig. 1 is a block diagram showing the configuration of a speech reproduction apparatus according to the first embodiment of the present invention.

The speech reproduction apparatus (1) shown in Fig. 1 comprises application software (14), a middleware API (middleware application program interface) (15), a converter (16), a driver (17), default tone color parameters (18), default synthesis dictionary data (default composition dictionary data) (19), and a sound source (20). It is configured to receive script data (11), user tone color parameters (12), and user phrase synthesis dictionary data (user phrase composition dictionary data) (13) (variable length) as input and to reproduce speech.

The speech reproduction apparatus (1) basically reproduces speech by formant synthesis based on the CSM (composite sinusoidal modeling) speech synthesis method, using the resources of an FM (frequency modulation) sound source. In this embodiment, user phrase synthesis dictionary data (13) is defined, and the speech reproduction apparatus (1) refers to it to assign user phrases to a tone color parameter in phoneme units. When user phrase synthesis dictionary data (13) is assigned to a tone color parameter in this way, the speech reproduction apparatus (1) replaces the phonemes registered in the default synthesis dictionary data with user phrases at reproduction time and performs speech synthesis based on the replaced data. A "phoneme" here is the smallest unit of pronunciation; in Japanese, phonemes consist of vowels and consonants.
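As a rough illustration of the composite-sinusoid idea, the sketch below renders one frame as a plain sum of sine waves, one per formant. This is a simplification under stated assumptions: the function name, the 8 kHz sample rate, and the use of software sine waves instead of FM sound source hardware are all hypothetical, and pitch handling is omitted. The 20 ms default mirrors the example frame period given elsewhere in this description.

```python
import math


def synthesize_frame(formants, sample_rate=8000, duration_s=0.020):
    """Render one frame as a sum of sinusoids.

    formants: iterable of (frequency_hz, level) pairs.
    Returns a list of float samples.
    """
    n = round(sample_rate * duration_s)  # number of samples in the frame
    samples = []
    for i in range(n):
        t = i / sample_rate
        # Each formant contributes one sinusoid scaled by its level.
        samples.append(
            sum(level * math.sin(2 * math.pi * freq * t) for freq, level in formants)
        )
    return samples
```

For example, `synthesize_frame([(500.0, 1.0), (1500.0, 0.5)])` yields 160 samples whose magnitude never exceeds the sum of the levels.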

Next, the detailed configuration of the speech reproduction apparatus (1) will be described.

In Fig. 1, the script data (11) defines a data format for reproducing "HV" (human voice: speech synthesized by the above method). That is, the script data (11) represents a data format for performing speech synthesis that includes pronunciation character strings containing prosodic symbols (intonation symbols), event data for setting the sound to be produced, event data for controlling the application software (14), and so on. It uses a text input format so that the user can easily enter it by hand.

The data format definition of the script data (11) is language dependent and can be defined for various languages; this embodiment uses a definition for Japanese as an example.

The user phrase synthesis dictionary data (13) and the default synthesis dictionary data (19) each correspond to a database built as follows: actual human speech is sampled and analyzed in pronunciation character units (for example, Japanese 「あ」, 「い」, and so on); eight sets of formant frequencies, formant levels, and pitch are extracted as parameters; and those parameters are stored in advance as formant frame data associated with each pronunciation character unit. The user phrase synthesis dictionary data (13) is a database built outside the middleware. The user can register arbitrary formant frame data in this database, and its registered contents can completely replace the stored contents of the default synthesis dictionary data (19) through the middleware API (15). The default synthesis dictionary data (19), by contrast, is a database built inside the middleware.
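One possible in-memory shape for such a dictionary is sketched below. It assumes that a frame holds eight (frequency, level) pairs plus a single pitch value, which is one reading of the description; the class name, field names, helper `make_frame`, and all numeric values are hypothetical placeholders rather than data from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class FormantFrame:
    formants: Tuple[Tuple[float, float], ...]  # assumed: 8 (frequency_hz, level) pairs
    pitch_hz: float                            # assumed: one pitch value per frame

    def __post_init__(self):
        if len(self.formants) != 8:
            raise ValueError("a formant frame holds exactly 8 frequency/level pairs")


# A synthesis dictionary maps a pronunciation character unit to its frame sequence.
SynthesisDictionary = Dict[str, List[FormantFrame]]


def make_frame(base_hz: float, pitch_hz: float) -> FormantFrame:
    """Build a placeholder frame; the values are illustrative, not measured."""
    pairs = tuple((base_hz * (k + 1), 1.0 / (k + 1)) for k in range(8))
    return FormantFrame(pairs, pitch_hz)


default_dictionary: SynthesisDictionary = {
    "あ": [make_frame(700.0, 120.0)],
    "い": [make_frame(300.0, 120.0)],
}
```

Because the user dictionary and the default dictionary share this shape, swapping one for the other amounts to choosing which mapping the driver reads from.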

It is preferable to provide two kinds of user phrase synthesis dictionary data (13) and default synthesis dictionary data (19), one for male voice quality and one for female voice quality. The speech output of the speech reproduction apparatus (1) changes with the period of each frame; the frame period of the formant frame data registered in the user phrase synthesis dictionary data (13) and the default synthesis dictionary data (19) is set to, for example, 20 ms.

The user tone color parameters (12) and the default tone color parameters (18) are groups of parameters that control the voice quality of the speech output of the speech reproduction apparatus (1). For example, they can specify changes to the eight sets of formant frequencies and formant levels (that is, the amounts of change from the formant frequencies and formant levels registered in the user phrase synthesis dictionary data (13) and the default synthesis dictionary data (19)) and the basic waveform used for formant synthesis, so that a variety of tone colors can be produced.
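A tone color parameter of this kind can be pictured as a set of per-formant change amounts applied to the dictionary values. The sketch below assumes the changes are additive offsets, which the text does not state; the function name and argument layout are likewise hypothetical, and the choice of basic waveform is left out.

```python
def apply_tone_parameter(formants, freq_offsets, level_offsets):
    """Shift each (frequency, level) pair by its per-formant change amount.

    Assumption: changes are additive. The description only says the parameter
    specifies an amount of change from the registered values.
    """
    if not (len(formants) == len(freq_offsets) == len(level_offsets)):
        raise ValueError("one frequency offset and one level offset per formant")
    return [
        (freq + d_freq, level + d_level)
        for (freq, level), d_freq, d_level in zip(formants, freq_offsets, level_offsets)
    ]
```

Applying different offset sets to the same dictionary frames is what lets one dictionary yield several tone colors.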

The default tone color parameters (18) are a set of tone color parameters preset as default values inside the middleware. The user tone color parameters (12) are parameters that the user can create freely; they are set and stored outside the middleware and extend the contents of the default tone color parameters (18) through the middleware API (15).

The application software (14) is software for reproducing the script data (11).

The middleware API (15) forms the interface between the application software (14), which is software, and the converter (16), driver (17), default tone color parameters (18), and default synthesis dictionary data (19), which constitute the middleware.

The converter (16) interprets the script data (11) and, using the driver (17), ultimately converts it into a formant frame data sequence, that is, a continuous series of frame data.

The driver (17) generates a formant frame data sequence based on the pronunciation character string contained in the script data (11) and the default synthesis dictionary data (19), and interprets the tone color parameters to process that formant frame data sequence.

The sound source (20) outputs a synthesized sound signal corresponding to the output data of the converter (16); this signal is output to the speaker and sounded.

Next, the technical features of the speech reproduction apparatus (1) according to this embodiment will be described.

The user tone color parameters (12) include parameters for assigning phrase IDs stored in the user phrase synthesis dictionary data (13) to arbitrary utterance units. Fig. 2 shows an example of the assignment between utterance units and phrase IDs; here it shows the assignment between morae and phrase IDs. In Japanese, a mora is a unit of rhythm (a beat) and corresponds, for example, to one kana character.

By assigning a phrase ID to each utterance unit, the utterance units specified in the user tone color parameters (12) are defined with reference to the user phrase synthesis dictionary data (13) rather than the default synthesis dictionary data (19). It is also preferable that any number of utterance units can be specified within a single tone color parameter in the user tone color parameters (12).
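The lookup rule in this paragraph can be sketched as follows: an utterance unit with an assigned phrase ID is resolved from the user dictionary, and every other unit falls back to the default dictionary. The function name, the `"user"`/`"default"` tags, and the phrase ID values are hypothetical; the strings stand in for formant frame sequences.

```python
def resolve_unit(unit, phrase_assignment, user_dictionary, default_dictionary):
    """Pick the frame data for one utterance unit.

    phrase_assignment: maps an utterance unit to a phrase ID (as in Fig. 2).
    user_dictionary: maps a phrase ID to frame data.
    default_dictionary: maps an utterance unit to frame data.
    """
    phrase_id = phrase_assignment.get(unit)
    if phrase_id is not None and phrase_id in user_dictionary:
        return ("user", user_dictionary[phrase_id])
    # No phrase ID assigned: use the default synthesis dictionary.
    return ("default", default_dictionary[unit])
```

With `{"あ": 1}` as the assignment, 「あ」 resolves to the user phrase stored under ID 1, while 「い」 still comes from the default dictionary.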

The per-utterance-unit phrase ID assignment in the user tone color parameters (12) described above is only one example for this embodiment; other methods may be used as long as they are associated with utterance units.

Next, the user phrase synthesis dictionary data (13) will be described in detail. Fig. 3 shows an example of its contents. The user phrase synthesis dictionary data (13) stores formant frame data consisting of eight sets of formant frequencies, formant levels, and pitch. A "phrase" in Fig. 3 is an expression that forms a unit of meaning or of syllables, such as Japanese 「おはよう」. The size of a "phrase" need not be specifically defined; it may be an expression of any size, such as a word, a syllable, or a sentence.

A tool for creating the user phrase synthesis dictionary data (13) needs to incorporate an analysis engine that analyzes ordinary sound files (files with extensions such as *.wav and *.aif) and generates formant frame data consisting of eight sets of formant frequencies, formant levels, and pitch.

The script data (11) contains event data that instructs a change of voice quality, and the user tone color parameters (12) can be specified by this event data.

For example, when Japanese hiragana and alphanumeric characters are used, the script data (11) can be written as 「TJK12みなさんX10あか」. In this example, 「K」 is event data specifying a default tone color parameter (18), and 「X」 is event data specifying a user tone color parameter (12). 「K12」 specifies one particular default tone color parameter from among multiple kinds of default tone color parameters, and 「X10」 specifies the user tone color parameter shown in Fig. 2 from among multiple kinds of user tone color parameters.

In the above example, the synthesized speech reproduced is 「みなさんこんにちは鈴木です。」. Here, 「みなさん」 is synthesized speech reproduced with reference to the default tone color parameter (18) and the default synthesis dictionary data (19), while 「こんにちは」 and 「鈴木です」 are synthesized speech reproduced with reference to the user tone color parameter (12) and the user phrase synthesis dictionary data (13). That is, 「みなさん」 is synthesized by reading the formant frame data for the four phonemes 「み」, 「な」, 「さ」, and 「ん」 from the default synthesis dictionary data (19), whereas 「こんにちは」 and 「鈴木です」 are each synthesized by reading phrase-unit formant frame data from the user phrase synthesis dictionary data (13).

The example in Fig. 2 shows three utterance units, 「あ」, 「い」, and 「か」, but any characters or symbols that can be written as text may be used. In the above example, the pronunciation symbol 「あ」 following 「X10」 causes the phrase 「こんにちは」 to be sounded, and the pronunciation symbol 「か」 causes the phrase 「鈴木です」 to be sounded. Therefore, to produce the original sound 「あ」 after the above example, a symbol that returns the reference destination to the default synthesis dictionary data (19) (for example, 「XO0」, where 「O」 stands for a predetermined digit or the like) can be inserted.
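The worked example 「TJK12みなさんX10あか」 can be traced with a small tokenizer. This is a sketch under stated assumptions, not the HV script grammar: it only recognizes the 「K」 and 「X」 events followed by digits, it treats the leading 「TJ」 (which this passage does not explain) as an opaque header, and the function names and token labels are hypothetical. The return symbol mentioned above is not handled.

```python
import re

# "K<digits>" selects a default tone color parameter,
# "X<digits>" selects a user tone color parameter.
_EVENT = re.compile(r"([KX])(\d+)")


def tokenize(script):
    """Split a script into header, tone-selection events, and text runs."""
    tokens = []
    pos = 0
    seen_event = False
    for m in _EVENT.finditer(script):
        if m.start() > pos:
            label = "text" if seen_event else "header"
            tokens.append((label, script[pos:m.start()]))
        kind = "default_tone" if m.group(1) == "K" else "user_tone"
        tokens.append((kind, int(m.group(2))))
        pos = m.end()
        seen_event = True
    if pos < len(script):
        tokens.append(("text", script[pos:]))
    return tokens


def expand(tokens, phrase_assignment):
    """Replace units with their user phrases while a user tone is active."""
    out = []
    use_user = False  # assumption: the default dictionary applies initially
    for kind, value in tokens:
        if kind == "default_tone":
            use_user = False
        elif kind == "user_tone":
            use_user = True
        elif kind == "text":
            for unit in value:
                out.append(phrase_assignment.get(unit, unit) if use_user else unit)
    return "".join(out)
```

With the Fig. 2 style assignment `{"あ": "こんにちは", "か": "鈴木です"}`, expanding the tokens of the example script gives the text of the utterance described above, without the final punctuation.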

Next, the data exchange format of the music reproduction sequence data (SMAF: synthetic music mobile application format) used in the speech reproduction apparatus (1) according to this embodiment will be described with reference to Fig. 4, which shows the format of an SMAF file. SMAF is one of the data exchange formats for distributing and mutually using data that expresses music with a sound source, and is a data format specification for reproducing multimedia content on portable terminal devices (personal digital assistants (PDAs), personal computers, cellular phones, and the like).

The SMAF file (30) in the data exchange format shown in Fig. 4 is built from data units called chunks. A chunk consists of a fixed-length (8-byte) header part and an arbitrary-length body part. The header part is divided into a 4-byte chunk ID and a 4-byte chunk size. The chunk ID serves as the chunk's identifier, and the chunk size gives the length of the body part. The SMAF file (30) itself and all the data it contains have this chunk structure.
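The chunk layout described here (4-byte ID, 4-byte size, then a body of that size) can be read with a few lines of code. The sketch assumes the size field is an unsigned big-endian integer, which this passage does not state; the function name and the chunk ID `TEST` in the usage example are hypothetical.

```python
import struct


def read_chunk(buf, offset=0):
    """Read one chunk starting at offset.

    Returns (chunk_id, body, next_offset).
    Assumption: the 4-byte size is an unsigned big-endian integer.
    """
    chunk_id, size = struct.unpack_from(">4sI", buf, offset)  # 8-byte header
    body_start = offset + 8
    body = buf[body_start:body_start + size]
    if len(body) != size:
        raise ValueError("truncated chunk body")
    return chunk_id, body, body_start + size
```

For instance, `read_chunk(b"TEST" + struct.pack(">I", 3) + b"abc")` returns `(b"TEST", b"abc", 11)`. Because chunks nest, the same function can be called again on a returned body to walk sub-chunks.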

As shown in FIG. 4, the SMAF file 30 consists of a contents info chunk 31, an optional data chunk 32, a score track chunk 33, and an HV chunk 36.

The contents info chunk 31 stores various management information about the SMAF file 30, for example the class and type of the content, copyright information, genre name, song title, artist name, and lyricist/composer name. The optional data chunk 32 stores information such as copyright information, genre name, song title, artist name, and lyricist/composer name. The SMAF file 30 does not necessarily have to include the optional data chunk 32.

The score track chunk 33 is a chunk that stores the sequence track of the music to be sent to the sound source, and includes a setup data chunk 34 (optional) and a sequence data chunk 35.

The setup data chunk 34 is a chunk that stores timbre data and the like for the sound source, and stores a sequence of exclusive messages. An example of an exclusive message is a timbre parameter registration message.

The sequence data chunk 35 stores the actual performance data. It stores a mixture of HV note-on events ("HV" stands for human voice), which determine the reproduction timing of the script data 11, and other sequence events. Here, HV events and other music events are distinguished by the channel designated for HV.

The HV chunk 36 includes an HV setup data chunk 37 (optional), an HV user phrase dictionary chunk 38 (optional), and an HV-S chunk 39.

The HV setup data chunk 37 stores the HV user timbre parameters and a message for designating the HV channel. The HV-S chunk 39 stores the HV script data.

The HV user phrase dictionary chunk 38 stores the contents of the user phrase synthesis dictionary data 13. The HV user timbre parameters stored in the HV setup data chunk 37 must include the parameters for assigning phrase IDs to mora shown in FIG. 2.

By applying the SMAF file 30 shown in FIG. 4 to the timbre parameters of this embodiment, synthesized voice (HV) can be reproduced in synchronization with the music, and the contents of the user phrase synthesis dictionary data 13 can also be reproduced.

Next, the HV authoring tool, which is a tool for creating the user phrase synthesis dictionary data 13 shown in FIG. 1 and the SMAF file 30 shown in FIG. 4, will be described with reference to FIG. 5. FIG. 5 is a block diagram showing the functions and an example specification of the HV authoring tool.

When creating the SMAF file 30, the HV authoring tool 42 reads an SMF file (standard MIDI file) 41 created in advance by a MIDI (musical instrument digital interface) sequencer, which contains the note-on events that set the HV pronunciation timing, and converts it into a SMAF file 43 (corresponding to the SMAF file 30 above) based on information obtained from the HV script UI (HV script user interface) 44 and the HV voice editor 45.

The HV voice editor 45 is an editor with a function for editing the HV user timbre parameters (corresponding to the user timbre parameters 12 above) contained in the HV user timbre file 48. In addition to editing the various HV timbre parameters, the HV voice editor 45 can assign a user phrase to any mora.

The interface of the HV voice editor 45 has a menu for selecting a mora and a function for assigning an arbitrary sound file 50 to that mora. The sound file 50 assigned through the interface of the HV voice editor 45 is analyzed by the waveform analyzer 46, which generates formant frame data consisting of eight sets of formant frequencies, formant levels, and pitch. This formant frame data can be input and output as individual files (namely, the HV user timbre file 48 and the HV user synthesis dictionary file 49).
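As a rough picture of what one formant frame might hold, the following sketch models a frame as eight formant frequencies, eight formant levels, and a pitch. The class name `FormantFrame`, the field names, the units, and the choice of a single pitch value per frame are all assumptions for illustration; the description could also be read as one pitch per set.

```python
from dataclasses import dataclass
from typing import Tuple

NUM_FORMANTS = 8  # "eight sets" in the description


@dataclass(frozen=True)
class FormantFrame:
    """One analysis frame (hypothetical layout, units assumed)."""
    frequencies_hz: Tuple[float, ...]  # one frequency per formant
    levels: Tuple[float, ...]          # one level per formant
    pitch_hz: float                    # assumed: a single pitch per frame

    def __post_init__(self):
        # Reject frames that do not carry exactly eight formant sets.
        if len(self.frequencies_hz) != NUM_FORMANTS:
            raise ValueError("expected 8 formant frequencies")
        if len(self.levels) != NUM_FORMANTS:
            raise ValueError("expected 8 formant levels")
```

A phrase assigned to a mora would then be a sequence of such frames, one per analysis interval.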

The HV script UI 44 allows the HV script data to be edited directly. This HV script data can also be input and output as an individual file (namely, the HV script file 47). Note that the HV authoring tool 40 according to this embodiment may consist solely of the HV authoring tool 42, the HV script UI 44, the HV voice editor 45, and the waveform analyzer 46 described above.

Next, an example in which the audio reproducing apparatus 1 according to this embodiment is applied to a portable communication terminal will be described with reference to FIG. 6. FIG. 6 is a block diagram showing the configuration of a portable communication terminal 60 that includes the audio reproducing apparatus 1.

The portable communication terminal 60 corresponds to, for example, a cellular phone, and includes a CPU 61, a ROM 62, a RAM 63, a display unit 64, a vibrator 65, an input unit 66, a communication unit 67, an antenna 68, a voice processing unit 69, a sound source 70, a speaker 71, and a bus 72. The CPU 61 controls the portable communication terminal 60 as a whole. The ROM 62 stores control programs, such as various communication control programs and programs for music reproduction, as well as various constant data.

The RAM 63 is used as a work area and also stores music files, various application programs, and the like. The display unit 64 is, for example, a liquid crystal display (LCD). The vibrator 65 vibrates when the cellular phone receives an incoming call. The input unit 66 consists of operating elements such as a plurality of keys. When operated by the user, these operating elements instruct registration processing for the user timbre parameters, the user phrase synthesis dictionary data, and the HV script data. The communication unit 67 consists of a modulator-demodulator and the like, and is connected to the antenna 68.

The voice processing unit 69 is connected to a transmitting microphone and a receiving speaker (e.g., a microphone and an earphone), and has a function for encoding and decoding voice signals for telephone calls. The sound source 70 reproduces music based on a music file stored in the RAM 63 or the like, and also reproduces voice signals and outputs them to the speaker 71. The bus 72 is a transmission path for transferring data between the components of the cellular phone, namely the CPU 61, the ROM 62, the RAM 63, the display unit 64, the vibrator 65, the input unit 66, the communication unit 67, the voice processing unit 69, and the sound source 70.

The communication unit 67 can download an HV script file or the SMAF file 30 shown in FIG. 4 from a predetermined content server or the like and store it in the RAM 63. The ROM 62 also stores the programs of the application software 14 and the middleware of the audio reproducing apparatus 1 shown in FIG. 1. These programs are read and started by the CPU 61. The CPU 61 also interprets the HV script data stored in the RAM 63 to generate formant frame data, and sends that formant frame data to the sound source 70.

Next, the operation of the audio reproducing apparatus 1 according to this embodiment will be described. First, a method for creating the user phrase synthesis dictionary data 13 will be described. FIG. 7 is a flowchart showing the method for creating the user phrase synthesis dictionary data 13.

First, in step S1, an HV timbre that refers to the user phrase synthesis dictionary data 13 is selected using the HV authoring tool 42 shown in FIG. 5, and the HV voice editor 45 is started. Next, the mora to be used is selected with the HV voice editor 45 and a sound file is attached to it. As a result, in step S2, the HV voice editor 45 generates and outputs user phrase dictionary data (corresponding to the HV user synthesis dictionary file 49).

Next, the HV timbre parameters are edited using the HV voice editor 45. As a result, in step S3, the HV voice editor 45 generates and outputs the user timbre parameters (corresponding to the HV user timbre file 48).

Next, using the HV script UI 44, a sound quality change event designating the corresponding HV timbre is written into the HV script data, followed by the mora to be reproduced. As a result, in step S4, the HV script UI 44 generates and outputs the HV script data (corresponding to the HV script file 47).

Next, the operation of reproducing the user phrase synthesis dictionary data 13 in the audio reproducing apparatus 1 will be described with reference to FIG. 8. FIG. 8 is a flowchart showing the operation of reproducing the user phrase synthesis dictionary data 13 in the audio reproducing apparatus 1.

First, in step S11, the user timbre parameters 12 and the user phrase synthesis dictionary data 13 are registered in the middleware of the audio reproducing apparatus 1. The script data 11 is then registered in the middleware of the audio reproducing apparatus 1, and reproduction of the HV script data is started in step S12.

During reproduction, step S13 monitors whether the script data 11 contains a sound quality change event (X event) designating the user timbre parameters 12.

When a sound quality change event is found in step S13, the phrase ID assigned to each mora is looked up in the user timbre parameters 12, and the data corresponding to that phrase ID is read from the user phrase synthesis dictionary data 13. In step S14, the dictionary data for the corresponding mora in the default synthesis dictionary data 19 managed by the HV driver is thereby replaced with the user phrase synthesis dictionary data 13. The replacement processing of step S14 may also be performed before reproduction of the HV script data.

After step S14 ends, or when no sound quality change event is found in step S13, the flow proceeds to step S15, in which the converter 16 interprets the mora of the script data 11 (if the processing of step S14 was performed, the script data after the replacement processing of step S14) and, using the HV driver, finally converts them into formant frame sequence data.

In step S16, the data converted in step S15 is reproduced by the sound source 20.

The flow then proceeds to step S17, where it is determined whether reproduction of the script data 11 has ended. If it has not ended, the flow returns to step S13. If it has ended, the reproduction processing of the user phrase synthesis dictionary data 13 shown in FIG. 8 ends.
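The loop of steps S13 to S17 can be sketched as below. This is an illustrative model only: script events are represented as hypothetical `(kind, value)` tuples, dictionary entries are plain strings standing in for formant frame data, and the function name `play_script` is an assumption rather than anything defined by the embodiment.

```python
def play_script(events, user_timbres, user_phrase_dict, default_dict):
    """Return the dictionary entries that would be synthesized, in order.

    events           -- list of (kind, value); kind is "X" or "mora" (hypothetical)
    user_timbres     -- timbre id -> {mora: phrase ID}
    user_phrase_dict -- phrase ID -> phrase data
    default_dict     -- mora -> default data
    """
    # Working copy of the default dictionary, as managed by the HV driver.
    active = dict(default_dict)
    output = []
    for kind, value in events:                      # S13: inspect each event
        if kind == "X":                             # sound quality change event
            for mora, phrase_id in user_timbres[value].items():
                # S14: replace the default entry for this mora.
                active[mora] = user_phrase_dict[phrase_id]
        elif kind == "mora":
            # S15/S16: look up the (possibly replaced) entry and "play" it.
            output.append(active[value])
        else:
            raise ValueError(f"unknown event kind: {kind!r}")
    return output                                   # S17: end of script
```

With a timbre that maps "a" to a greeting phrase, a script of `[("mora", "a"), ("X", 10), ("mora", "a")]` yields the default entry first and the user phrase second, while the original default dictionary stays unmodified.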

Next, a method for creating the SMAF file 30 shown in FIG. 4 will be described with reference to FIG. 9. FIG. 9 is a flowchart showing the method for creating the SMAF file 30.

First, the user phrase synthesis dictionary data 13, the user timbre parameters 12, and the script data 11 are created according to the procedure shown in FIG. 7 (see step S21).

Next, in step S22, an SMF file 41 is created that contains the music data and the events that control pronunciation of the HV script data.

Next, the SMF file 41 is input to the HV authoring tool 42 shown in FIG. 5, and the HV authoring tool 42 converts the SMF file 41 into the SMAF file 43 (corresponding to the SMAF file 30 above) (see step S23).

Next, the user timbre parameters 12 created in step S21 are placed in the HV setup data chunk 37 within the HV chunk 36 of the SMAF file 30 shown in FIG. 4, and the user phrase synthesis dictionary data 13 created in step S21 is placed in the HV user phrase dictionary chunk 38 within the HV chunk 36 of the SMAF file 30. The SMAF file 30 is thereby generated and output (see step S24).

Next, the reproduction processing of the SMAF file 30 will be described with reference to FIG. 10. FIG. 10 is a flowchart showing the reproduction processing of the SMAF file 30.

First, in step S31, the SMAF file 30 is registered in the middleware of the audio reproducing apparatus 1 shown in FIG. 1. Here, the audio reproducing apparatus 1 normally registers the music data portion of the SMAF file 30 in the music reproducing section of the middleware and prepares for reproduction.

In step S32, the audio reproducing apparatus 1 determines whether the SMAF file 30 contains the HV chunk 36.

If the determination result in step S32 is "YES", the flow proceeds to step S33, and the audio reproducing apparatus 1 interprets the contents of the HV chunk 36.

In step S34, the audio reproducing apparatus 1 registers the user timbre parameters, the user phrase synthesis dictionary data, and the HV script data.

If the determination result in step S32 is "NO", or when the registration processing in step S34 has ended, the flow proceeds to step S35, and the audio reproducing apparatus 1 interprets the chunks in the music reproducing section.

Next, in response to a "start" signal, the audio reproducing apparatus 1 begins interpreting the sequence data (that is, the actual performance data) in the sequence data chunk 35, thereby executing music reproduction (see step S36).

During this music reproduction, the audio reproducing apparatus 1 interprets the events contained in the sequence data one after another and, in doing so, determines whether each event is an HV note-on (see step S37).

If the determination result in step S37 is "YES", the flow proceeds to step S38, and the audio reproducing apparatus 1 starts reproducing the HV script data of the HV chunk designated by that HV note-on.

After step S38 ends, the audio reproducing apparatus 1 executes the reproduction processing of the user phrase synthesis dictionary data shown in FIG. 8. That is, during the reproduction of the HV script data started in step S38, the audio reproducing apparatus 1 determines whether there is a sound quality change event (X event) designating the user timbre parameters 12 (see step S39).

If such a sound quality change event exists, that is, if the determination result in step S39 is "YES", the flow proceeds to step S40. The phrase ID assigned to each mora is looked up in the user timbre parameters 12, the data corresponding to that phrase ID is read from the user phrase synthesis dictionary data 13, and the dictionary data for the corresponding mora in the default synthesis dictionary data 19 managed by the HV driver is replaced with the user phrase synthesis dictionary data. The replacement processing of step S40 may also be executed before reproduction of the HV script data.

After step S40 ends, or when no sound quality change event is found in step S39, the flow proceeds to step S41, in which the converter 16 interprets the mora of the script data and, using the HV driver, finally converts them into formant frame sequence data.

Next, the flow proceeds to step S42, where the data converted in step S41 is reproduced by the HV reproducing section of the sound source 20.

The flow then proceeds to step S43, where the audio reproducing apparatus 1 determines whether reproduction of the music has ended. If music reproduction has ended, the reproduction processing of the SMAF file 30 ends. If it has not ended, the flow returns to step S37.

If, in step S37, the event in the sequence data is not an HV note-on, the audio reproducing apparatus 1 recognizes the event as part of the music data and converts it into sound source reproduction event data (see step S44).

Next, the flow proceeds to step S45, where the audio reproducing apparatus 1 reproduces the data converted in step S44 in the music reproducing section of the sound source 20.
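The branch at step S37 amounts to routing each sequence event either to HV reproduction or to ordinary music reproduction. A minimal sketch follows, assuming events are hypothetical dictionaries with `type` and `channel` keys and that an HV note-on is recognized by its channel, as described for the sequence data chunk 35. The function name and the target labels are illustrative only.

```python
def route_sequence_events(events, hv_channel):
    """Tag each sequence event with the reproducing section that handles it.

    Returns a list of (target, event) pairs in the original order, where
    target is "hv" (steps S38 to S42) or "music" (steps S44 and S45).
    """
    routed = []
    for event in events:
        # S37: is this event a note-on on the channel designated for HV?
        is_hv_note_on = (
            event["type"] == "note_on" and event["channel"] == hv_channel
        )
        routed.append(("hv" if is_hv_note_on else "music", event))
    return routed
```

Keeping the events in one ordered list, rather than splitting them into two, reflects that voice and music are reproduced in synchronization from the same sequence.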

As described above, this embodiment adopts a voice reproduction method based on formant synthesis using an FM sound source, which has the following three advantages.

(1) The user can assign any desired phrase. That is, voice reproduction does not depend on a fixed dictionary and can be performed with a timbre closer to the desired one.

(2) Because only a part of the default synthesis dictionary data 19 is replaced with the user phrase synthesis dictionary data 13, an excessive increase in data capacity in the audio reproducing apparatus 1 can be avoided. In addition, because a part of the default synthesis dictionary data 19 can be replaced with an arbitrary phrase, pronunciation in units of phrases can be realized, which eliminates the audible unnaturalness at the joints between pronunciation units that arises in speech synthesized from conventional pronunciation units.

(3) Because an arbitrary phrase can be designated in the HV script data, speech synthesis in units of mora and voice pronunciation in units of phrases can be used together.

Furthermore, according to this embodiment, timbre changes at the formant level can be realized, unlike a method that reproduces waveform data formed by sampling phrases in advance. Although the data size and quality in this embodiment depend on the frame rate, high-quality voice reproduction can be realized with far less data than the conventional method based on sampled waveform data. Therefore, the audio reproducing apparatus 1 of this embodiment can easily be incorporated in a portable communication terminal such as a cellular phone, so that the contents of electronic mail and the like can be reproduced as high-quality voice.

FIG. 11 is a block diagram showing the configuration of a voice and music reproducing apparatus according to a second embodiment of the present invention. Here, the HV script (that is, the HV script data) is a file that defines a format for reproducing voice. It defines the data for executing speech synthesis, consisting of pronunciation character strings that include prosodic symbols (that is, symbols designating a manner of pronunciation such as accent), settings for the sound to be pronounced, and messages to a reproduction application or the like, and it is created by text input so that users can write it easily.

The HV script only needs to be written in a file format that can be read by application software such as a text editor and edited as text. One example is a text file created with a text editor. The HV script is language dependent and can be defined for various languages, but in this embodiment the HV script is assumed to be defined for Japanese.

Reference numeral 101 denotes an HV script player (HV-Script player), which controls reproduction, stopping, and so on of the HV script. When an HV script is registered in the HV script player 101 and a reproduction instruction is received, the HV script player 101 starts interpreting the HV script. Then, depending on the type of each event described in the HV script, it causes one of the HV driver 102, the waveform reproduction player 104, and the phrase reproduction player 107 to execute processing based on that event.
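The dispatch performed by the HV script player 101 can be pictured as a table from event type to handler. The sketch below is hypothetical throughout: the event kinds ("speech", "waveform", "phrase"), the `(kind, payload)` representation, and the class name `ScriptPlayer` are assumptions, and the handlers merely stand in for the HV driver 102, the waveform reproduction player 104, and the phrase reproduction player 107.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, str]  # (event kind, payload); hypothetical representation


class ScriptPlayer:
    """Forward each script event to the handler registered for its kind."""

    def __init__(self, handlers: Dict[str, Callable[[str], None]]):
        self.handlers = handlers

    def play(self, events: List[Event]) -> None:
        for kind, payload in events:
            handler = self.handlers.get(kind)
            if handler is None:
                raise ValueError(f"no handler for event kind {kind!r}")
            handler(payload)


# Usage: record which stand-in component receives which payload.
log = []
player = ScriptPlayer({
    "speech": lambda p: log.append(("hv_driver", p)),
    "waveform": lambda p: log.append(("waveform_player", p)),
    "phrase": lambda p: log.append(("phrase_player", p)),
})
player.play([("speech", "a"), ("waveform", "w1"), ("phrase", "p1")])
```

A table-driven design keeps the player itself unaware of how each component reproduces its data, which matches the separation between the three players in FIG. 11.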

The HV driver 102 reads synthesis dictionary data from a ROM (read-only memory), not shown, and refers to it. The human voice has predetermined formants (that is, a characteristic frequency spectrum) that depend on the structure of the human body (for example, the shape of the vocal cords and the oral cavity), and the synthesis dictionary data stores parameters based on voice formants in association with pronunciation characters. The synthesis dictionary data corresponds to a database that stores in advance, for each pronunciation character, the parameters obtained by sampling and analyzing actual sounds in units of pronunciation characters (for example, phoneme units such as Japanese "あ" and "い") as formant frame data.

For example, in the case of the CSM (Composite Sinusoidal Modeling) speech synthesis method, the synthesis dictionary data stores eight sets of formant frequencies, formant levels, pitch, and the like as parameters. Compared with a method that reproduces waveform data created by sampling voice, this speech synthesis method has the advantage that the amount of data is very small. The synthesis dictionary data may additionally store parameters that control the sound quality of the reproduced voice (for example, parameters for specifying changes to the eight sets of formant frequencies and formant levels).

The HV driver 102 interprets the pronunciation character strings, including prosodic symbols, in the HV script, converts them into a formant frame sequence using the synthesis dictionary data, and outputs the sequence to the HV sound source 103. The HV sound source 103 generates a pronunciation signal based on the formant frame sequence output from the HV driver 102 and outputs it to the adder 110.

The waveform reproduction player 104 reproduces, stops, and otherwise controls waveform data in which voice, music, imitation sounds, and the like have been sampled in advance. Reference numeral 105 denotes a waveform data RAM (random access memory), which stores default waveform data in advance. The user can store user waveform data held in the user data RAM 112 into the waveform data RAM 105 via the registration API (application program interface) 113. On receiving a reproduction instruction from the HV script player 101, the waveform reproduction player 104 reads the waveform data from the waveform data RAM 105 and outputs it to the waveform reproducer 106. The waveform reproducer 106 generates a pronunciation signal based on the waveform data output from the waveform reproduction player 104 and outputs it to the adder 110. The sampled waveform data is not limited to the PCM (pulse-code modulation) format and may be in a compressed audio format such as MP3 (MPEG audio layer 3).

The phrase playback player 107 plays and stops music phrase data (or music data). The music phrase data is in SMF (Standard MIDI File) format and consists of note information, which indicates the pitch, volume, and so on of each tone to be sounded, and time information, which indicates the sounding time of each tone. Reference numeral 108 denotes a music phrase data RAM, which stores default music phrase data in advance. The user can store user music phrase data held in the user data RAM 112 into the music phrase data RAM 108 via the registration API 113.

On receiving a playback instruction from the HV script player 101, the phrase playback player 107 reads the music phrase data from the music phrase data RAM 108, manages the timing of the note information in that data, and outputs the note information to the phrase sound source 109 according to the time information described in the music phrase data. The phrase sound source 109 generates a musical tone signal from the note information output by the phrase playback player 107 and outputs it to the adder 110. An FM system, a PCM system, or the like can be adopted for the phrase sound source 109, but because it only needs to be able to play music phrase data, the sound source system need not be limited.
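The time management performed by the phrase playback player can be pictured with a minimal Python sketch. It assumes, as in Standard MIDI Files, that each note carries a delta time in ticks relative to the previous note. The `Note` fields, the `schedule` helper, and the fixed `ticks_per_second` rate are hypothetical simplifications and are not taken from the patent (a real SMF player derives timing from the file's division and tempo).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Note:
    delta_ticks: int  # time information: ticks since the previous note
    pitch: int        # note information: pitch number
    velocity: int     # note information: loudness


def schedule(notes: List[Note], ticks_per_second: float = 480.0) -> List[Tuple[float, int, int]]:
    """Turn delta times into absolute start times in seconds.

    Each returned tuple is (start_seconds, pitch, velocity), in the order
    in which the note information would be handed to the sound source.
    """
    timeline = []
    now_ticks = 0
    for note in notes:
        now_ticks += note.delta_ticks  # accumulate the time information
        timeline.append((now_ticks / ticks_per_second, note.pitch, note.velocity))
    return timeline
```

With the assumed rate of 480 ticks per second, three notes spaced 240 ticks apart would be scheduled at 0.0, 0.5, and 1.0 seconds.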

The adder 110 combines the synthesized speech signal output from the HV sound source 103, the audio signal output from the waveform reproducer 106, and the musical tone signal output from the phrase sound source 109, and outputs the combined signal to the speaker 111. The speaker 111 produces speech and/or musical tones based on the combined signal from the adder 110.
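The role of the adder can be illustrated with a short Python sketch. It assumes that each of the three signals is a sequence of floating-point samples in the range -1.0 to 1.0 and that the sum is clipped to that range; the sample representation, the clipping, and the `mix` name are assumptions made for illustration, since the patent does not specify how the adder is implemented.

```python
from itertools import zip_longest
from typing import Iterable, List


def mix(*signals: Iterable[float]) -> List[float]:
    """Sum the given sample streams sample by sample.

    Shorter streams are padded with silence (0.0), and each summed
    sample is clipped to the range [-1.0, 1.0].
    """
    mixed = []
    for samples in zip_longest(*signals, fillvalue=0.0):
        total = sum(samples)
        mixed.append(max(-1.0, min(1.0, total)))
    return mixed
```

For example, `mix(speech, audio, tones)` would stand in for the combined signal sent to the speaker 111, where the three arguments are the outputs of the HV sound source, the waveform reproducer, and the phrase sound source.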

The HV driver 102, the waveform playback player 104, and the phrase playback player 107 may operate at the same time so that the speech and music based on the synthesized speech signal, the audio signal, and the musical tone signal are sounded simultaneously (the audio signal and the musical tone signal may be collectively called the "sound signal"). Alternatively, the HV script player 101 may manage the processing timing of the HV driver 102, the waveform playback player 104, and the phrase playback player 107 so that the speech and music based on their respective processing are not played simultaneously. In this embodiment, simultaneous processing by the HV driver 102, the waveform playback player 104, and the phrase playback player 107 is prohibited. In Fig. 11, separate RAMs are shown for the waveform data RAM 105, the music phrase data RAM 108, and the user data RAM 112 for convenience of explanation, but these functions may instead be assigned to different storage areas within a single RAM.

Fig. 12 shows an example of the definition of events for playing the waveform data and music phrase data (hereinafter collectively called "sound data") described in an HV script. The initial letter "D" of an event denotes a default definition, and "O" denotes a user definition. Each event is assigned a type, either waveform or phrase. The default definitions (D0 to D63) are assigned the default waveform data stored in advance in the waveform data RAM 105 and the default music phrase data stored in advance in the music phrase data RAM 108; up to 64 items of default waveform data and default music phrase data can be assigned. The user definitions (O0 to O63) are assigned sampled waveform data and music phrase data created freely by the user; up to 64 items of sampled waveform data and music phrase data can be assigned.
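The event definitions can be modeled as a small lookup table. The following Python sketch is illustrative only: the `SoundEvent` and `EventTable` names, the `payload` field, and the file names in the usage note are hypothetical, while the "D"/"O" prefixes, the 0 to 63 numbering, and the two types (waveform and phrase) follow the definition example above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

MAX_SLOTS = 64  # D0..D63 and O0..O63, as in the definition example


@dataclass(frozen=True)
class SoundEvent:
    name: str     # e.g. "D20" or "O0"
    kind: str     # "waveform" or "phrase"
    payload: str  # hypothetical handle to the stored sound data


class EventTable:
    """Maps event names to registered sound data (illustrative only)."""

    def __init__(self) -> None:
        self._events: Dict[str, SoundEvent] = {}

    def register(self, prefix: str, number: int, kind: str, payload: str) -> SoundEvent:
        if prefix not in ("D", "O"):
            raise ValueError("prefix must be 'D' (default) or 'O' (user)")
        if not 0 <= number < MAX_SLOTS:
            raise ValueError("event number must be in 0..63")
        if kind not in ("waveform", "phrase"):
            raise ValueError("kind must be 'waveform' or 'phrase'")
        event = SoundEvent(f"{prefix}{number}", kind, payload)
        self._events[event.name] = event  # re-registering a slot overwrites it
        return event

    def lookup(self, name: str) -> Optional[SoundEvent]:
        """Return the event registered under this name, or None."""
        return self._events.get(name)
```

A user registration such as `table.register("O", 0, "waveform", "suzuki.pcm")` would then make `table.lookup("O0")` resolve to waveform data, which is the information a script player needs in order to choose the right playback path.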

Data indicating the relationship between each event of the waveform type shown in Fig. 12 and the waveform data that the event designates is stored in advance in the waveform data RAM 105. Likewise, data indicating the relationship between each event of the phrase type and the music phrase data that the event designates is stored in the music phrase data RAM 108. These data are updated when the user registers waveform data or music phrase data held in the user data RAM 112.

An HV script is written, for example, as "TJK12みなさんO0です。D20". In the leading "TJK12", "T" is a code indicating the start of the HV script, and "J" specifies the country and character code, here indicating that the HV script is written in Japanese. "K12" is a code that sets the voice quality and indicates that the twelfth voice quality is selected. The strings "みなさん" and "です" are interpreted by the HV driver 102, and the Japanese speech "みなさん" and "です" is sounded from the speaker 111. If pronunciation character strings such as "みなさん" and "です" contain prosodic symbols indicating accent (or stress) and manner of pronunciation, the speech is sounded with that accent (or stress).

In the user event "O0", waveform data in which, for example, speech pronounced "鈴木" has been sampled is registered. The user event "O0" is interpreted by the waveform playback player 104, so that the speech "鈴木" is sounded from the speaker 111. In the default event "D20", lively short music phrase data, for example, is registered. The event "D20" is interpreted by the phrase playback player 107, so that lively music is sounded from the speaker 111. In this case, the reproduced speech is "みなさん鈴木です" (followed by playback of the music phrase), and waveform data is played only for the "鈴木" portion. Compared with speech produced by synthesizing pronunciation units such as "みなさん" and "です", speech produced by playing waveform data has more natural joins between pronunciation units. In addition, sounding the phrase "鈴木" by playing a distinctive waveform lets the reproduced speech be presented to the user effectively. As described above, by describing events that designate playback of waveform data or music phrase data in the HV script, the playback timing of the waveform data or music phrase data can be specified freely. How an HV script is written is a matter of design choice and is not limited to the notation described above.
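The structure of the example script can be made concrete with a small tokenizer. This Python sketch rests on assumptions that go beyond the text: it treats the header as exactly "T", one uppercase letter, "K", and a decimal number, and it treats any "D" or "O" followed by one or two digits as an event, with everything else passed through as text for speech synthesis. The `tokenize` function and its token labels are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed header shape: T, a language letter, K, and a voice-quality number.
HEADER = re.compile(r"T(?P<lang>[A-Z])K(?P<voice>\d+)")
# Assumed event shape: D or O followed by a one- or two-digit number.
EVENT = re.compile(r"[DO]\d{1,2}")


def tokenize(script: str) -> Tuple[str, int, List[Tuple[str, str]]]:
    """Split an HV script into (language, voice number, tokens).

    Tokens are ("text", string) for speech synthesis or
    ("event", name) for a waveform or phrase event.
    """
    header = HEADER.match(script)
    if header is None:
        raise ValueError("script must start with T<letter>K<number>")
    tokens: List[Tuple[str, str]] = []
    pos = header.end()
    for match in EVENT.finditer(script, pos):
        if match.start() > pos:
            tokens.append(("text", script[pos:match.start()]))
        tokens.append(("event", match.group()))
        pos = match.end()
    if pos < len(script):
        tokens.append(("text", script[pos:]))
    return header.group("lang"), int(header.group("voice")), tokens
```

Under these assumptions, `tokenize("TJK12みなさんO0です。D20")` yields the language code "J", the voice number 12, and the tokens text "みなさん", event "O0", text "です。", and event "D20", in that order. A script whose spoken text itself contains "D" or "O" followed by a digit would be misread by this simple pattern.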

Next, the operation of the speech and music reproduction apparatus according to this embodiment is described with reference to the flowchart of Fig. 13. First, the user creates an HV script with a text editor and registers it in the HV script player 101 (see step S101). At this point, if user-defined waveform data or music phrase data exists, the registration API 113 reads that data from the user data RAM 112, stores the waveform data in the waveform data RAM 105, and stores the music phrase data in the music phrase data RAM 108.

When the user issues a start instruction (step S103), the HV script player 101 starts interpreting the HV script (see step S102). The HV script player 101 determines whether the HV script contains an event beginning with "D" or "O" (step S104); when such an event is input, it determines whether the event type is waveform data (step S105). If the event type is waveform data, the HV script player 101 hands the processing to the waveform playback player 104. The waveform playback player 104 reads the waveform data whose number follows "D" or "O" from the waveform data RAM 105 and outputs it to the waveform reproducer 106 (step S106). The waveform reproducer 106 generates an audio signal from this waveform data and outputs it to the speaker 111 via the adder 110 (step S107). The speaker 111 thus sounds the corresponding speech.

If the event type is not waveform data in step S105, the flow proceeds to step S108, where the HV script player 101 determines whether the event type is music phrase data. If so, the HV script player 101 hands the processing to the phrase playback player 107. The phrase playback player 107 reads the music phrase data whose number follows "D" or "O" from the music phrase data RAM 108 and, according to the time information in that data, outputs the note information in the data to the phrase sound source 109 (see step S109). The phrase sound source 109 generates a musical tone signal from this note information and outputs it to the speaker 111 via the adder 110 (see step S110). The music is thus sounded from the speaker 111. If it is determined in step S108 that the event type is not music phrase data, the event is recognized as being of a type that the speech and music reproduction apparatus according to this embodiment cannot process, and the flow proceeds to step S113.

If it is determined in step S104 that no event beginning with "D" or "O" is described in the HV script, the HV script player 101 hands the processing to the HV driver 102. The HV driver 102 converts the character string into a formant frame sequence using the synthesis dictionary data and outputs it to the HV sound source 103 (see step S111). The HV sound source 103 generates a synthesized speech signal from this formant frame sequence and outputs it to the speaker 111 via the adder 110 (see step S112). The corresponding speech is thus sounded from the speaker 111.

Each time the processing for an event ends, the HV script player 101 determines whether interpretation has been completed up to the last description in the HV script (step S113). If descriptions remain to be interpreted, the flow returns to step S104. When all descriptions in the HV script have been interpreted, the speech and music reproduction processing shown in Fig. 13 ends.
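The branching in steps S104 to S113 amounts to a dispatch loop over the items of the script. The Python sketch below is a simplified model under stated assumptions: the script is taken to be already split into ("text", ...) and ("event", ...) tokens, event types come from a plain dictionary, and instead of producing sound the function returns the name of the component that would handle each item. The `dispatch` function and the handler labels are hypothetical.

```python
from typing import Dict, List, Tuple


def dispatch(tokens: List[Tuple[str, str]], kinds: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (handler, item) pairs in script order.

    `kinds` maps an event name such as "O0" to "waveform" or "phrase".
    """
    plan: List[Tuple[str, str]] = []
    for token_type, value in tokens:  # the loop ends when nothing is left (S113)
        if token_type != "event":     # S104: not a "D" or "O" event
            plan.append(("hv_driver", value))        # S111, S112
            continue
        kind = kinds.get(value)
        if kind == "waveform":        # S105
            plan.append(("waveform_player", value))  # S106, S107
        elif kind == "phrase":        # S108
            plan.append(("phrase_player", value))    # S109, S110
        # Any other type cannot be processed and is skipped (on to S113).
    return plan
```

In this model an event whose type is neither waveform nor phrase simply produces no entry, mirroring the flow that proceeds straight to step S113.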

In the example HV script of this embodiment, "TJK12みなさんO0です。D20", the following phrase "です" must be sounded after sounding of the waveform data defined by the event "O0" has finished. For example, when the HV script player 101 interprets an event for waveform data (or music phrase data), it holds back playback of the next event, and when the waveform playback player 104 (or the phrase playback player 107) finishes sounding, that player outputs a signal indicating the end of sounding to the HV script player 101.
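One way to picture the end-of-sounding signal is with a thread and an event flag. The Python sketch below is only an analogy under stated assumptions: `AsyncPlayer` stands in for the waveform or phrase playback player, playback is replaced by appending to a log, and `threading.Event` plays the role of the signal sent back to the script player. None of these names come from the patent.

```python
import threading
from typing import List, Tuple


class AsyncPlayer:
    """Stand-in for the waveform or phrase playback player (hypothetical)."""

    def __init__(self, log: List[str]) -> None:
        self._log = log
        self.finished = threading.Event()

    def play(self, name: str) -> None:
        self.finished.clear()

        def run() -> None:
            self._log.append(f"played {name}")  # real playback would happen here
            self.finished.set()                 # end-of-sounding signal

        threading.Thread(target=run).start()


def play_in_order(tokens: List[Tuple[str, str]]) -> List[str]:
    """Process tokens one at a time, waiting for each event to finish."""
    log: List[str] = []
    player = AsyncPlayer(log)
    for token_type, value in tokens:
        if token_type == "event":
            player.play(value)
            player.finished.wait()  # hold back the next item until signalled
        else:
            log.append(f"spoke {value}")
    return log
```

Because the loop waits on the flag before moving on, the text after an event is never processed until the player has reported that it is done, which is the ordering the paragraph above requires.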

If simultaneous playback by the HV driver 102, the waveform playback player 104, and the phrase playback player 107 is permitted, their playback processing may be controlled through the description in the HV script. For example, when the HV script is written as "TJK12みなさんO0 3です。D20", the " " (space) and "3" following "O0" represent an event that provides a predetermined silent period, and the speech played by the HV driver 102 is controlled to be silent while the phrase "鈴木" designated by "O0" is being sounded. When the HV script is written as "TJK12こんにちは。D20みなさん O0 3です。", the music designated by "D20" and the speech "みなさん鈴木です。" are sounded simultaneously.

Fig. 14 is a block diagram showing an example configuration of a mobile phone equipped with the speech and music reproduction apparatus according to this embodiment. Reference numeral 141 denotes a CPU that controls each part of the mobile phone. Reference numeral 142 denotes an antenna for transmitting and receiving data. Reference numeral 143 denotes a communication unit, which modulates transmission data and outputs it to the antenna 142, and demodulates reception data received by the antenna 142. Reference numeral 144 denotes a voice processing unit which, during a call, converts the other party's voice data output from the communication unit 143 into a voice signal and outputs it to an ear speaker (or earphone, not shown), and converts the voice signal input from a microphone (not shown) into voice data and outputs it to the communication unit 143.

Reference numeral 145 denotes a sound source, which has the same functions as the HV sound source 103, the waveform reproducer 106, and the phrase sound source 109 shown in Fig. 11. Reference numeral 146 denotes a speaker, which sounds the desired speech or musical tones. Reference numeral 147 denotes an operation unit operated by the user. Reference numeral 148 denotes a RAM that stores HV scripts, user-defined waveform data and music phrase data, and the like. Reference numeral 149 denotes a ROM that stores the program executed by the CPU 141, the synthesis dictionary data, the default waveform data, the default music phrase data, and the like. Reference numeral 150 denotes a display unit, which shows the results of user operations, the state of the mobile phone, and the like on its screen. Reference numeral 151 denotes a vibrator, which vibrates on instruction from the CPU 141 when the mobile phone receives a call. These blocks are connected to one another via a bus B.

This mobile phone has a function for generating waveform data from speech: speech input through the microphone is sent to the voice processing unit 144, converted into waveform data, and stored in the RAM 148. When music phrase data is downloaded from a web server by the communication unit 143, that music phrase data is also stored in the RAM 148.

In accordance with the program stored in the ROM 149, the CPU 141 performs the same processing as the HV script player 101, the HV driver 102, the waveform playback player 104, and the phrase playback player 107 shown in Fig. 11. The CPU 141 also interprets the events described in the HV script read from the RAM 148. If an event indicates sounding by speech synthesis, the CPU 141 reads and refers to the synthesis dictionary data in the ROM 149, converts the character string described in the HV script into a formant frame sequence, and outputs it to the sound source 145.

If an event indicates playback of waveform data, the CPU 141 reads the waveform data whose number follows "D" or "O" in the HV script from the RAM 148 or the ROM 149 and outputs it to the sound source 145. If an event indicates playback of music data, the CPU 141 reads the music phrase data whose number follows "D" or "O" in the HV script from the RAM 148 or the ROM 149 and, according to the time information in that data, outputs the note information in the data to the sound source 145.

The sound source 145 generates a synthesized speech signal from the formant frame sequence output by the CPU 141 and outputs it to the speaker 146. It likewise generates an audio signal from the waveform data output by the CPU 141, and a musical tone signal from the music phrase data output by the CPU 141, and outputs each to the speaker 146. The speaker 146 sounds speech or musical tones as appropriate based on the synthesized speech signal, the audio signal, or the musical tone signal.

When the user operates the operation unit 147 to launch text editing software, the user can create an HV script while checking the content shown on the screen of the display unit 150, and can save the resulting HV script in the RAM 148.

An HV script created by the user can also be used as a ringtone melody. In this case, it is assumed that setting information indicating that an HV script is to be used on an incoming call is stored in advance in the RAM 148. When the communication unit 143 receives, via the antenna 142, call information transmitted from another mobile phone or the like, it notifies the CPU 141 of the incoming call. On receiving the notification, the CPU 141 reads the setting information from the RAM 148, reads the HV script indicated by the setting information from the RAM 148, and starts interpreting it. The subsequent processing is as described above: the speaker 146 sounds speech or musical tones according to the types of the events described in the HV script.

The user can also attach an HV script to an e-mail and send it to another terminal. Alternatively, the CPU 141 may interpret the body of an e-mail itself as an HV script and, on instruction from the user, output a playback instruction for the HV script to the voice processing unit 144 according to the description in the e-mail. The CPU 141 need not carry out all the functions of the HV script player 101, the HV driver 102, the waveform playback player 104, and the phrase playback player 107; for example, the sound source 145 may take over any of these functions.

The present embodiment need not be limited to mobile phones (cellular phones); it may also be applied to portable terminals such as a PHS (Personal Handyphone System, a registered trademark in Japan) or a PDA (personal digital assistant) to perform the speech and music reproduction described above.

As an application of this embodiment, allowing the user to enter HV scripts on a portable mobile terminal such as a mobile phone lets ordinary users easily create HV scripts that play not only text for speech synthesis but also fixed sampled waveform data and music phrase data. When both the sending and receiving portable mobile terminals are equipped with the speech and music reproduction apparatus according to this embodiment, users can exchange HV scripts by attaching them to e-mail. The receiving terminal can then appropriately play, from the received e-mail, not only text for speech synthesis but also fixed sampled data and music phrase data. Speech and music played using an HV script can also be used as a ringtone melody.

Next, the configuration and operation of the speech and music reproduction apparatus according to the third embodiment of the present invention are described with reference to Figs. 15 and 16. The third embodiment combines the first and second embodiments described above: the middleware performs HV playback, waveform playback, and music phrase playback, and the sound source generates synthesized sound signals based on these three types of data, combining the signals from the three systems and outputting the result to the speaker.

Fig. 15 combines part of the configuration of Fig. 11 with the configuration of Fig. 1 as its basis; reference numerals 211 to 219 correspond to reference numerals 11 to 19 in Fig. 1, and reference numerals 303 to 313 correspond to reference numerals 103 to 113 in Fig. 11. That is, the user data RAM of Fig. 11 is connected to the middleware of Fig. 1 via the user data API, and the waveform playback player and phrase playback player of Fig. 11, connected to the waveform data RAM and the music phrase data RAM respectively, are added inside the middleware. The application software also serves as the HV script player of Fig. 11 and, through the HV script middleware API, hands processing to the HV converter, the waveform playback player, or the music phrase playback player according to the type of event described in the HV script. The sound source combines the three functions of the HV sound source, the waveform generator, and the music phrase sound source of Fig. 11; their output signals are combined in the adder and sounded from the speaker. Because each component shown in Fig. 15 operates in the same way as the corresponding component shown in Fig. 1 and Fig. 11, a detailed description is omitted.

Fig. 16 is a flowchart showing the operation of the speech and music reproduction apparatus shown in Fig. 15. It adds part of the flowchart of Fig. 8 to the flowchart of Fig. 13 as its basis; steps S211 to S216 correspond to steps S11 to S16 in Fig. 8, and steps S304 to S310, S312, and S313 correspond to steps S104 to S110, S112, and S113 in Fig. 13. That is, when the determination in step S104 of Fig. 13 is "NO", the same processing as steps S13, S14, and S15 of Fig. 8 is performed, and HV sound source playback is then performed in step S312. In this way, inputting a single HV script enables playback of speech by the HV sound source, playback of waveform data by the waveform playback player, and playback of music phrases based on note information by the music phrase playback player. Because the processing in each step shown in Fig. 16 is the same as in Fig. 8 and Fig. 13, a detailed description is omitted.

Finally, the prosodic symbols used in the above embodiments are described. For example, "は^3じま$り^ま$5し>10た。" written in an HV script causes the words "はじまりました" (that is, the character string to be pronounced) to be synthesized with a given intonation; here "^", "$", ">", and so on are the prosodic symbols. A given inflection (accent) is applied to the character that follows a prosodic symbol (or, when a number is written immediately after the prosodic symbol, to the character that follows the number).

Specifically, "^" means raising the pitch during pronunciation, "$" means lowering the pitch during pronunciation, and ">" means lowering the volume during pronunciation, and speech synthesis is performed accordingly. When a numerical value is written immediately after a prosodic symbol, the value specifies the amount of change of the accent to be added. For example, in the phrase "は^3じま", "は" is pronounced at the standard pitch and volume, "じ" has its pitch raised by an amount of "3" during its pronunciation, and the following "ま" is pronounced at the raised pitch.

Thus, to add a predetermined accent (or intonation) to a character contained in the words to be pronounced, a prosodic symbol as described above (optionally followed by a numerical value indicating the amount of change in intonation) is written immediately before that character. Although the prosodic symbols described above control pitch and volume during pronunciation, the invention is not limited to these; for example, symbols that control sound quality or speed may also be used. Adding such symbols to the HV script makes it possible to express the manner of pronunciation, such as accents, appropriately.
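
As an illustration of the notation, the following sketch reads a string containing the symbols "^", "$", and ">" and attaches each symbol, with its optional numerical value, to the character that follows it. It is a minimal sketch, not the parser of the apparatus: the function name parse_prosody, the action labels, and the use of None for "no value written" are assumptions, and the sketch does not model how a raised pitch carries over to later characters.

```python
# Hypothetical sketch of reading prosodic symbols in an HV script string.
# Labels and return format are illustrative assumptions.

PROSODY_SYMBOLS = {
    "^": "pitch_up",     # raise pitch during pronunciation
    "$": "pitch_down",   # lower pitch during pronunciation
    ">": "volume_down",  # lower volume during pronunciation
}
DIGITS = "0123456789"


def parse_prosody(text):
    """Return a list of (character, action, amount) tuples.

    action is None for characters with no preceding symbol; amount is None
    when no numerical value follows the symbol.
    """
    result = []
    pending = None  # (action, amount) waiting for the next character
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in PROSODY_SYMBOLS:
            action = PROSODY_SYMBOLS[ch]
            i += 1
            start = i
            # Collect the optional numerical value right after the symbol.
            while i < len(text) and text[i] in DIGITS:
                i += 1
            amount = int(text[start:i]) if i > start else None
            pending = (action, amount)
        else:
            action, amount = pending if pending else (None, None)
            result.append((ch, action, amount))
            pending = None
            i += 1
    return result


if __name__ == "__main__":
    for item in parse_prosody("は^3じま$り^ま$5し>10た。"):
        print(item)
```

For "は^3じま" this yields "は" with no marker, "じ" with a pitch rise of 3, and "ま" with no marker, which matches the reading given in the text; the synthesizer would then keep "ま" at the raised pitch.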

The present invention need not be limited to the embodiments described above; modifications and the like within the scope of the invention are also included in the present invention.

Claims

Storage means for storing synthesis dictionary data in which formant frame data corresponding to a phonetic character representing a predetermined pronunciation unit is stored in advance in association with the phonetic character;

Registration means for registering, as user dictionary data, user phrase data indicating separate formant frame data for use in place of the formant frame data corresponding to the phonetic characters stored in the synthesis dictionary data, according to a user's operation;

Input means for inputting script data including a character string composed of a plurality of phonetic characters and event data instructing replacement of the formant frame data corresponding to at least some of the phonetic characters of the character string; and

Speech synthesis means for interpreting the script data, reading formant frame data from the synthesis dictionary data based on the phonetic characters other than said some phonetic characters, reading user phrase data from the user dictionary data based on the event data and said some phonetic characters of the character string, and generating synthesized speech based on the read formant frame data and the read user phrase data.

A speech reproduction apparatus comprising the means recited above.

(deleted)

The apparatus of claim 1, further comprising:

Music reproducing means for reproducing music based on music reproducing information,

wherein the input means inputs a data exchange format, which is an information structure that includes the music reproducing information for reproducing music and voice reproducing information including the script data and the user phrase data, and that serves to synchronize music reproduction based on the music reproducing information with voice reproduction based on the voice reproducing information,

the music reproducing means reproduces the music reproducing information, and

the speech synthesis means reproduces the voice reproducing information.

(deleted)

A portable terminal device comprising the audio reproducing apparatus according to claim 1.

First storage means for storing sound data,

Second storage means for storing script data describing a character string composed of phonetic characters each representing a predetermined pronunciation unit, prosodic symbols representing the manner of pronunciation of the character string, and event data instructing reproduction of the sound data;

Reproduction instruction means for sequentially reading the script data from the second storage means, instructing pronunciation based on the character string and the prosodic symbols in the script data, and, when the event data in the script data is read, instructing reproduction of the sound data based on the event data;

Synthesized pronunciation signal generation means for generating a synthesized pronunciation signal by performing voice synthesis based on the pronunciation instruction of the character string from the reproduction instruction means;

Sound signal generation means for reading the sound data from the first storage means based on a playback instruction of the sound data from the playback instruction means and generating a sound signal based on the sound data;

and synthesized speech generating means for generating synthesized speech based on the synthesized pronunciation signal and generating a sound based on the sound signal.

The apparatus of claim 6,

wherein the sound data is waveform data generated by sampling a predetermined sound.

The apparatus of claim 6,

wherein the sound data is music data including note information indicating the pitch and volume of a sound to be produced.

The apparatus of claim 6,

wherein the synthesized pronunciation signal generation means stores formant control parameters that characterize the pronunciation of characters, and performs speech synthesis using the formant control parameters corresponding to the character string in the script data.

The apparatus according to any one of claims 6 to 9,

wherein the script data is described in a file composed of text data.

The apparatus of claim 6, wherein

synthesis dictionary data, in which formant frame data corresponding to a phonetic character representing a predetermined pronunciation unit is stored in advance in association with the phonetic character, is stored,

user phrase data indicating separate formant frame data to be used in place of the formant frame data corresponding to the phonetic characters stored in the synthesis dictionary data is registered as user dictionary data in accordance with a user's operation, and

when the script data includes event data instructing replacement of the formant frame data corresponding to at least some of the phonetic characters in the character string, the synthesized pronunciation signal generation means reads formant frame data from the synthesis dictionary data based on the phonetic characters other than said some phonetic characters, reads the user phrase data from the user dictionary data based on the event data and said some phonetic characters of the character string, and generates the synthesized pronunciation signal based on the read formant frame data and the read user phrase data.

A portable terminal device comprising the audio reproducing apparatus according to any one of claims 6 to 9 and 11.

A portable terminal device comprising the audio reproducing apparatus according to claim 3.

(deleted)

A portable terminal device comprising the audio reproducing apparatus of claim 10.