KR100634142B1

KR100634142B1 - Portable terminal device

Info

Publication number: KR100634142B1
Application number: KR20040020474A
Authority: KR
Inventors: 가와이마사히코
Original assignee: 야마하 가부시키가이샤
Priority date: 2003-03-27
Filing date: 2004-03-25
Publication date: 2006-10-16
Also published as: CN1534955A; KR20040084855A; HK1066365A1; CN100359907C; JP2004294816A

Abstract

A portable terminal device transmits text entered in a given language to a translation means (such as a delivery server) that translates it into another language, receives in reply translation result information containing the translated text and pronunciation data indicating how that text is pronounced, and pronounces the translated text on the basis of the pronunciation data. Formant parameters are used as the pronunciation data for speech synthesis. As a result, the translation result information can be communicated with a relatively small communication capacity, and the portable terminal device can also receive related sounds, images, and the like.

Description

Portable terminal device {POTABLE TERMINAL DEVICE}

FIG. 1 shows the schematic configuration of a system according to a preferred embodiment of the present invention, comprising a mobile phone and a delivery server connected via a communication network.

FIG. 2 shows an example of the contents of a dictionary registered in the dictionary database.

FIG. 3 shows the relationship between event data and duration data in each type of sequence data.

FIG. 4 is a diagram for explaining the data storage and processing structure of SMAF.

FIG. 5 shows an example in which an HV track chunk is added to a conventional SMAF file for use in the embodiment of the present invention.

FIG. 6 shows three formats for voice reproduction: (a) the TSeq type, (b) the PSeq type, and (c) the FSeq type.

FIG. 7 shows an example of a data exchange format for voice reproduction sequence data.

FIG. 8A shows the structure of sequence data.

FIG. 8B shows the relationship between duration and gate time.

FIG. 9 is a diagram for explaining prosody control information.

FIG. 10 shows the relationship between gate time and delay time.

FIG. 11 shows the level and center frequency of each formant waveform.

FIG. 12 shows example data for the body of an FSeq data chunk.

FIG. 13A shows the meaning of the symbols used in HV-Script.

FIG. 13B shows a pitch change specified by symbols in HV-Script.

FIG. 13C shows a pitch change specified by symbols in HV-Script.

FIG. 14 shows the parameters that characterize a formant.

FIG. 15 is a block diagram showing the schematic configuration of the mobile phone according to the embodiment.

FIG. 16 shows the configuration of the speech synthesis unit.

FIG. 17 shows the configuration of the formant generation unit shown in FIG. 16.

FIG. 18 is a flowchart explaining the operation of the mobile phone when translation processing is executed.

FIG. 19 is a flowchart explaining the operation of the delivery server when translation processing is executed.

FIG. 20 is a flowchart explaining the operation of the mobile phone when dictionary search processing is executed.

FIG. 21 is a flowchart explaining the operation of the delivery server when dictionary search processing is executed.

FIG. 22 shows an example of format types.

FIG. 23 shows an example of language types.

FIG. 24 shows an example of time bases.

FIG. 25 shows an example of HV timbre parameters.

FIG. 26 shows an example of the structure of prosody control information.

FIG. 27 shows an example of an FSeq-type frame data sequence.

* Explanation of reference numerals for the main parts of the drawings *

1: mobile phone; 2: delivery server

2a: controller; 2b: database

100: SMAF file; 113: LCD display

12: communication unit; 12a: antenna

13: voice processing unit; 15: microphone

16a: sound source; 17: speaker

The present invention relates to a portable terminal device that uses speech synthesis to pronounce a translation result produced by a translation means, or a dictionary search result.

In recent years, a variety of services and functions have been offered for mobile phones (cellular phones and the like). For example, services using a wireless communication network are in operation in which a user merely enters a sentence written in one language, the sentence is automatically machine-translated into another language, and the translation result is pronounced as speech for the user to hear. Such mobile phones therefore now provide, in addition to their communication function as telephones, a translation and interpretation function between languages by way of these services.

Patent literature on such translation services also exists. For example, Japanese Patent Application Laid-open No. 2002-125050 discloses a technique in which a mobile phone transmits the user's input speech to an interpretation server via the Internet, and the speech machine-translated (or automatically translated) by the interpretation server is returned to the mobile phone.

However, because the above services and the technique described in the patent literature use a public telephone line to deliver the translation result as speech, the transmission and reception of voice data (or voice signals) at the mobile phone is necessarily restricted by the prescribed line capacity. The conventional techniques therefore cannot be said to make effective use of communication resources, which have recently become broadband and are trending toward higher speed and larger capacity.

In addition, the conventional techniques only allow the mobile phone to display the translation result as text or to output it as speech. They are not designed to provide related image or audio information together with the translation information, and they can hardly be said to meet the expectations and demands that users will have as communication technology develops.

The present invention has been made in view of the above points. Its object is to provide a portable terminal device that outputs a translation result or dictionary search result by speech synthesis, in which the capacity required to transmit and receive the translation or dictionary search result information can be reduced, and in which related information can also be provided to the user in real time using at least sound (or images).

The portable terminal device according to the present invention transmits text to be translated to a translation means (for example, an external translation delivery server) that translates input text into another language, has the text translated, and receives in reply translation result information containing the translated text and pronunciation data indicating how that text is pronounced. Based on the received pronunciation data, the device pronounces the translated text in the desired language and with natural intonation. Formant parameters are used as the pronunciation data.

Because the translation result can be pronounced as speech, the user of the portable terminal device can easily recognize and grasp the translation of the input text by ear. Moreover, the pronunciation data returned from the translation means designates and specifies the formant parameters (for example, formant frequency, formant level, and bandwidth) used to synthesize each phoneme, and its data size is smaller than that of an ordinary voice signal. Consequently, when the portable terminal device receives the pronunciation data over a communication network, a smaller transmission capacity suffices than for a conventional voice signal reply.

When the portable terminal device has display means (for example, a liquid crystal display), the translated text can be displayed, so the user can also recognize and grasp the translation result visually.

The translation means may be provided outside the portable terminal device, in which case the instruction to execute translation and the reception of the translation result take place over a communication network, or it may be provided inside the portable terminal device.

The portable terminal device according to the present invention also receives predetermined voice data and pronounces speech. It transmits headword information as a search key to a dictionary database to retrieve the corresponding meaning information, receives in reply search result information containing pronunciation data indicating how the meaning information is pronounced, and pronounces the search result as speech based on the received pronunciation data. The pronunciation data is constructed using formant parameters.

Because the meaning information retrieved with the headword information as the search key can be pronounced as speech, the user of the portable terminal device can easily recognize and grasp it by ear. Moreover, the search result information returned from the dictionary database represents formant parameters, so only a small transmission capacity is needed to return the search result.

The search result information comprises text representing the meaning information, image data representing an image related to the headword used as the search key, and second pronunciation data representing a sound related to the headword. In this case, the text and the image can be displayed by the display means of the portable terminal device, and the sound related to the headword can be generated from the second pronunciation data. The user thus obtains not only the meaning of the headword used as the search key but also rich related information.

The pronunciation data comprises a pronunciation character string representing the translation result or search result, and prosodic symbols that define the inflection (intonation, accent, and so on) with which the string is voiced. Specifically, it is written in so-called HV-Script, which is defined by a combination of ordinary characters (hiragana, katakana, and the like in the case of Japanese), numerals, and predetermined symbols.

An embodiment of the present invention will now be described in detail with reference to the accompanying drawings.

FIG. 1 shows the schematic configuration of a system according to a preferred embodiment of the present invention, comprising a mobile phone 1 and a delivery server 2 that provides a translation service and a dictionary search service to the mobile phone 1 over a communication network (including a wireless communication network and a digital data network).

The mobile phone 1 has a function of pronouncing translation results and search results in a desired language by speech synthesis, and is connected via the communication network to the delivery server 2, which is managed by a content provider or the like. The delivery server 2 includes a control unit 2a, which controls each part of the delivery server, and a dictionary database 2b.

The dictionary database 2b stores a translation dictionary used in the translation processing executed by the control unit 2a, and various dictionaries for looking up the meanings of words and the like. The translation dictionary registers, for each sentence or word (text data) to be translated, the translation result and pronunciation data (hereinafter "first pronunciation data") for synthesizing speech according to the pronunciation (for example, the reading) in a given language. The various dictionaries register, in association with one another, text representing the meaning information corresponding to a headword used as the search key (that is, the meaning of the headword being searched), pronunciation data for its reading (hereinafter "second pronunciation data"), and related information (images, sounds, and so on).

The control unit 2a performs predetermined processing in response to a translation request or a search request from the mobile phone 1. For a translation request, it translates the text sent from the mobile phone 1 using the translation dictionary, generates translation result information containing the translated text and the pronunciation data for its reading, and returns that information to the mobile phone 1. For a search request, it searches the designated dictionary using the headword contained in the request as the search key, generates search result information containing text representing the search result, the pronunciation data for its reading, pronunciation data representing a sound related to the headword, and image data representing a related image, and returns that information to the mobile phone 1.
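The request handling performed by the control unit 2a can be pictured roughly as follows. This is a minimal sketch in Python under stated assumptions, not the implementation described in the patent: the function `handle_request`, the two result classes, the dictionary name `en-ja`, and the toy dictionary contents (including the placeholder pronunciation strings) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TranslationResult:
    text: str             # translated text
    pronunciation: str    # pronunciation data for reading the translated text


@dataclass
class SearchResult:
    text: str             # text of the search result (meaning of the headword)
    pronunciation: str    # pronunciation data for reading that text
    related_sound: str    # pronunciation data for a sound related to the headword
    related_image: bytes  # image data for a related image


# Hypothetical stand-ins for the dictionaries held in dictionary database 2b.
TRANSLATION_DICT = {"hello": ("konnichiwa", "<pron:konnichiwa>")}
SEARCH_DICTS = {
    "en-ja": {"Duck": ("ahiru", "<pron:ahiru>", "<sound:quack>", b"<image bytes>")},
}


def handle_request(kind, text, dictionary="en-ja"):
    """Dispatch a request from the terminal to translation or dictionary search."""
    if kind == "translate":
        translated, pron = TRANSLATION_DICT[text]
        return TranslationResult(translated, pron)
    if kind == "search":
        meaning, pron, sound, image = SEARCH_DICTS[dictionary][text]
        return SearchResult(meaning, pron, sound, image)
    raise ValueError(f"unknown request kind: {kind!r}")
```

Both reply types carry text plus pronunciation data; only the search reply adds the related sound and image.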

Next, the structure of the dictionary database 2b will be described in detail.

As described above, the dictionary database 2b stores a translation dictionary, various dictionaries, and the like. For example, when the translation dictionary stored in the dictionary database 2b functions as an English-Japanese dictionary, 「とてもいい天氣ですね。」 is registered as the translated text for the English sentence “It’s very fine, isn’t it ?”, and mixed character-and-symbol data such as 「とっ’ても, S54’い/いて$ん＿き/です＿ね－2*－」 is registered as the pronunciation data for its reading. The translation dictionary also registers the pronunciation sequence data described later, but that is omitted from the explanation here. In this way, the pronunciation data for speech (that is, the human voice) in this embodiment includes prosodic symbols that define the intonation and the like with which the text is pronounced by speech synthesis.

The notation rules for pronunciation data written as text in this way (called "HV-Script" in this embodiment) are described later. In this embodiment, the mobile phone 1 converts the HV-Script pronunciation data received from the delivery server 2 into formant parameters for each phoneme, modifies those formant parameters according to the attached prosodic symbols to form a frame data sequence, and performs speech synthesis using that sequence. Data of the phoneme description type or of the formant frame description type may also be used.

As described above, the translation dictionary registers the translated text corresponding to short sentences and words, together with the pronunciation data for its reading. A relatively long sentence is translated by performing syntactic analysis and the like using well-known techniques, and the translation dictionary contains the various data needed for that translation. The translation dictionary also registers pronunciation data for each phrase and word that can make up a sentence, and pronunciation data for a whole sentence is generated by successively replacing the phrases and words of the translated sentence with their corresponding pronunciation data. In addition, the dictionary database 2b registers rules for selecting and determining prosodic symbols from the conditions under which a phrase or word is used (that is, conditions such as sentence-initial position, mid-sentence position, and interrogative sentences). These rules allow the desired prosodic symbols to be determined and added, or existing prosodic symbols to be changed as appropriate.
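The substitution step described above might look like the following. This is only an illustrative Python sketch: the markers `[head]`, `[mid]`, and `[q]` are invented placeholders, not HV-Script prosodic symbols, and the word table, the rule table, and the way conditions are classified are all assumptions.

```python
# Hypothetical per-word pronunciation data; real entries would be HV-Script strings.
WORD_PRONUNCIATION = {"good": "<good>", "morning": "<morning>"}

# Hypothetical prosodic markers selected by usage condition.
PROSODY_RULES = {"head": "[head]", "middle": "[mid]", "question": "[q]"}


def usage_condition(position, word_count, is_question):
    """Classify where a word is used so that a prosody rule can be selected."""
    if is_question and position == word_count - 1:
        return "question"
    return "head" if position == 0 else "middle"


def sentence_pronunciation(words, is_question=False):
    """Replace each word with its pronunciation data and attach a prosodic marker."""
    parts = []
    for position, word in enumerate(words):
        condition = usage_condition(position, len(words), is_question)
        parts.append(WORD_PRONUNCIATION[word] + PROSODY_RULES[condition])
    return " ".join(parts)
```

Keeping the rule table separate from the word table mirrors the text, where the prosodic symbol depends on how a word is used rather than on the word itself.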

The dictionary database 2b also stores various dictionaries. As shown in FIG. 2, each dictionary is a collection of many dictionary entries, each of which is a set consisting of headword information (an index) and the meaning information corresponding to that headword information. The meaning information consists of data representing the meaning of the headword information ("data 1"), first pronunciation data indicating how the headword is pronounced ("data 2"), second pronunciation data representing a sound related to the headword ("data 3"), and image data representing an image related to the headword ("data 4").

For example, when the translation dictionary functions as an English-Japanese dictionary, the English words designated as search keys are registered in the index column, as shown in FIG. 2. For each English headword, data 1 holds the translation of the word, which represents its meaning (for example, 「오리」, meaning "duck", when the headword is “Duck”); data 2 holds the pronunciation data of that translation; data 3 holds pronunciation data of a sound related to the English headword (for “Duck”, for example, pronunciation data for a duck's quack); and data 4 holds image data related to the English headword (for “Duck”, for example, image data of a duck).
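One way to picture a dictionary entry with its index and data 1 to data 4 columns is the record below. This is a hedged Python sketch: the class name, field names, helper functions, and placeholder payloads are all hypothetical, and plain strings and bytes stand in for the sequence data that the columns actually hold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DictionaryEntry:
    index: str    # headword information used as the search key
    data1: str    # meaning of the headword (here, its translation)
    data2: str    # pronunciation data for reading data1
    data3: bytes  # pronunciation data for a sound related to the headword
    data4: bytes  # image data related to the headword


def build_dictionary(entries):
    """Index the entries by headword so a search key finds its entry directly."""
    return {entry.index: entry for entry in entries}


def search(dictionary, headword) -> Optional[DictionaryEntry]:
    """Return the entry for the headword, or None if it is not registered."""
    return dictionary.get(headword)


# Placeholder payloads for the "Duck" example.
duck = DictionaryEntry("Duck", "ahiru", "<pron:ahiru>", b"<quack>", b"<duck image>")
dictionary = build_dictionary([duck])
```

A lookup such as `search(dictionary, "Duck")` then returns everything the terminal needs to display the text and image and to sound both pieces of pronunciation data.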

Text sequence data is registered in the index and data 1 columns. It consists of event data (indicating a text string, playback position, and so on) and duration data. Details are given later.

In the data 2 column, only HV-Script pronunciation data is written when a single word is registered, whereas voice sequence data is written when a series of multiple words is registered.

Voice sequence data consists of multiple pieces of voice data (data expressed in HV-Script) and pronunciation sequence data, and a pronunciation number is assigned to each piece of pronunciation data. As shown in FIG. 3, the pronunciation sequence data consists of event data and duration data indicating the interval between events. Each event data item holds data indicating the pronunciation number that designates the corresponding pronunciation data, and data indicating the period for which that pronunciation data is sounded. When the pronunciation sequence data is played back, the pronunciation data corresponding to each event is reproduced at the timing given by the duration data, so the series of words is pronounced in order.
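The playback of pronunciation sequence data can be sketched as below. This Python example is illustrative only: the class and function names are hypothetical, time is counted in abstract ticks, and each duration is assumed to be the wait before its event, which the text here does not specify.

```python
from dataclasses import dataclass


@dataclass
class PronunciationEvent:
    duration: int  # interval since the previous event (assumed to precede the event)
    number: int    # pronunciation number of the pronunciation data to sound
    period: int    # how long that pronunciation data sounds


def playback_schedule(events, pronunciation_data):
    """Return (start, end, data) triples in playback order."""
    now = 0
    schedule = []
    for event in events:
        now += event.duration
        data = pronunciation_data[event.number]
        schedule.append((now, now + event.period, data))
    return schedule


# Hypothetical pronunciation data keyed by pronunciation number.
voices = {1: "<good>", 2: "<morning>"}
events = [PronunciationEvent(0, 1, 10), PronunciationEvent(12, 2, 15)]
```

With these values the first item sounds from tick 0 to 10 and the second from tick 12 to 27, so the two words are pronounced one after the other.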

The data 3 column holds PCM (Pulse-Code Modulation) sequence data or FM (Frequency Modulation) sequence data. PCM sequence data consists of waveform data, which corresponds to pronunciation data, and sequence data for PCM. The waveform data is designated by a waveform number. As shown in FIG. 3, the sequence data for PCM consists of event data (indicating a waveform number, sounding time, and so on) and duration data (indicating the event interval). FM sequence data consists of timbre data, which is pronunciation data conforming to the MIDI (Musical Instrument Digital Interface) standard, and sequence data for FM. The timbre data represents an FM synthesis algorithm and is designated by a timbre number. As shown in FIG. 3, the sequence data for FM consists of event data (indicating a timbre number, pitch, note length, and so on) and duration data (indicating the event interval).

The waveform data of PCM sequence data is recorded audio data, whereas FM sequence data is data for synthesizing musical tones by controlling an FM sound source. PCM sequence data can therefore reproduce both musical tones and voice realistically, but the amount of data is large, so it uses more memory than FM sequence data. FM sequence data, on the other hand, is suited to realistically reproducing the sounds of musical instruments and the like, and has the advantage that the amount of data is small, so little memory is required.

The data 4 column holds image sequence data. This consists of image data in a predetermined format (for example, the JPEG (Joint Photographic Experts Group) standard) and sequence data for image display. The image data is designated by an image number. As shown in FIG. 3, the sequence data for image display consists of event data (indicating an image number, display time, display mode, and so on) and duration data (indicating the event interval).

Each dictionary consists of many dictionary entries, as described above. In addition, a link address may be assigned to each dictionary entry so that access can jump to a given entry of the dictionary from text that contains a dictionary link pointing to that dictionary.

When search result information is returned to the mobile phone 1, it is given a predetermined data exchange format so that the playback data it contains can be reproduced on the mobile phone 1. This format extends the format of the SMAF specification already published by the present applicant (SMAF Specification Ver. 3.06, Yamaha Corporation, [retrieved October 18, Heisei 14 (2002)], Internet <URL:http://smaf.yamaha.co.jp>) so that it can handle the pronunciation of speech (that is, the human voice).

SMAF (Synthetic Music Mobile Application Format) is a data format specification for representing multimedia content on portable terminals and the like.

SMAF is described below with reference to FIG. 4.

In FIG. 4, reference numeral 100 denotes an SMAF file, whose basic structure is a block of data called a chunk. A chunk consists of a fixed-length (8-byte) header and a body of arbitrary length, and the header is further divided into a 4-byte chunk ID and a 4-byte chunk size. The chunk ID is used as the identifier of the chunk, and the chunk size indicates the length of the body. The SMAF file 100 itself, and all of the various data contained in it, have this chunk structure.
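The chunk layout described above (a 4-byte ID, a 4-byte size, then a body of that size) can be read with a few lines of Python. This is a sketch under assumptions, not a SMAF parser: the byte order of the size field is not stated in the text and big-endian is assumed, the chunk IDs in the usage note are made up, and `read_chunk`, `iter_chunks`, and `make_chunk` are hypothetical helper names.

```python
import struct

HEADER = struct.Struct(">4sI")  # 4-byte chunk ID + 4-byte size; big-endian is assumed


def read_chunk(buf, offset=0):
    """Return (chunk_id, body, next_offset) for the chunk starting at offset."""
    if len(buf) - offset < HEADER.size:
        raise ValueError("truncated chunk header")
    chunk_id, size = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    end = start + size
    if end > len(buf):
        raise ValueError("chunk body extends past end of data")
    return chunk_id, buf[start:end], end


def iter_chunks(buf):
    """Yield (chunk_id, body) for each chunk laid end to end in buf."""
    offset = 0
    while offset < len(buf):
        chunk_id, body, offset = read_chunk(buf, offset)
        yield chunk_id, body


def make_chunk(chunk_id, body):
    """Build a chunk from an ID and a body (helper for trying the reader out)."""
    return HEADER.pack(chunk_id, len(body)) + body
```

Because the file and its contents share the same structure, the reader can be applied recursively: for a container built as `make_chunk(b"FILE", make_chunk(b"AAAA", b"xy"))`, calling `read_chunk` returns the outer body, and `iter_chunks` on that body yields the inner chunk.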

As shown in FIG. 4, the SMAF file 100 consists of a Contents Info Chunk 101, in which management information is described, and one or more track chunks 102 to 108 containing sequence data for output devices. Sequence data defines and expresses, along the time axis, the control applied to an output device. All sequence data contained in one SMAF file 100 is defined to start playback simultaneously at time 0, and as a result all sequence data is played back in synchronization.

Sequence data is expressed as a combination of events and durations. An event is data indicating the control to be applied to the output device corresponding to the sequence data, and duration data indicates the elapsed time between one event and the next. The processing time of an event is not actually zero, but SMAF treats it as effectively zero and expresses all passage of time through durations. The time at which a given event is executed can therefore be determined uniquely by summing the durations from the beginning of the sequence data. As a rule, the processing time of an event does not affect the time at which processing of the next event starts. Consequently, consecutive events separated by a duration of 0 are interpreted as being executed simultaneously.
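The rule that an event's execution time is the running sum of the durations before it can be illustrated with a few lines of Python. The function name `event_times` and the representation of sequence data as `(duration, event)` pairs are assumptions made only for this sketch.

```python
def event_times(sequence):
    """Return (absolute_time, event) pairs for a list of (duration, event) pairs.

    Each duration is the elapsed time since the previous event (or since
    time 0 for the first event), so the absolute time is a running sum.
    Event processing time is treated as zero, as the text describes.
    """
    now = 0
    timed = []
    for duration, event in sequence:
        now += duration
        timed.append((now, event))
    return timed
```

With the input `[(0, "a"), (5, "b"), (0, "c"), (3, "d")]`, events "b" and "c" both receive time 5, which corresponds to the statement that events separated by a duration of 0 are executed simultaneously.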

SMAF defines, as output devices, an FM sound source device that produces sound from MIDI-equivalent control data, a PCM sound source device that plays back PCM data, a display device such as an LCD (Liquid Crystal Display) that displays text and images, and so on.

The track chunks correspond to the output devices so defined: score track chunks 102 to 105, a PCM audio track chunk 106, a graphics track chunk 107, and a master track chunk 108 are provided. Except for the master track chunk 108, up to 256 tracks each can be described for the score track chunks 102 to 105, the PCM audio track chunk 106, and the graphics track chunk 107.

In the example of FIG. 4, the score track chunks 102 to 105 store sequence data for playback processing by the FM sound source device (sound source 111). The PCM audio track chunk 106 stores, in event form, wave data such as ADPCM (Adaptive Differential Pulse-Code Modulation), MP3 (MPEG Audio Layer 3), or TwinVQ to be sounded by the PCM sound source device (PCM decoder 112). The graphics track chunk 107 stores image data such as background images and inserted still images, text data, and the sequence data for reproducing them on the display device (LCD display 113). The master track chunk 108 stores sequence data for controlling the SMAF sequencer itself.

SMAF has the format described above and contains various kinds of sequence data, such as MIDI-equivalent data (music data), PCM audio data, and display data for text and images, all of which can be played back in synchronization. However, nothing in particular is defined for expressing the human voice, so this embodiment extends the functions of conventional SMAF as follows.

That is, as shown in FIG. 5, the SMAF file 100 is extended to further include an HV (Human Voice) track chunk h4, which stores voice playback sequence data for reproducing speech (the human voice) with a predetermined sound source. This voice playback sequence data consists of sets, arranged in time order, of a voice playback event, which instructs playback of speech based on pronunciation data representing formant parameters, and duration data, which specifies the time at which that voice playback event is executed as the elapsed time from the preceding voice playback event.

With this extension, just as in playback of a conventional SMAF file 100, starting playback of every sequence data at the same moment allows all of the data to be played back in synchronization on the same time axis.

Any one of the following three kinds of playback instruction information may be used as the voice playback event.

(1) Text description type information, consisting of a character string indicating the reading of the speech to be synthesized and prosodic symbols specifying vocal expression (such as intonation).

(2) Phoneme description type information, consisting of phoneme information representing the speech to be synthesized and prosody control information.

(3) Formant frame description type information, consisting of formant parameters for each frame time that represent the speech to be reproduced.

The text description type (called the "TSeq type") is a format that describes the speech to be sounded in text notation. It contains a character string in the character code of each language together with symbols (prosodic symbols) that direct vocal expression such as accent. During playback on the mobile phone 1 side, as shown in FIG. 6(a), middleware processing converts this TSeq-type sequence data to the PSeq type (the first conversion process), then converts the PSeq type to the FSeq type (the second conversion process), and outputs the result to the speech synthesis device.

The first conversion process, from the TSeq type to the PSeq type, is performed by referring to a first dictionary. This dictionary stores language-dependent character strings (for example, Japanese hiragana or katakana text) and prosodic symbols, together with the corresponding language-independent information indicating pronunciation (phonemes) and the prosody control information for controlling prosody. The second conversion process, from the PSeq type to the FSeq type, is performed by referring to a second dictionary that stores each phoneme with its corresponding formant parameters (parameters such as formant frequency, bandwidth, and level for generating each formant). The formant parameters obtained from this conversion are then modified according to the prosody control information.
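The two dictionary lookups can be pictured with a toy sketch. Everything in it is hypothetical: the dictionary contents, the placeholder formant numbers, the function names, and the use of a single volume factor to stand in for prosody control are assumptions chosen only to show the data flow, and prosodic symbols are omitted.

```python
# Hypothetical first dictionary: language-dependent text -> language-independent phonemes.
FIRST_DICT = {"お": ["o"], "は": ["h", "a"], "よ": ["y", "o"], "う": ["u"]}

# Hypothetical second dictionary: phoneme -> formant parameters.
# The numbers are placeholders, not real formant measurements.
SECOND_DICT = {
    phoneme: {"freq_hz": 100 * (index + 1), "level": 64}
    for index, phoneme in enumerate("ohayu")
}


def tseq_to_pseq(text):
    """First conversion: look each character up in the first dictionary."""
    phonemes = []
    for char in text:
        phonemes.extend(FIRST_DICT[char])
    return phonemes


def pseq_to_fseq(phonemes, volume=1.0):
    """Second conversion: look each phoneme up in the second dictionary,
    then adjust the result with a (simplified) prosody control value."""
    frames = []
    for phoneme in phonemes:
        params = SECOND_DICT[phoneme]
        frames.append({"freq_hz": params["freq_hz"], "level": params["level"] * volume})
    return frames
```

Calling `pseq_to_fseq(tseq_to_pseq("おはよう"))` chains the two steps in the order TSeq, PSeq, FSeq.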

The phoneme description type (PSeq type) describes information about the speech to be sounded in a form similar to the MIDI events defined by SMF (Standard MIDI File), and its description of speech is based on language-independent phoneme units. As shown in FIG. 6(b), the delivery server 2 converts TSeq-type pronunciation data, retrieved from a dictionary stored in the dictionary database 2b, to the PSeq type by the first conversion process. When this PSeq-type pronunciation data is played back on the mobile phone 1, the second conversion process, executed as middleware processing, converts the PSeq-type data file to the FSeq type and outputs it to the speech synthesis device.

The formant frame description type (FSeq type) is a format that expresses formant parameters as a sequence of frame data. As shown in FIG. 6(c), the delivery server 2 executes the whole series of conversions: TSeq type, first conversion process, PSeq type, second conversion process, FSeq type. FSeq-type data can also be created by applying a third conversion process, which is the same as ordinary speech analysis, to sampled waveform data. During playback on the mobile phone 1, a given FSeq-type file can be output to the speech synthesis device as it is and reproduced.

Next, the contents of the HV track chunk h4 are described in detail with reference to FIG. 7.

As shown in FIG. 7, each HV track chunk h4 contains data specifying a format type (Format Type), which indicates which of the three formats described above the voice playback sequence data in this chunk uses, a language type (Language Type), which indicates the language in use, and a timebase (Timebase).

FIG. 23 shows an example of language types. Only Japanese (0x00; the prefix 0x denotes hexadecimal) and Korean (0x02) are shown here, but other languages such as Chinese and English can be defined in the same way.

The timebase sets the reference time for the durations and gate times in the sequence data chunk contained in this track chunk. FIG. 24 shows an example of the timebase; the time value (for example, 20 msec) can be changed as appropriate.

Next, the data of the three format types above is described in detail.

(a) TSeq type (format type: 0x00)

As described above, this format type uses a sequence representation in text notation (TSeq: Text Sequence) and contains a sequence data chunk h5 and n (n is an integer of 1 or more) TSeq data chunks (TSeq#00 to TSeq#n) (h6, h7, h8) (see FIG. 7). A voice playback event (note-on event) contained in the sequence data instructs playback of the data contained in a TSeq data chunk.

(a-1) Sequence data chunk

Like the sequence data chunk in SMAF, the sequence data chunk h5 contains sequence data in which combinations of a duration and an event are arranged in time order. FIG. 8A shows the structure of the sequence data, where a duration represents the time between one event and the next. The first duration (Duration 1) represents the elapsed time from time 0. FIG. 8B shows the relationship between the duration and the gate time contained in a note message when the event is a note message; the gate time represents the sounding time of that note message. The sequence data chunk structure shown in FIGS. 8A and 8B is the same for the sequence data chunks of the PSeq type and the FSeq type.

This sequence data chunk supports the following three events. The initial values given below are the defaults that apply when no event specifies a value.

(a-1-1) Note message "0x9n kk gt"

Here, "n" is the channel number (0x0 [fixed]), "kk" is the TSeq data number (0x00 to 0x7F), and "gt" is the gate time (1 to 3 bytes).

The note message interprets the TSeq data chunk specified by the TSeq data number kk on the channel specified by the channel number n, and starts sounding it. A note message whose gate time gt is "0" produces no sound.

(a-1-2) Volume "0xBn 0x07 vv"

Here, "n" is the channel number (0x0 [fixed]) and "vv" is the control value (0x00 to 0x7F). The initial value of the channel volume is "0x64".

The volume message specifies the volume of the designated channel.

(a-1-3) Pan (panpot) "0xBn 0x0A vv"

Here, "n" is the channel number (0x0 [fixed]) and "vv" is the control value (0x00 to 0x7F). The initial value of the panpot is "0x40 (center)".

The pan message specifies the stereo sound field position of the designated channel for a user device that has a stereo sound system.
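As an illustration of how a reader might track the volume and pan messages just listed, the following sketch applies a list of three-byte control messages to a state initialized with the stated defaults (0x64 and 0x40). The function name and the tuple representation of a message are assumptions for this example only; durations and note messages are left out.

```python
# Initial values from the text: channel volume 0x64, panpot 0x40 (center).
DEFAULTS = {"volume": 0x64, "pan": 0x40}

# Controller numbers from the message formats "0xBn 0x07 vv" and "0xBn 0x0A vv".
CONTROLLERS = {0x07: "volume", 0x0A: "pan"}


def apply_control_changes(messages):
    """Return the volume/pan state after applying (status, controller, value) tuples.

    Messages that are not 0xBn control changes, use another controller,
    or carry a value outside 0x00-0x7F are skipped.
    """
    state = dict(DEFAULTS)
    for status, controller, value in messages:
        if status & 0xF0 != 0xB0:
            continue
        name = CONTROLLERS.get(controller)
        if name is None or not 0 <= value <= 0x7F:
            continue
        state[name] = value
    return state
```

With an empty message list the function returns the defaults, which mirrors the statement that the initial values apply when no event specifies otherwise.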

(a-2) TSeq data chunks (TSeq#00 to TSeq#n)

A TSeq data chunk (h6, h7, h8, and so on) holds information for speech synthesis in a chat-oriented format that includes information about the language and character code, settings for the sound to be produced (such as intonation), and the reading information (to be rendered by speech synthesis). It is written in HV-Script.

(b) PSeq type (format type: 0x01)

The PSeq type is a format type that uses a phoneme-based sequence representation (PSeq: Phoneme Sequence) in a form similar to MIDI events. Because this form describes phonemes, it is not language dependent. Phonemes can also be expressed as character information indicating pronunciation, so a code common to many languages, such as ASCII, can be used.

As shown in FIG. 7, the PSeq type contains a setup data chunk h9, a dictionary data chunk h10, and a sequence data chunk h11. A voice playback event (note message) in the sequence data instructs playback of the phonemes and prosody control information on the designated channel.

(b-1) Setup data chunk (Setup Data Chunk) (optional)

This chunk stores timbre data for the sound source section, in the form of a series of exclusive messages. In this embodiment, HV timbre parameter registration messages are stored as the exclusive messages.

An HV timbre parameter registration message has, for example, the format "0xF0 Size 0x43 0x79 0x07 0x7F 0x01 PC data ... 0xF7", where "PC" is the program number (0x02 to 0x0F) and "data" is the HV timbre parameters. This message registers the HV timbre parameters for the corresponding program number "PC".

FIG. 25 shows an example of the HV timbre parameters.

As shown in FIG. 25, the HV timbre parameters include a pitch shift amount and, for each of the first to n-th formants (n is an integer of 2 or more), a formant frequency shift amount, a formant level shift amount, and operator waveform selection information. The mobile phone 1, which is the user device, stores a preset dictionary (the "second dictionary" above) describing each phoneme and its corresponding formant parameters (that is, formant frequency, bandwidth, level, and so on), and the HV timbre parameters specify shift amounts relative to the parameters stored in this preset dictionary. The same shift is thus applied to every phoneme, which changes the voice quality of the synthesized speech.
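The idea of one set of shift amounts applied uniformly to every phoneme in the preset dictionary can be sketched as follows. The preset values, the function name, and the treatment of shifts as simple additive offsets are all assumptions for illustration; the text does not state the units or arithmetic of the shift amounts, and pitch shift and waveform selection are omitted.

```python
def apply_timbre(preset, freq_shifts, level_shifts):
    """Return a new dictionary with per-formant shifts applied to every phoneme.

    preset maps a phoneme to a list of (frequency, level) pairs, one per formant.
    freq_shifts and level_shifts hold one offset per formant.
    ASSUMPTION: shifts are additive offsets.
    """
    shifted = {}
    for phoneme, formants in preset.items():
        shifted[phoneme] = [
            (freq + d_freq, level + d_level)
            for (freq, level), d_freq, d_level in zip(formants, freq_shifts, level_shifts)
        ]
    return shifted


# Placeholder preset with two formants per phoneme (numbers are illustrative only).
PRESET = {"a": [(800, 60), (1200, 50)], "i": [(300, 60), (2300, 45)]}
```

Because the same offsets reach every entry, a single small parameter set changes the character of the whole voice without replacing the dictionary.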

With these HV timbre parameters, as many timbres can be registered as there are program numbers in the range 0x02 to 0x0F.

(b-2) Dictionary data chunk (Dictionary Data Chunk) (optional)

This chunk stores dictionary data that depends on the language type, for example difference data relative to the preset dictionary or phoneme data not defined in the preset dictionary. This makes it possible to synthesize distinctive voices with different timbres.

(b-3) Sequence data chunk (Sequence Data Chunk)

Like the sequence data chunk described above, this chunk contains sequence data in which combinations of a duration and an event are arranged in time order.

The events (or messages) supported by the sequence data chunk h11 of the PSeq type are listed below. The reading side ignores any message other than these. The initial setting values given below are the defaults that apply when no event specifies a value.

(b-3-1) Note message "0x9n Nt Vel Gatetime Size Data..."

Here, "n" is the channel number (0x0 [fixed]), "Nt" is the note number (absolute note designation: 0x00 to 0x7F; relative note designation: 0x80 to 0xFF), "Vel" is the velocity (0x00 to 0x7F), "Gatetime" is the gate time length (variable length), and "Size" is the size of the data portion (variable length).

This note message starts sounding the speech on the designated channel. The MSB (Most Significant Bit) of the note number is a flag that switches its interpretation between "absolute value" and "relative value", and the remaining 7 bits give the note number. Since speech is sounded monaurally only, when gate times overlap, sounding is processed with priority given to the later arrival.
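The split of the note number byte into a one-bit flag and a seven-bit value can be sketched as below. The function name and the returned tuple are assumptions for this example; the mapping of a set MSB to "relative" follows from the ranges given above (0x80 to 0xFF for relative designation).

```python
def decode_note_number(nt):
    """Split the Nt byte into (mode, value).

    The most significant bit selects the interpretation:
    0 -> "absolute" (0x00-0x7F), 1 -> "relative" (0x80-0xFF).
    The low 7 bits carry the note number itself.
    """
    if not 0 <= nt <= 0xFF:
        raise ValueError("Nt must fit in one byte")
    mode = "relative" if nt & 0x80 else "absolute"
    return mode, nt & 0x7F
```

For instance, 0x3C and 0xBC carry the same seven-bit value 0x3C and differ only in how that value is to be interpreted.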

The data portion contains phonemes and the corresponding prosody control information (pitch bend and volume), and has the data structure shown in FIG. 26.

As shown in FIG. 26, the data portion consists of the number of phonemes n (#1), the individual phonemes (phoneme 1 to phoneme n) (#2 to #4) written, for example, in ASCII code, and the prosody control information. The prosody control information comprises pitch bend and volume. For pitch bend, the sounding interval is divided into N sections as specified by the phoneme pitch bend count (#5), and the pitch bend in each section is specified by pitch bend information written as pairs of a phoneme pitch bend position and a phoneme pitch bend, that is, from phoneme pitch bend position 1 and phoneme pitch bend 1 (#6, #7) to phoneme pitch bend position N and phoneme pitch bend N (#9, #10). For volume, the sounding interval is divided into M sections as specified by the phoneme volume count (#11), and the volume in each section is specified by volume information written as pairs of a phoneme volume position and a phoneme volume, that is, from phoneme volume position 1 and phoneme volume 1 (#12, #13) to phoneme volume position M and phoneme volume M (#15, #16).

FIG. 9 is a diagram for explaining the prosody control information. It takes as an example the case where the character information to be sounded is 「おはよう」 ("ohayou"), with N = M = 128. In FIG. 9, the sounding interval of the character information (ohayou) is divided into 128 (= N = M) sections, and the pitch and volume in each section are expressed by the pitch bend information and volume information described above in order to control the prosody.

FIG. 10 shows the relationship between the gate time length (Gatetime) and the delay time (Delay Time (#0)). As shown in FIG. 10, the actual sounding can be delayed by the delay time relative to the timing defined by the duration. In this embodiment, Gatetime = 0 is prohibited.

(b-3-2) Program change "0xCn pp"

Here, "n" is the channel number (0x0 [fixed]) and "pp" is the program number (0x00 to 0xFF). The initial value of the program number is 0x00.

This program change message sets the timbre of the designated channel. The values that can be set are "0x00" (male preset timbre), "0x01" (female preset timbre), and "0x02" to "0x0F" (extended timbres).

(b-3-3) Control change

This embodiment uses the following control change messages.

(b-3-3-1) Channel volume "0xBn 0x07 vv"

Here, "n" is the channel number (0x00 [fixed]) and "vv" is the control value (0x00 to 0x7F). The initial value of the channel volume is 0x64.

This channel volume message specifies the volume of a given channel; its purpose is to set the volume balance between channels.

(b-3-3-2) Pan (panpot) "0xBn 0x0A vv"

Here, "n" is the channel number (0x00 [fixed]) and "vv" is the control value (0x00 to 0x7F). The initial value of the panpot is 0x40 (center).

This message specifies the stereo sound field position of a given channel for a user device that has a stereo sound system.

(b-3-3-3) Expression "0xBn 0x0B vv"

Here, "n" is the channel number (0x00 [fixed]) and "vv" is the control value (0x00 to 0x7F). The initial value of the expression message is 0x7F (maximum).

This message directs a change in the volume that was set for a given channel by the channel volume message. It is used, for example, to vary the volume within a piece of music.

(b-3-3-4) Pitch bend "0xEn ll mm"

Here, "n" is the channel number (0x00 [fixed]), "ll" is the bend value (LSB) (0x00 to 0x7F), and "mm" is the bend value (MSB) (0x00 to 0x7F). The initial value of the pitch bend is 0x40 for the MSB (upper byte) and 0x00 for the LSB (lower byte).

This message shifts the pitch of a given channel up or down (that is, toward higher or lower frequency). The initial value of the range of change (the pitch bend range) is ±2 semitones. The downward pitch bend is at its maximum when the combination of bend values is 0x00/0x00, and the upward pitch bend is at its maximum when it is 0x7F/0x7F.

(b-3-3-5) Pitch bend sensitivity "0x8n bb"

Here, "n" is the channel number (0x00 [fixed]) and "bb" is the data value (0x00 to 0x18). The initial value of the pitch bend sensitivity is 0x02.

This message sets the sensitivity of the pitch bend for a given channel, in units of semitones. For example, when bb = 0x01 the range is ±1 semitone (a total range of 2 semitones).
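A short sketch shows how the two bend bytes and the sensitivity might combine into a pitch offset. The text gives only the center (MSB 0x40, LSB 0x00) and the two extremes; treating the bytes as a 14-bit value with linear scaling in between is the usual MIDI convention and is an assumption here, as is the function name.

```python
def pitch_bend_semitones(ll, mm, sensitivity=0x02):
    """Convert pitch bend bytes to an offset in semitones.

    ll is the LSB and mm the MSB, each 0x00-0x7F; together they form a
    14-bit value whose center is 0x2000 (MSB 0x40, LSB 0x00).
    sensitivity is the bend range in semitones (initial value 0x02).
    ASSUMPTION: linear mapping between the extremes.
    """
    if not (0 <= ll <= 0x7F and 0 <= mm <= 0x7F):
        raise ValueError("bend bytes must be in 0x00-0x7F")
    value = (mm << 7) | ll
    return (value - 0x2000) / 0x2000 * sensitivity
```

Under this mapping the initial value yields no shift, 0x00/0x00 yields the full downward range, and 0x7F/0x7F falls just short of the full upward range because the 14-bit scale has one fewer step above center than below.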

In this way, the PSeq format type describes speech information in a form similar to MIDI events, based on phoneme units expressed as character information indicating pronunciation. Its data size is larger than that of the TSeq type but smaller than that of the FSeq type.

As a result, pitch and volume can be controlled finely on the time axis, as in the MIDI standard. Because the description is phoneme based, it has no language dependency, and the timbre (voice quality) can be edited in detail. In other words, this embodiment allows speech control similar to the MIDI standard, and so has the advantage of being easy to add to conventional MIDI equipment.

(c) Formant frame description (FSeq) type (format type: 0x02)

This is a format that expresses formant parameters (that is, parameters such as the formant frequency and gain for generating each formant) as a sequence of frame data. The formants of the speech sounded within a fixed time (a frame) are treated as constant, and a sequence representation (FSeq: Formant Sequence) is used in which the formant parameters (formant frequency, gain, and so on) corresponding to the speech are updated for each frame. A note message contained in the sequence data instructs playback of the data in the specified FSeq data chunk.

This format type contains a sequence data chunk h12 and n (n is an integer of 1 or more) FSeq data chunks (FSeq#00 to FSeq#n: h13, h14, h15, and so on).

(c-1) Sequence data chunk

An FSeq data chunk consists of a sequence of FSeq frame data. That is, the speech information is cut into frames of a predetermined length (for example, 20 msec), the speech data within each frame period is analyzed, and the resulting formant parameters (formant frequency, gain, and so on) are expressed as a sequence of frame data representing the speech data of each frame.

In FIG. 27, #0 to #3 are data specifying the type of waveform (sine wave, square wave, and so on) of each of the plural formants (n in this embodiment) used for speech synthesis. #4 to #11 are parameters that define the n formants by combinations of formant level (amplitude) (#4 to #7) and center frequency (#8 to #11). That is, #4 and #8 define the first formant waveform (#0), and #5 and #9 define the second formant waveform (#1). Likewise, #7 and #11 define the n-th formant waveform (#3). #12 is a flag indicating the switch between unvoiced and voiced.

FIG. 11 shows the level and center frequency of each formant waveform. This embodiment uses n sets of formant data, from the first to the n-th formant. For each frame, the parameters for the first to n-th formants and the parameter for the pitch frequency are supplied to the speech synthesis device provided in the mobile phone 1, and the speech synthesis output for each frame is generated and output as described above.

FIG. 12 shows the data in the body of an FSeq data chunk. In the FSeq-type frame data sequence shown in FIG. 27, #0 to #3 specify the type of each formant waveform and therefore need not be specified for every frame. Accordingly, as shown in FIG. 12, all the data shown in FIG. 27 is set for the first frame, and only the data from #4 onward in FIG. 27 is set for subsequent frames. Structuring the body of the FSeq data chunk as shown in FIG. 12 reduces the total amount of data.
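The body layout described above can be illustrated with a minimal sketch. It assumes each frame is a list of 13 integer items (#0 to #12) and ignores the actual byte encoding, which the text does not specify; the function names `pack_body` and `unpack_body` are hypothetical.

```python
WAVE_FIELDS = 4    # items #0..#3: waveform type of each formant
FRAME_FIELDS = 13  # items #0..#12 of one frame, as described in the text


def pack_body(frames):
    """Flatten frames: the full first frame, then only items #4 onward.

    Assumes every frame shares the waveform types of the first frame.
    """
    if not frames:
        return []
    for frame in frames:
        if len(frame) != FRAME_FIELDS:
            raise ValueError("each frame needs 13 items (#0..#12)")
    body = list(frames[0])
    for frame in frames[1:]:
        body.extend(frame[WAVE_FIELDS:])
    return body


def unpack_body(body):
    """Rebuild full frames, reusing the waveform types of the first frame."""
    if not body:
        return []
    step = FRAME_FIELDS - WAVE_FIELDS
    first = list(body[:FRAME_FIELDS])
    rest = list(body[FRAME_FIELDS:])
    if len(first) != FRAME_FIELDS or len(rest) % step != 0:
        raise ValueError("body length does not match the frame layout")
    wave_types = first[:WAVE_FIELDS]
    frames = [first]
    for i in range(0, len(rest), step):
        frames.append(wave_types + rest[i:i + step])
    return frames
```

With this layout, two frames occupy 13 + 9 = 22 items instead of 26, and each further frame saves another four items.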

Because the FSeq type represents formant parameters (formant frequency, gain, and so on) as a frame data sequence, voice can be reproduced by outputting an FSeq-type file to the speech synthesis device as it is. The processing side therefore does not need the conversion processing required by the TSeq and PSeq types; the CPU (central processing unit) only has to update the frame at predetermined intervals. In addition, the voice quality of pronunciation data that is already stored can be changed by applying a fixed offset to it.

A file of any one of the types created as described above is transmitted to the mobile phone 1. The mobile phone 1 is a user device having a pronunciation sequencer, which supplies control parameters to the speech synthesis device at timings defined by the durations included in the sequence data, and a speech synthesis device, which reproduces and outputs voice based on the control parameters supplied from the pronunciation sequencer. In the mobile phone 1, the voice is reproduced in synchronization with other information (semantic information and information such as related sounds and images).

Next, this embodiment is described below on the assumption that the text-description-type voice reproduction sequence data based on HV-Script described above is used.

First, pronunciation data based on HV-Script (excluding the pronunciation sequence data) is described in detail.

For example, 「か＿3さがほ^5し＿4い’ 4ね$2－」, an example of pronunciation data based on HV-Script, is an HV-Script description for synthesizing the sentence 「かさがほしいね－」 with a predetermined intonation. The symbols 「’」, 「^」, 「＿」, 「$」, and so on in this example are prosodic symbols indicating the type of intonation to be added to a character (kana character). Each prosodic symbol adds a predetermined accent to the character that follows it (or, when a numeric value immediately follows the symbol, to the character that follows that value).

FIG. 13A shows the meaning of each (representative) HV-Script symbol.

That is, the symbol 「’」 raises the pitch at the onset of the character (see ① in FIG. 13B), the symbol 「^」 raises the pitch during pronunciation (see ③ in FIG. 13C), the symbol 「＿」 lowers the pitch at the onset of the character (see ② in FIG. 13B), and the symbol 「$」 lowers the pitch during pronunciation (see ④ in FIG. 13C).

When a numeric value immediately follows one of these symbols, the value specifies the amount of change of the accent to be added. For example, in the phrase 「か＿3さが」, the first character 「か」 is pronounced at the standard pitch, the pitch is lowered by 3 at the onset of the next character 「さ」, and the following 「が」 is pronounced at that lowered pitch.
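As an illustration of the pitch symbols described above, the following sketch tracks a running pitch offset while reading a script. It is not the HV-Script implementation: it handles only the four pitch symbols, and both the default amount of 1 when no number follows and the accumulation of successive changes are assumptions. The name `parse_pitch` is hypothetical.

```python
# The four pitch symbols from the text: (direction, where the change happens).
PITCH_SYMBOLS = {
    "\u2019": (+1, "start"),   # right single quote: raise pitch at the onset
    "^":      (+1, "during"),  # raise pitch during pronunciation
    "\uff3f": (-1, "start"),   # fullwidth low line: lower pitch at the onset
    "$":      (-1, "during"),  # lower pitch during pronunciation
}
DEFAULT_AMOUNT = 1  # assumption: amount used when no number follows a symbol


def parse_pitch(script):
    """Return (character, pitch at onset, pitch at end) for each character."""
    result = []
    pitch = 0        # running offset from the standard pitch
    pending = None   # (signed amount, position) waiting for its character
    i = 0
    while i < len(script):
        ch = script[i]
        if ch in PITCH_SYMBOLS:
            sign, position = PITCH_SYMBOLS[ch]
            i += 1
            digits = ""
            while i < len(script) and script[i].isdigit():
                digits += script[i]
                i += 1
            amount = int(digits) if digits else DEFAULT_AMOUNT
            pending = (sign * amount, position)
            continue
        start = end = pitch
        if pending is not None:
            delta, position = pending
            if position == "start":
                start = end = pitch + delta
            else:
                end = pitch + delta
            pending = None
        result.append((ch, start, end))
        pitch = end  # the changed pitch carries over to later characters
        i += 1
    return result
```

For the phrase 「か＿3さが」 this yields an offset of 0 for 「か」 and -3 for both 「さ」 and 「が」, matching the behavior described in the text.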

Thus, in HV-Script syntax, an accent (intonation) is added to a character in the words to be pronounced by writing a prosodic symbol (and, optionally, a numeric value indicating the amount of change in intonation) immediately before that character. The description above covers only the symbols that control pitch, but symbols that control loudness, speed, voice quality, and so on can also be used.

For example, consider 「とっ’ても, S54’い/いて$ん＿き/です＿ね－2*－」, the pronunciation data corresponding to 「とてもいい天氣ですね」, which is the translation of the English sentence “It’s very fine, isn’t it?” mentioned above. The 「S」 in 「S54」 is one of the control characters that change speed; it changes the speaking speed from the point following the control character. The 「54」 following the control character 「S」 is a numeric value indicating the speed. Because the initial value is 50, it instructs that the speed be raised by 4 from the initial value. Once the speaking speed has been changed by the control character 「S」, the same speed is maintained until the next speed change.

「/」 is one of the accent shift clear symbols (that is, symbols for undoing a changed accent) and returns the pitch changed by a prosodic symbol to 0. A pitch or volume changed by a prosodic symbol is held until a symbol indicating a phrase break, such as 「,」 or 「。」 (hereinafter, a "phrase delimiter"), occurs. 「*」 instructs that the pitch and volume be lowered in the latter half of the following character, and 「－」 instructs that the immediately preceding syllable be lengthened.

In the phrase 「ね－2*－」, the 「2」 applies to the long-sound mark 「－」 immediately before it and instructs that this lengthening be doubled. That is, 「ね－2*－」 lengthens the pronunciation period of 「ね」 by a total of 3 and, through 「*」, lowers the pitch and volume (by 1) at the end. The symbols 「’」 and 「$」 are as described above.
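The arithmetic behind the total of 3 can be checked with a short sketch. It assumes each long-sound mark contributes one unit of lengthening, or N units when followed by a number N, and it ignores every other symbol; the name `extension_units` is hypothetical.

```python
LONG_MARK = "\uff0d"  # the fullwidth long-sound mark used in the examples


def extension_units(phrase):
    """Sum the lengthening requested by the long-sound marks in a phrase."""
    total = 0
    i = 0
    while i < len(phrase):
        if phrase[i] == LONG_MARK:
            i += 1
            digits = ""
            while i < len(phrase) and phrase[i].isdigit():
                digits += phrase[i]
                i += 1
            # A bare mark counts once; a following number multiplies it.
            total += int(digits) if digits else 1
        else:
            i += 1
    return total
```

For 「ね－2*－」 the first mark counts twice and the second once, giving 3.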

As described above, pronunciation data based on HV-Script, which is one kind of data representing formant parameters, allows voice to be pronounced more naturally with a relatively small amount of information, and is therefore suited to uses such as pronouncing translation results. The HV-Script described here is suited to Japanese speech synthesis; for speech synthesis in other languages, the PSeq type or FSeq type described above may be used.

Next, the formants and formant parameters mentioned above are described in detail.

A formant has the shape shown in FIG. 14 and is specified by parameters (formant parameters) such as formant frequency, formant level, and formant bandwidth. The number of formants contained in a human voice and the frequency, amplitude, and bandwidth of each formant are important factors that determine the quality of the voice, and they differ greatly depending on the characteristics, physique, age, and so on of the speaker.

Nevertheless, whoever is speaking, the word 「あ」 is pronounced 「あ」 and the word 「い」 is pronounced 「い」, so the same word sounds the same. This is because, in the human voice, a characteristic combination of formants is determined for each kind of word that is pronounced. Formants fall into two broad types: voiced formants, which carry pitch information and are used to synthesize voiced sounds, and unvoiced formants, which carry no pitch information and are used to synthesize unvoiced sounds.

Here, a voiced sound is a sound produced with vibration of the vocal cords, and includes, for example, vowels, semivowels, and the voiced consonants used in the Japanese 「バ」, 「ガ」, 「マ」, and 「ラ」 rows. An unvoiced sound is a sound produced without vibration of the vocal cords, and includes, for example, the consonants used in the Japanese 「ハ」, 「カ」, and 「サ」 rows. As shown in FIG. 11, one phoneme is made up of multiple formants.

Accordingly, formants for each phoneme of a particular person's pronunciation are registered in the mobile phone 1 in advance. For each formant, the formant parameters described above (formant frequency, formant level, and formant bandwidth) and the basic waveform forming the formant are modified according to the prosodic symbols of the text-description-type HV-Script, or according to the prosody control information of the phoneme description type described above, and speech is then synthesized. In this way, voices with a variety of intonations can be produced.

In the delivery server 2, the control unit 2a, which consists of a memory, a CPU (central processing unit), and so on, carries out translation processing by loading into memory and executing a program comprising the processing procedure for translating the text to be translated and returning the translation result information. Likewise, it carries out dictionary search processing by loading into memory and executing a program comprising the processing procedure for searching the dictionary with the given headword as the search key and returning the search result information.

An input device, a display device, and the like (none of which are shown) are connected to the delivery server 2 as peripheral devices. Here, the input device is a device such as a keyboard or mouse, and the display device is a CRT (Cathode Ray Tube), a liquid crystal display, or the like.

The dictionary database 2b is made up of a nonvolatile storage device such as a hard disk or a magneto-optical disk. It may be installed inside the delivery server 2, or in an external device or another server that is accessible from the delivery server 2.

Next, the configuration and operation of the mobile phone 1 according to this embodiment are described in detail.

FIG. 15 is a block diagram showing the schematic configuration of the mobile phone 1.

The present invention is not limited to mobile phones (cellular phones and the like); it can also be applied to PHS (registered trademark) (Personal Handyphone System) terminals, personal digital assistants (PDAs) capable of wireless communication, and the like.

In FIG. 15, reference numeral 11 denotes a CPU (central processing unit), which controls the operation of each part of the mobile phone 1 by executing various programs.

Reference numeral 12 denotes a communication unit, which demodulates signals received by its antenna 12a, and modulates signals to be transmitted and supplies them to the antenna 12a.

The CPU 11 decodes the signal from the delivery server 2 demodulated by the communication unit 12 according to a predetermined protocol. For text-description-type voice reproduction sequence data based on HV-Script, it executes the first conversion processing and second conversion processing described above to generate a frame data sequence of formant parameters. The communication unit 12 supplies the signal to the display sequencer 21a or the pronunciation sequencer 16a, depending on whether the data in the received file is display data or pronunciation data.

Reference numeral 13 denotes a voice processing unit. A voice signal arriving over the telephone line and demodulated by the communication unit 12 is decoded by the voice processing unit 13, and the corresponding voice is output from the speaker 14. Conversely, a voice signal picked up by the microphone 15 is digitized and compression-encoded by the voice processing unit 13; it is then modulated by the communication unit 12 and transmitted from the antenna 12a to a base station (not shown) of the mobile phone network. The voice processing unit 13 performs high-efficiency compression encoding and decoding of voice data using, for example, the CELP (Code Excited Linear Predictive Coding) method or the ADPCM (Adaptive Differential Pulse-Code Modulation) method.

Reference numeral 16a denotes a pronunciation sequencer, which receives pronunciation control sequence data instructing the sound system to produce a predetermined voice or musical tone at a predetermined timing, and controls the sound source 16b with a speech synthesis function accordingly.

Reference numeral 16b denotes a sound source with a speech synthesis function, which consists of a speech synthesis unit (not shown) and an FM sound source device and/or a PCM sound source device. In addition to executing the speech synthesis processing described later, the sound source 16b reproduces the music data selected as the ringtone and outputs it from the speaker 17. The configuration of the speech synthesis unit is described in detail later. The FM sound source device may instead be a WT (Wave Table) sound source, a high-frequency synthesis sound source, or a square wave sound source, and the PCM sound source may be an MP3 decoder or the like.

Reference numeral 18 denotes an operation unit, an input means that detects input from various buttons (not shown), including alphanumeric buttons, provided on the housing of the mobile phone 1 and from other input devices.

Reference numeral 19 denotes a RAM (Random-Access Memory). It holds the work area of the CPU 11, a storage area for downloaded music data and accompaniment data (used, for example, to play incoming-call melodies), a mail data storage area for received e-mail and similar data, an area for storing the translation result information and search result information received from the delivery server 2, and so on.

Reference numeral 20 denotes a ROM (Read-Only Memory). It stores the various telephone function programs that the CPU 11 executes to control outgoing and incoming calls and the like, a program that assists music reproduction processing, a mail function program that controls the sending and receiving of e-mail and the like, and a program that assists speech synthesis processing. It also stores various data such as the contents of the first and second dictionaries described above and music data.

Reference numeral 21a denotes a display sequencer, which receives display control sequence data instructing that a predetermined image or text be displayed on the display unit 21b at a predetermined timing, and controls the display unit 21b accordingly.

The display unit 21b is a liquid crystal display (LCD). Under the control of the CPU 11 and the display sequencer 21a, it displays the desired text and images and displays output in response to operation of the operation unit 18.

Reference numeral 22 denotes a vibrator, which notifies the user of an incoming call by vibrating the body of the mobile phone 1 instead of sounding the ringtone.

The functional blocks described above are interconnected via the bus 30, over which data and commands are exchanged.

Next, the configuration of the speech synthesis unit included in the sound source 16b with a speech synthesis function is described in detail.

FIG. 16 shows the schematic configuration of the speech synthesis unit.

The speech synthesis unit shown in FIG. 16 has multiple formant generators 40a to 40m and one pitch generator 50. The formant generators 40a to 40m generate formant signals based on the formant parameters (the formant frequency, formant level, and so on used to generate each formant) and the pitch information output from the pronunciation sequencer 16a. These formant signals are combined in the mixing unit 60 to produce a given phoneme. Each formant generator 40a to 40m generates a basic waveform from which the formant signal is formed; a waveform generator of a well-known FM sound source, for example, can be used to generate this basic waveform. The pitch generator 50 has the function of generating a pitch by a predetermined computation, and adds the computed pitch to the generated phoneme only when the phoneme to be pronounced is a voiced sound.

Next, the configuration of the formant generators 40a to 40m is described with reference to FIG. 17.

As shown in FIG. 17, each formant generator 40a to 40m consists of a waveform generator 41, a noise generator 42, an adder 43, and an amplifier 44.

The waveform generator 41 generates one of the formants making up a phoneme according to the formant frequency, the basic formant waveform (sine wave, triangular wave, and so on), and the waveform phase specified for each formant of each phoneme. The noise generator 42 operates according to whether the formant generated by the waveform generator 41 is voiced or unvoiced; for an unvoiced sound, it generates noise and supplies it to the adder 43.

The adder 43 adds the noise supplied from the noise generator 42 to the formant generated by the waveform generator 41. The output of the adder 43 is amplified to a predetermined formant level by the amplifier 44 and output.

Each formant generator 40a to 40m handles one of the formants making up a phoneme, and a phoneme is formed by combining multiple formants. To generate one phoneme, therefore, the multiple formants making up that phoneme must be generated and combined. This is why multiple formant generators 40a to 40m are provided, as shown in FIG. 16.
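The signal path described above (waveform generator, noise generator, adder, amplifier, then mixing) can be sketched in a few lines. This is an illustration under stated assumptions, not the device's implementation: the basic waveform is fixed to a sine wave, the sample rate of 8000 Hz and the uniform noise are assumptions, formant bandwidth, waveform phase, and the pitch generator are omitted, and all function names are hypothetical.

```python
import math
import random

SAMPLE_RATE = 8000  # assumption: samples per second


def formant_generator(freq_hz, level, voiced, n_samples, rng):
    """One generator: sine waveform, plus noise when unvoiced, then gain."""
    out = []
    for n in range(n_samples):
        wave = math.sin(2.0 * math.pi * freq_hz * n / SAMPLE_RATE)
        noise = 0.0 if voiced else rng.uniform(-1.0, 1.0)  # noise generator
        out.append(level * (wave + noise))                 # adder, amplifier
    return out


def mix(signals):
    """Mixing unit: sum the formant signals sample by sample."""
    return [sum(samples) for samples in zip(*signals)]


def synthesize_frame(formants, voiced, frame_ms=20, seed=0):
    """Generate one frame from (center frequency in Hz, level) pairs."""
    n_samples = SAMPLE_RATE * frame_ms // 1000
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    signals = [
        formant_generator(freq, level, voiced, n_samples, rng)
        for freq, level in formants
    ]
    return mix(signals)
```

For example, `synthesize_frame([(800, 1.0), (1200, 0.5)], voiced=True)` returns the 160 samples of one 20 msec frame built from two formants.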

Next, the operation of the mobile phone 1 and the delivery server 2 according to this embodiment, configured as described above, is described in detail. Descriptions of well-known operations, such as placing and receiving calls with the ordinary telephone function, are omitted here.

First, the operation of the mobile phone 1 and the delivery server 2 when performing translation is described with reference to the flowcharts shown in FIGS. 18 and 19.

The user of the mobile phone 1 enters the text to be translated (for example, the English sentence “It’s very fine, isn’t it?”), specifies the translation languages (here, English to Japanese), and transmits a translation request containing this information (step S101).

The delivery server 2 waits, repeating the determination of step S201, until it receives a translation request from the mobile phone 1. On receiving a translation request from the mobile phone 1, it translates the text contained in the request using the translation dictionary in the dictionary database 2b (step S202).

Meanwhile, the mobile phone 1 waits, repeating the determination of step S102, until it receives the translation result data.

Using the translation dictionary in the dictionary database 2b, the delivery server 2 converts the translated text into pronunciation data based on HV-Script (step S203). Here, the translated text is converted into the corresponding HV-Script pronunciation data in units of documents, phrases, or words.

The delivery server 2 then generates translation result information in the data exchange format described above, containing the translated text and its pronunciation data, and returns the translation result information to the mobile phone 1 (step S204).
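The server-side steps S202 to S204 can be summarized in a small sketch. It stands in for real translation with plain dictionary lookups and represents the result as a Python dict rather than the data exchange format; the function name, the request keys, and the dictionary shapes are all hypothetical.

```python
def handle_translation_request(request, translation_dict, pronunciation_dict):
    """Translate (S202), convert to pronunciation data (S203), bundle (S204)."""
    language_pair = (request["source"], request["target"])
    translated = translation_dict[language_pair][request["text"]]  # S202
    pronunciation = pronunciation_dict[translated]                 # S203
    return {"text": translated, "pronunciation": pronunciation}    # S204


# Usage with the example sentence from the text (table contents are
# illustrative stand-ins for the dictionary database 2b).
translation_dict = {
    ("en", "ja"): {"It\u2019s very fine, isn\u2019t it?": "とてもいい天氣ですね"},
}
pronunciation_dict = {
    "とてもいい天氣ですね": "とっ\u2019ても, S54\u2019い/いて$ん\uff3fき/です\uff3fね\uff0d2*\uff0d",
}
reply = handle_translation_request(
    {"text": "It\u2019s very fine, isn\u2019t it?", "source": "en", "target": "ja"},
    translation_dict,
    pronunciation_dict,
)
```

The reply carries both the translated text for display and the pronunciation data for speech synthesis, which is what the mobile phone 1 needs to reproduce the result.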

When the mobile phone 1 receives the translation result information from the delivery server 2, the determination result of step S102 becomes "Yes", and the flow proceeds to step S103, where the received data is stored in the RAM 19.

The mobile phone 1 then waits, repeating the determination of step S104, until the user performs a predetermined key operation.

When the user of the mobile phone 1 operates the predetermined key to reproduce the translation result, the determination result of step S104 becomes "Yes", and the flow proceeds to step S105.

In step S105, the CPU 11 reads the translation result information received from the delivery server 2 out of the RAM 19, displays the contents of the text data contained in that translation result information on the display unit 21b, and has the sound source 16b with a speech synthesis function synthesize speech from the pronunciation data. The display of the translated text and its pronunciation from the pronunciation data continue until reproduction of the translation result information is complete (that is, until the determination result of step S106 becomes "Yes").

In this way, translation is carried out between the mobile phone 1 and the delivery server 2. Next, the operation of the mobile phone 1 and the delivery server 2 when the dictionary search function of the delivery server 2 is used is described with reference to the flowcharts shown in FIGS. 20 and 21.

First, the user of the mobile phone 1 enters the text to be searched for (for example, the English word “Duck”), specifies the type of dictionary to use (here, an English-Japanese dictionary), and transmits a search request containing this information (step S111).

The delivery server 2 waits, repeating the determination of step S211, until it receives a search request from the mobile phone 1. On receiving the search request, it searches the English-Japanese dictionary in the dictionary database 2b for the specified headword, using the headword contained in the search request as the search key (step S212).

Meanwhile, the mobile phone 1 waits, repeating the determination of step S112, until it receives the search result data (that is, the search result information).

When the search processing is complete, the delivery server 2 returns search result information in the data exchange format described above to the mobile phone 1 (step S213). This information contains text representing the semantic information of the search result (for example, 「오리」 ("duck"), the translation of “Duck”), pronunciation data for how that text is read, voice data of a duck's call as a sound related to the headword (“Duck”), and image data of a duck as related information.
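The server-side steps S212 and S213 can be pictured with the following minimal sketch. It is only an illustration under stated assumptions: the names `SearchResultInfo`, `DICTIONARY_DB`, and `handle_search_request`, the dictionary-type key `"en-ja"`, the placeholder byte strings, and the case-insensitive headword match are all hypothetical and do not come from the patent, which does not specify the data exchange format at this level.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class SearchResultInfo:
    """Hypothetical container for the four items returned in step S213."""
    text: str                  # semantic information shown as text
    pronunciation_data: bytes  # formant-parameter data for reading the text aloud
    related_sound: bytes       # sound related to the headword (e.g. a duck's call)
    related_image: bytes       # image related to the headword


# Hypothetical stand-in for dictionary database 2b:
# dictionary type -> headword -> search result information.
DICTIONARY_DB: Dict[str, Dict[str, SearchResultInfo]] = {
    "en-ja": {
        "duck": SearchResultInfo(
            text="오리",
            pronunciation_data=b"<formant parameters>",
            related_sound=b"<duck call>",
            related_image=b"<duck image>",
        )
    }
}


def handle_search_request(dictionary_type: str, headword: str) -> Optional[SearchResultInfo]:
    """Look up the headword in the requested dictionary (step S212).

    Returns the information to send back (step S213), or None if the
    dictionary or headword is unknown. Lower-casing the headword is an
    assumption made for this sketch.
    """
    entries = DICTIONARY_DB.get(dictionary_type, {})
    return entries.get(headword.lower())
```

A call such as `handle_search_request("en-ja", "Duck")` returns the entry whose `text` field is the translated word, and an unknown headword yields `None`.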

When the mobile phone 1 receives the search result information from the delivery server 2, the determination in step S112 becomes "YES", and the flow proceeds to step S113, where the data is stored in the RAM 19.

Thereafter, the mobile phone 1 waits, repeating the determination of step S114, until the user performs a predetermined key operation.

When the user of the mobile phone 1 operates the predetermined key, the determination in step S114 becomes "YES", and the flow proceeds to step S115.

In step S115, the CPU 11 reads from the RAM 19 the information designated by the user from among the search result information received from the delivery server 2 and reproduces it. The sequence data contained in the search result information and designated by the user is supplied to the corresponding display sequencer 21a and pronunciation sequencer 16a, which are controlled as appropriate to produce the desired display and audio output. For example, suppose the user of the mobile phone 1 has looked up the English word “Duck” in the English-Japanese dictionary. If the user designates text display of the search result, the text 「오리」 is displayed on the display unit 21b; if the user designates its pronunciation, the sound source 16b with the speech synthesis function synthesizes the speech and pronounces it. If the user designates reproduction of the related sound, the sound source 16b reproduces a duck's call as the related sound; if the user instructs reproduction of the related image, an image of a duck is displayed on the display unit 21b as the related image. When the user instructs simultaneous reproduction of these data, the data (that is, the text, the first and second pronunciation data, and the image data) are reproduced in synchronization under the control of the pronunciation sequencer 16a and the display sequencer 21a.
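The routing of user-designated items to the two sequencers in step S115 can be sketched as follows. This is a simplified illustration, not the patent's implementation: the item names (`"text"`, `"image"`, `"pronunciation"`, `"related_sound"`) and the function `route_playback` are hypothetical, and timing synchronization between the sequencers is not modeled.

```python
from typing import Iterable, List, Tuple

# Hypothetical item names; the patent does not define identifiers for them.
DISPLAY_ITEMS = {"text", "image"}                 # handled by display sequencer 21a
SOUND_ITEMS = {"pronunciation", "related_sound"}  # handled by pronunciation sequencer 16a


def route_playback(selected: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split the items the user designated into a display queue and a sound queue.

    Each queue preserves the order in which the user designated the items.
    """
    display_queue: List[str] = []
    sound_queue: List[str] = []
    for item in selected:
        if item in DISPLAY_ITEMS:
            display_queue.append(item)
        elif item in SOUND_ITEMS:
            sound_queue.append(item)
        else:
            raise ValueError(f"unknown item: {item}")
    return display_queue, sound_queue
```

Designating all four items corresponds to the simultaneous reproduction case, in which both queues are non-empty and would be played back in step with each other.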

The processing of steps S115 and S116 is repeated until reproduction of the designated search result information is complete.

The flows and steps used in the above description are examples; the present invention is therefore not limited to the processing flows described above.

The preferred embodiment of the present invention has been described above with reference to the drawings, but the specific configuration of the invention is not limited to this embodiment and includes configurations within a scope that does not depart from the gist of the invention. For example, the content of the dictionary database 2b of the delivery server 2 may be stored in the mobile phone 1, and the translation function and the dictionary search function may be provided within the mobile phone 1. In that case, the mobile phone 1 does not need to communicate with the delivery server 2 when performing translation or a dictionary search.

As described above, the present invention has various effects and technical features, which are briefly described below.

(1) In the present invention, the translation result can be reproduced as speech by the portable terminal device, so the user can recognize and grasp the translation result by ear. In addition, the pronunciation data returned from the translation means (for example, an external delivery server) represents formant parameters. Because the reply is not in the form of an audio signal as in the prior art, receiving the pronunciation data from an external device or the like does not require a large transmission capacity.

(2) In the present invention, the text representing the translation result can be displayed on the display means provided in the portable terminal device, so the user can also recognize the translation result visually.

(3) In the present invention, the semantic information retrieved using the headword information as the search key is pronounced as speech, so the user can recognize this semantic information by ear. In addition, the search result information returned from the dictionary database represents formant parameters, so when the search result information is received from a device external to the portable terminal device, only a small transmission capacity is required.

(4) In the present invention, the user can view not only the semantic information of the search result but also a related image. In addition, the pronunciation means pronounces not only the speech for the text representing the semantic information but also a sound related to the headword, so the user can obtain not only the meaning of the headword used as the search key but also rich related information.
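The point about transmission capacity can be illustrated with a rough sketch of formant-style synthesis. It is a deliberate simplification and not the sound source 16b itself: each "formant generator" here is just a sine oscillator at one formant frequency, and the outputs are summed. The formant values, the 8 kHz sample rate, the 20 ms frame length, and the 16-bit encodings are all assumptions chosen for illustration.

```python
import math
from typing import List, Sequence, Tuple


def formant_generator(freq_hz: float, amplitude: float,
                      n_samples: int, sample_rate: int) -> List[float]:
    """Generate one sine waveform at a single formant frequency."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def synthesize_frame(formants: Sequence[Tuple[float, float]],
                     n_samples: int, sample_rate: int = 8000) -> List[float]:
    """Sum the waveforms produced by one generator per (frequency, amplitude) pair."""
    frame = [0.0] * n_samples
    for freq_hz, amplitude in formants:
        wave = formant_generator(freq_hz, amplitude, n_samples, sample_rate)
        frame = [a + b for a, b in zip(frame, wave)]
    return frame


# Illustrative (assumed) formant frequencies in Hz and relative amplitudes.
FORMANTS = [(700.0, 0.5), (1200.0, 0.3), (2600.0, 0.2)]

frame = synthesize_frame(FORMANTS, n_samples=160)  # 160 samples = 20 ms at 8 kHz

# Size comparison under the stated assumptions:
# three formants, each sent as a 16-bit frequency plus a 16-bit amplitude,
# versus the same 20 ms sent as 16-bit PCM samples.
param_bytes = len(FORMANTS) * 2 * 2
pcm_bytes = len(frame) * 2
```

Under these assumptions the parameters for one frame occupy 12 bytes, while the equivalent PCM audio occupies 320 bytes, which is the kind of saving that effects (1) and (3) describe. A practical formant synthesizer would also model bandwidths, pitch, and excitation, which this sketch omits.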

Claims

Transmitting means 12 for transmitting text data input in a predetermined language to a predetermined server 2 through a communication line;

Receiving means 12 for receiving, from the server through the communication line, translation result information including a character string representing the result of translating the input text data into another language and pronunciation data for controlling the intonation of its reading, the pronunciation data being data representing a formant parameter;

Generating means for generating a formant parameter in accordance with the pronunciation data included in the translation result information; and

Pronunciation means (16a, 16b) for pronouncing, with accent, a voice corresponding to the character string in accordance with the formant parameter.

The portable terminal device according to claim 1, further comprising display means (21a, 21b) for displaying text data corresponding to the character string.

The portable terminal device according to claim 1, wherein the server has a dictionary database in which headword information and semantic information are associated, and the character string is semantic information obtained by searching the dictionary database for the headword information corresponding to the input text data.

The portable terminal device according to claim 3, further comprising display means (21a, 21b) for displaying text data representing the semantic information included in the translation result information and image data corresponding to the headword information.

The portable terminal device according to claim 4, wherein the translation result information further comprises data indicating a sound related to the headword information.

delete

Translation means (11, 19, 20) for translating text data input in a predetermined language into another language and generating translation result information including a character string representing the translation result and pronunciation data for controlling the intonation of its reading, the pronunciation data being data representing a formant parameter; and

Pronunciation means (16a, 16b) for pronouncing an accented voice corresponding to the character string in accordance with the formant parameter.

The portable terminal device according to claim 7, further comprising display means (21a, 21b) for displaying text data representing the translation result included in the translation result information.

The portable terminal device according to claim 7, further comprising a dictionary database in which headword information and semantic information are associated, wherein the semantic information obtained by searching the dictionary database for the headword information corresponding to the input text data is generated as the character string.

The portable terminal device according to claim 9, further comprising display means (21a, 21b) for displaying text data representing the semantic information included in the translation result information and image data corresponding to the headword information.

The portable terminal device according to claim 10, wherein the translation result information further comprises pronunciation data indicating a sound related to the headword information.

delete

The portable terminal device according to claim 1 or 7, wherein the pronunciation means has a plurality of formant generators 40 and synthesizes the waveforms of specific formant frequencies generated by the respective formant generators.

The text data input by the portable terminal device in a predetermined language is transmitted to the predetermined server 2 via a communication line,

Translation result information including a character string representing the result of translating the input text data into another language and pronunciation data for controlling the intonation of its reading is received from the server through the communication line, the pronunciation data being data representing a formant parameter,

A formant parameter is generated according to the pronunciation data included in the translation result information,

And a voice corresponding to the character string is pronounced with accent in accordance with the formant parameter.

Text data input in a predetermined language is translated into another language, and translation result information including a character string representing the translation result and pronunciation data for controlling the intonation of its reading is generated, the pronunciation data being data representing a formant parameter,

And the accented voice corresponding to the character string is pronounced according to the formant parameter.

A storage medium for storing a program executed by a processing apparatus included in a portable terminal device, comprising: transmitting text data input in a predetermined language by the portable terminal device to a predetermined server 2 through a communication line,

Generate a formant parameter according to the pronunciation data included in the translation result information,

And a program that causes a voice corresponding to the character string to be pronounced with accent in accordance with the formant parameter.

A storage medium for storing a program executed by a processing apparatus included in a portable terminal device, comprising: translating text data input in a predetermined language into another language, and generating translation result information including a character string representing the translation result and pronunciation data for controlling the intonation of its reading, wherein the pronunciation data is data representing a formant parameter,

And a program configured to pronounce an accented voice corresponding to the character string in accordance with the formant parameter.