JP2580565B2

JP2580565B2 - Voice information dictionary creation device

Info

Publication number: JP2580565B2
Application number: JP61061655A
Authority: JP
Inventors: 美聡石黒
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 1986-03-18
Filing date: 1986-03-18
Publication date: 1997-02-12
Anticipated expiration: 2012-02-12
Also published as: JPS62217294A

Description

DETAILED DESCRIPTION OF THE INVENTION [Summary] The present invention displays on a display device the character string indicating the syllables in the voice data that corresponds to a word or phrase expressed in characters. Utterance data such as accent, pitch, and speaking speed is converted into a visual image and displayed; the operator manipulates this image on the screen until it corresponds to the desired utterance data; and the displayed state of the visual image is then converted back into voice data and registered in a voice information dictionary.

[Industrial applications]

The present invention relates to a voice information dictionary creation device that makes it easy to set the voice data (in particular, the utterance data) used for speech synthesis.

[Conventional technology]

With recent advances in speech synthesis technology, speech synthesis has become relatively easy even on personal computers (hereinafter "PCs") and similar machines, by using speech synthesis means such as a speech synthesis card.

To perform such speech synthesis, however, voice data must be supplied to the speech synthesis means by some method. This voice data comprises syllable data (containing one or more syllables) and utterance data that determines the accent, pitch, speaking speed, and so on.

On PCs, the trend is to support a TALK statement in, for example, the BASIC language. An example of specifying voice data in BASIC in this way is given below.

For example, to have "Yamagata" (山形) spoken, the voice data to be specified consists of a character string indicating the syllables, followed by the accent position and other utterance data.

For example, to have "Yamagata" spoken with the accent on the second sound, one writes TALK "ヤマガタ:A2". In this TALK statement, "ヤマガタ" is the character string specifying the syllables to be spoken, the "A2" that follows is utterance data specifying that the accent falls on the second beat, and ":" is a delimiter.
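The operand format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `parse_talk_operand` is hypothetical, and because the text shows a single parameter ("A2"), the sketch handles exactly one symbol followed by an integer and does not guess at how several parameters would be combined.

```python
def parse_talk_operand(operand: str) -> tuple[str, dict[str, int]]:
    """Split a TALK operand such as 'ヤマガタ:A2' into its two parts.

    Returns the syllable string and a mapping from parameter symbol to value.
    Assumption: at most one parameter, written as one letter plus an integer.
    """
    # ':' is the delimiter between syllable data and utterance data.
    syllables, sep, params = operand.partition(":")
    utterance: dict[str, int] = {}
    if sep and params:
        symbol, value = params[0], params[1:]
        utterance[symbol] = int(value)
    return syllables, utterance
```

Called on "ヤマガタ:A2", the sketch yields the syllable string "ヤマガタ" and the accent parameter A with value 2.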

Thus, to specify voice data in the TALK statement supported by BASIC, various parameters such as the accent position must each be given as a symbol and a numerical value, in addition to the character string representing the syllables. In this example, "A2" specifies, as the accent parameter, which sound of the word carries the accent. Natural-sounding speech additionally requires intonation, pitch, speaking speed, and other parameters to be specified.

Specifying these many parameters with symbols and numerical values is not only cumbersome; it is also hard to grasp how a specified value relates to the resulting speech, and the result cannot be checked by ear on the spot. To hear the actual utterance, only the TALK statements contained in the program must be executed one at a time, which is itself a cumbersome operation.

One conceivable approach is therefore to provide a voice information dictionary that stores voice data, search it to read out the desired voice data, and, in the case of BASIC, substitute the read voice data into the TALK statement.

[Problems to be solved by the invention]

As described above, however, setting voice data with the conventional technology requires cumbersome operations, and the result cannot be checked on the spot. Creating a voice information dictionary therefore cannot avoid these cumbersome operations.

An object of the present invention is to make voice data easy to set, by operating on and specifying it on the screen through a visual image, and thereby to make a voice information dictionary easy to create.

[Means for solving the problem]

The voice information dictionary creation device of the present invention is an information processing apparatus comprising an input unit, a display device, a screen editing unit, an image conversion unit, a voice information dictionary, a registration unit, and a control unit.

In this device, the units operate as follows under control of the entire device by the control unit. The screen editing unit handles voice data, consisting of syllable data and utterance data, that is indicated by a character string specified via the input unit or by a character string read from the voice information dictionary. It displays the syllable data as-is on the display device as a character string, has the image conversion unit convert the utterance data into a visual image and displays that image, and, on an instruction from the input unit, modifies the visual image and displays it again. The image conversion unit converts the modified visual image into the corresponding utterance data. The registration unit registers the converted utterance data in the voice information dictionary together with the syllable data and a key for retrieval.

[Action]

The utterance data in the voice data (pitch, accent position, speed, and so on) is not specified merely by symbols and numerical values. Instead, the sound is converted into a visual image and displayed on the screen, and the display is changed by the operator's specification. This makes it easy to relate the display to the actual utterance, and so makes utterance data easy to generate and modify.

[Example]

FIG. 1 is a block diagram of the main parts showing the configuration of the voice information dictionary creation device according to the present invention. An embodiment of the present invention is described below with reference to the drawings.

Reference numeral 1 denotes an input unit, for example a keyboard; 2, a display device; 3, a storage device, for example a magnetic disk device or floppy disk device; 4, a screen editing unit; 5, an image conversion unit; 6, a registration unit; 7, a control unit; 8, a voice information dictionary stored in the storage device 3; and 9, a speech synthesis unit.

FIGS. 2(a) to 2(i) illustrate, as an embodiment of the present invention, an example of the operations for setting voice data using visual images.

This embodiment describes an example of registering the pronunciation (syllables) "konnichiwa" (こんにちは, "hello") with the desired utterance data.

First, the character string "コンニチハ" 10, which indicates the syllables of the voice to be registered, is specified from the input unit 1. As shown in FIG. 2(a), the screen editing unit 4 displays this character string 10 on the screen of the display device 2. Screen displays corresponding to the various kinds of utterance data (for example, the one denoted by 11) are also shown on the display device 2. Once all of the utterance data has been determined it is displayed on the screen at once, but the example here starts from a state in which nothing has been specified and specifies each item in turn. The kind of utterance data being displayed and specified can be switched from one to the next by operating the input unit 1. Although this embodiment uses a keyboard as the input unit 1, the input unit is not limited to a keyboard; it goes without saying that a pointing device such as a mouse, a light pen, or the like can also be used.

There are many kinds of utterance data. The example described here determines, in order, the accent position, the height of the beginning and end of the word, the pitch of the voice, the speaking speed, and finally the choice between a male and a female voice. The order in which these items are determined may be fixed by the processing program or, although not illustrated, may be specified from the input unit 1. When it is specified from the input unit 1, various methods are possible: for example, a kind of utterance data can be assigned to each function key so that pressing a function key selects the kind of utterance data to be determined, or the kinds of utterance data can be displayed on the screen so that the desired one is selected from among them.

In this embodiment, as noted above, the order is specified from the input unit 1. For example, when determination of the accent position is selected, an indicator mark 11 showing the accent position, for example '▼', is displayed at a predetermined position on the character string 10 (the beginning, in the figure), as shown in FIG. 2(a). By operating a shift key of the input unit 1 or the like, this indicator mark 11 can be moved to the desired position (the third syllable, in the figure), as shown in FIG. 2(b). During this operation it is also easy to have the device actually speak with the specified accent so that the result can be checked by ear.

Suppose it is thus decided to place the accent on "コ", the first character of the character string 10; the indicator mark 11 is then moved to the beginning of the character string 10 (see FIG. 2(c)). Besides adding an indicator mark 11 such as '▼', the accent position can also be shown by other methods, such as changing the color of the character at the accent position.
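The accent display and its movement can be sketched as a two-row text rendering. This is a minimal illustration, not the device's actual drawing routine: the function names `render_accent` and `move_mark` are hypothetical, the mark is assumed to sit on a row directly above the character string, and each character is assumed to be one full-width cell representing one sound.

```python
def render_accent(syllables: str, accent_pos: int) -> str:
    """Return a two-row picture: the '▼' mark above the accented character.

    accent_pos is 1-based, matching the "A2"-style numbering in the text.
    Padding uses the full-width space U+3000 so the mark lines up with
    full-width katakana in a monospaced display (an assumption).
    """
    if not 1 <= accent_pos <= len(syllables):
        raise ValueError("accent position outside the character string")
    mark_row = "\u3000" * (accent_pos - 1) + "▼"
    return mark_row + "\n" + syllables


def move_mark(accent_pos: int, step: int, length: int) -> int:
    """Shift the mark left or right, clamped to the ends of the string."""
    return max(1, min(length, accent_pos + step))
```

Starting from position 1 and calling `move_mark(1, 2, 5)` gives position 3, which corresponds to moving the mark from the first to the third character of "コンニチハ".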

Next, when determination of the height of the beginning and end of the word is selected, the first character "コ" and the last character "ハ" can be moved up and down with the shift key, as shown in FIG. 2(d). Here too, the result is checked by actually having the word spoken, and the desired heights of the beginning and end of the word are selected, as shown in FIG. 2(e).

Next, when determination of the pitch of the voice is selected, an indicator mark 12 showing the pitch of the voice is displayed in front of the character string 10, as shown in FIG. 2(f). The shift key is operated to move this indicator mark 12 up and down, and the desired pitch is determined by actually having the word spoken (see FIG. 2(g)).

Next, when determination of the speaking speed is selected, an indicator mark 13 showing the speaking speed is displayed on the screen, as shown in FIG. 2(h). The indicator mark 13 in the figure is drawn in the image of a speedometer; its pointer 14 is moved with the shift key or the numeric keys, and a suitable speed is selected, again by actually having the word spoken.
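The speedometer image implies a two-way mapping between a speed value and a pointer position. The sketch below shows one way that mapping could look. Everything specific in it is an assumption made for illustration: the function names, the 0 to 9 range of speed levels (chosen because the text mentions numeric keys), and the 180-degree dial sweep are not given in the source.

```python
def speed_to_angle(level: int, max_level: int = 9) -> float:
    """Map a speed level (0..max_level) to a pointer angle in degrees.

    Assumption: the dial sweeps 0 to 180 degrees, linear in the level.
    """
    if not 0 <= level <= max_level:
        raise ValueError("speed level out of range")
    return 180.0 * level / max_level


def angle_to_speed(angle: float, max_level: int = 9) -> int:
    """Inverse mapping: read the pointer angle back as the nearest level."""
    return round(angle * max_level / 180.0)
```

The second function plays the role of the image-to-data direction: whatever angle the operator leaves the pointer at is read back as a numerical speed value.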

When the choice between a male voice and a female voice is selected, an indicator mark 15 in the form of a male or female figure is displayed, as shown in FIG. 2(i). Pressing the shift key or a function key toggles between the two, so one of them is selected. Various display methods can be used for the indicator mark 15: both figures may be displayed with the selected one highlighted, or only the selected one may be displayed.

Once the desired utterance data has been set in this way, a key for retrieval is entered from the input unit 1 and registration is specified. As in a document processing device, the correspondence between a written form and the syllable data specified as its reading may be used so that the written form serves in place of the retrieval key. Registration can be specified by, for example, assigning 'register' to one of the function keys, or by displaying the available operations on the screen and selecting one of them.

When registration is specified, the image conversion unit 5 reads the screen state described above, that is, the position of the accent indicator mark 11, the positions of the first and last characters of the word, and so on, and converts it into the corresponding utterance data 22. The registration unit 6 writes the converted utterance data 22, together with the specified key, into the voice information dictionary 8 provided in the storage device 3.
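The conversion and registration step can be sketched as follows. This is an illustration under explicit assumptions rather than the patent's implementation: the class `ScreenState`, the functions `to_utterance_data` and `register`, and the example key are hypothetical. Only the accent symbol "A" appears in the text; the symbols H, T, P, S, and V and the space separator are invented here purely to show the symbol-plus-number idea.

```python
from dataclasses import dataclass


@dataclass
class ScreenState:
    """Hypothetical snapshot of the editing screen (field names illustrative)."""
    syllables: str    # character string 10
    accent_pos: int   # 1-based position of indicator mark 11
    head_level: int   # vertical offset of the first character
    tail_level: int   # vertical offset of the last character
    pitch: int        # height of indicator mark 12
    speed: int        # reading of pointer 14
    female: bool      # selection shown by indicator mark 15


def to_utterance_data(state: ScreenState) -> str:
    """Encode the screen state as symbol-plus-number pairs.

    Only "A" (accent) comes from the text; the other symbols are assumptions.
    """
    pairs = [
        ("A", state.accent_pos),
        ("H", state.head_level),
        ("T", state.tail_level),
        ("P", state.pitch),
        ("S", state.speed),
        ("V", 1 if state.female else 0),
    ]
    return " ".join(f"{symbol}{value}" for symbol, value in pairs)


def register(dictionary: dict, key: str, state: ScreenState) -> None:
    """Store the syllable string and encoded utterance data under the key."""
    dictionary[key] = (state.syllables, to_utterance_data(state))
```

A plain in-memory `dict` stands in for the dictionary held on the storage device; a real device would write the record to disk.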

FIG. 3 is a conceptual diagram showing the structure of the voice information dictionary 8. It shows the character string 10 and the various utterance data 22, determined as described above, stored in association with the specified key 21.

As shown in the figure, all of the utterance data 22 is recorded, in accordance with a predetermined description format, as combinations of a symbol indicating the utterance item and a numerical value indicating its degree. The desired voice can therefore be synthesized by specifying the key 21 corresponding to the desired voice data, reading out the character string 10 and the utterance data 22, and supplying them to the speech synthesis unit 9. In this embodiment the key 21 is composed of a combination of alphanumeric characters and digits, but this is only an example and the composition is arbitrary; the key may also be written in kanji and kana.

As described above, in this embodiment voice is converted into a visual image and displayed on the display device 2, voice data is set by operating the screen interactively, and the candidate data can actually be spoken using known techniques. Candidate data is therefore not only easy to set but can also be checked by ear, so the desired voice data can be set and registered easily and accurately. To modify voice data that has already been registered, it is read from the dictionary and processed in the same way.
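Editing an already registered entry implies the reverse conversion: stored symbol-plus-number pairs must be turned back into values that can position the marks on the screen. A minimal sketch of that reading step follows. The function name `parse_utterance_data` is hypothetical, and the record layout it accepts (one uppercase letter followed by an integer, pairs separated by anything) is an assumption, since the text does not spell out the description format.

```python
import re


def parse_utterance_data(record: str) -> dict[str, int]:
    """Read symbol-plus-number pairs from a stored utterance data record.

    Assumption: each item is one uppercase letter followed by an integer.
    """
    return {
        symbol: int(value)
        for symbol, value in re.findall(r"([A-Z])(-?\d+)", record)
    }
```

The resulting mapping, for instance the value stored under "A", would then determine where the accent mark is drawn when the entry is shown for modification.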

Being able to set and register the desired voice data simply and accurately in this way not only makes a general-purpose voice information dictionary easy to create; when building a system that uses speech synthesis, it also makes it possible to create at will a special-purpose dictionary, a dictionary containing only the required vocabulary, and so on.

Furthermore, this embodiment is not only convenient for system developers; it can also be used to let users of a system register whatever voice data they need while the system is in operation.

The above embodiment describes the case where the character string 10 is the five characters (five syllables) "コンニチハ", but this is only an example, and the number of characters in the character string 10 need not be limited. Accordingly, although the column for the character string 10 in FIG. 3 is five characters wide, this is merely for convenience of illustration, and as many digits as necessary can be used. The column for the utterance data 22 may likewise, of course, have as many digits as needed.

[Effect of the invention]

As described above, according to the present invention, the utterance data in the voice data (pitch, accent position, speed, and so on) is not specified merely by symbols and numerical values; the sound is converted into a visual image and displayed on the screen, and the display is changed by the operator's specification. This makes it easy to relate the display to the actual utterance, and so makes utterance data easy to generate and modify. Voice data can therefore be set easily and accurately, so the desired voice information dictionary can be created easily. This in turn facilitates building systems that use speech synthesis and developing their software.

[Brief description of the drawings]

FIG. 1 is a diagram explaining the configuration of the present invention, FIG. 2 is a diagram explaining an embodiment of the present invention, and FIG. 3 is a diagram showing the structure of the voice information dictionary created in that embodiment. In the figures, 1 is an input unit, 2 is a display device, 3 is a storage device, 4 is a screen editing unit, 5 is an image conversion unit, 6 is a registration unit, 7 is a control unit, and 8 is a voice information dictionary; 10 is a character string representing syllables; 11, 12, 13, 14, and 15 are indicator marks; 21 is a key for retrieval; and 22 is utterance data.

Claims

(57) [Claims]

A voice information dictionary creation device, being an information processing apparatus comprising an input unit, a display device, a screen editing unit, an image conversion unit, a voice information dictionary, a registration unit, and a control unit, characterized in that, under control of the entire device by the control unit: the screen editing unit, for voice data consisting of syllable data and utterance data indicated by a character string specified via the input unit or by a character string read from the voice information dictionary, displays the syllable data as-is on the display device as a character string, causes the image conversion unit to convert the utterance data into a visual image and displays the visual image, and, on an instruction from the input unit, modifies the visual image and displays it; the image conversion unit converts the modified visual image into the corresponding utterance data; and the registration unit registers the converted utterance data in the voice information dictionary together with the syllable data and a key for retrieval.