JP2003066991A

JP2003066991A - Method and apparatus for outputting voice recognition result and recording medium with program for outputting and processing voice recognition result recorded thereon

Info

Publication number: JP2003066991A
Application number: JP2001252027A
Authority: JP
Inventors: Masanobu Nishitani; 正信西谷; Yasunaga Miyazawa; 康永宮澤; Hiroshi Hasegawa; 浩長谷川
Original assignee: Seiko Epson Corp
Current assignee: Seiko Epson Corp
Priority date: 2001-08-22
Filing date: 2001-08-22
Publication date: 2003-03-05

Abstract

PROBLEM TO BE SOLVED: To allow a character string output as a voice recognition result to express the tone and feeling of a speaker. SOLUTION: The apparatus has a character parameter outputting part 4 and a character generating part 5. The character parameter outputting part 4 associates, time by time, the time information, volume information and pitch information obtained from the analysis results of a voice analyzing part 2 with the character string that is the recognition result, and outputs the associated information as character parameters. The character generating part 5 finds the speech duration time from the time information output by the character parameter outputting part 4; generates parameters expressing speech speed, voice volume and voice pitch from the speech duration time, the volume information and the pitch information; and generates, based on those parameters, a character string that expresses the speech speed, voice volume and voice pitch.

Description

Detailed Description of the Invention

[0001]

[Field of the Invention] The present invention relates to a voice information output method and a voice information output device that recognize voice and output the recognition result as a character string, and to a recording medium on which a voice recognition result output processing program is recorded.

[0002]

[Description of the Related Art] The result recognized by a voice recognition unit is output as text. To use this result, it is common to show it on a display or print it out on a printer. Here, text means character information consisting only of the character codes handled by a computer, with no character attributes attached. Consequently, when plain text is displayed or printed, the speaker's tone (voice pitch, voice volume, speech speed, and so on) is not expressed, and it is almost impossible to tell what emotion (joy, surprise, disappointment, and so on) the speaker had while speaking.

[0003] Also, when the utterances of several people are recognized and the results are output, as in the minutes of a meeting, the resulting text does not show which person spoke or under what circumstances.

[0004] For this reason, several techniques have been proposed that express the speaker's tone in the text output as a voice recognition result, or that make it possible to identify the speaker.

[0005] For example, techniques that add some character attribute to the text so that the speaker's tone can be expressed include Japanese Patent Laid-Open No. 2001-13993, "Voice Input Information Processing System" (hereinafter, the first prior art), and Japanese Patent Laid-Open No. 11-35290, "Voice Output Device and Control Method Therefor, Computer-Readable Memory" (hereinafter, the second prior art). An example of a technique that can identify the speaker is Japanese Patent Laid-Open No. 10-294798, "Multimedia Minutes Preparation Method and System" (hereinafter, the third prior art). In this document, the recognition result itself is called text, and text to which some character attribute has been added is called a character string.

[0006]

[Problems to Be Solved by the Invention] The first prior art identifies the sound pressure of the input voice and sets the character size according to its magnitude, or identifies the speech speed of the input voice and sets the character font according to that speed. For example, when a word processor outputs the character string for voice input, the speaker intentionally changes the sound pressure or the speech speed in order to control the character size or font accordingly.

[0007] However, the first prior art merely changes the character size and font according to the sound pressure and the speech speed. This alone does not adequately express the speaker's tone, and it does not allow the reader to infer what emotion the speaker had. Moreover, as noted above, when recognition results for the utterances of several speakers are output, the speaker cannot be identified, so it would be difficult to infer which speaker spoke in what tone.

[0008] The second prior art describes determining character attributes (font and color) from the voice frequency information (pitch information) and the volume information of the input voice. However, it goes no further than deciding the character color or size according to whether a value is above or below a reference value. For example, if the voice frequency of the input voice is higher than a reference frequency, the voice is regarded as female and the output characters are colored pink; if the volume of the input voice is higher than a reference volume, the output character size is set to, for example, 20 points.

[0009] Consequently, when the content of a meeting among several speakers is output as minutes, the second prior art would be quite unable to identify the speakers from the minutes or to express in fine detail the tone in which each speaker spoke.

[0010] The third prior art identifies the speakers in a remote video conference and creates minutes in which the font is changed for each speaker. Such minutes show only which speaker said what; they do not reveal the tone of each speaker. Moreover, speaker identification in the third prior art relies on a terminal or ID card unique to each speaker, so information for identifying the speakers must be input.

[0011] An object of the present invention is to provide a voice recognition result output method, a voice recognition result output device, and a recording medium recording a voice recognition result output processing program that appropriately express the tone of the speaker's utterance in the text output as a voice recognition result, so that the speaker's tone and the resulting emotion can be inferred merely by looking at the output character string, and that, by also having a speaker identification function, are well suited to tasks such as preparing the minutes of a meeting among several speakers.

[0012]

[Means for Solving the Problems] To achieve the above object, the voice recognition result output method of the present invention analyzes input voice, performs voice recognition using the voice analysis result, and outputs the recognition result. In this method, at least one of utterance duration information, volume information, and pitch information is acquired from the voice analysis result. At least one of the following parameters is then generated: a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information. Finally, using at least one of those parameters, a character string is generated in which at least one of speech speed, voice volume, and voice pitch is expressed.
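
The method in paragraph [0012] can be pictured as a small data structure plus one conversion step. The following is only a minimal sketch under stated assumptions: the names `CharParams` and `make_params`, the use of seconds and hertz as units, and the choice of "1 / duration" as the speech speed value are all hypothetical and are not taken from the patent text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharParams:
    """Display parameters for one character; None means 'not expressed'."""
    rate: Optional[float] = None      # derived from utterance duration
    loudness: Optional[float] = None  # derived from volume information
    pitch: Optional[float] = None     # derived from pitch information


def make_params(duration_s=None, volume=None, pitch_hz=None):
    """Build parameters from whichever measurements are available.

    At least one measurement is required, mirroring the "at least one"
    wording of the method. The formulas are illustrative assumptions.
    """
    if duration_s is None and volume is None and pitch_hz is None:
        raise ValueError("need at least one of duration, volume, pitch")
    rate = None
    if duration_s is not None:
        if duration_s <= 0:
            raise ValueError("duration must be positive")
        rate = 1.0 / duration_s  # characters per second for one character
    return CharParams(rate=rate, loudness=volume, pitch=pitch_hz)
```

For instance, `make_params(duration_s=0.5)` yields a speech speed value of 2.0 and leaves the loudness and pitch fields unset, which corresponds to expressing only the speech speed in the output character string.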

[0013] The voice recognition result output device of the present invention analyzes input voice, performs voice recognition using the voice analysis result, and outputs the recognition result. It comprises character parameter output means and character generation means. The character parameter output means takes at least one of the volume information and the pitch information associated with each item of time information obtained from the voice analysis result, associates it with the text obtained as the recognition result by means of that time information, and outputs the associated information as character parameters. The character generation means obtains utterance duration information from the time information supplied by the character parameter output means; generates at least one of a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information; and, using at least one of those parameters, generates a character string in which at least one of speech speed, voice volume, and voice pitch is expressed.

[0014] The recording medium of the present invention records a voice recognition result output processing program that analyzes input voice, performs voice recognition using the voice analysis result, and outputs the recognition result. The processing program includes: a procedure for acquiring at least one of utterance duration information, voice volume information, and voice pitch information from the voice analysis result; a procedure for generating at least one of a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information; and a procedure for generating, using at least one of those parameters, a character string in which at least one of speech speed, voice volume, and voice pitch is expressed.

[0015] In each of these inventions, when a character string expressing the speech speed is generated, the parameter expressing the speech speed is applied to each character individually, based on the utterance duration information obtained for each character of the character string.

[0016] Also, when a character string expressing the speech speed is generated, the parameter expressing the speech speed may be applied to each linguistic unit, based on the utterance duration information obtained for each linguistic unit making up the character string.
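
To illustrate how per-character durations might be pooled into linguistic units, the sketch below averages frame counts over consecutive units. It is an assumption-laden illustration: the function name `frames_per_char_by_unit`, the representation of durations as frame counts, and the use of a simple mean are hypothetical and are not specified in the text.

```python
def frames_per_char_by_unit(char_frames, unit_lengths):
    """Average duration (in frames) per character for each linguistic unit.

    char_frames:  number of frames spanned by each character, in order.
    unit_lengths: number of characters in each consecutive unit
                  (for example, each word or phrase).
    A smaller result means faster speech within that unit.
    """
    if any(n <= 0 for n in unit_lengths):
        raise ValueError("each unit must contain at least one character")
    if sum(unit_lengths) != len(char_frames):
        raise ValueError("unit lengths must cover every character exactly")
    result = []
    start = 0
    for n in unit_lengths:
        total = sum(char_frames[start:start + n])
        result.append(total / n)
        start += n
    return result
```

With three characters lasting 2, 4, and 2 frames, grouped into a two-character unit and a one-character unit, the function returns `[3.0, 2.0]`, so the second unit would be drawn as the faster one.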

[0017] The parameter expressing the speech speed, generated from the utterance duration information, is at least one of information that sets the spacing between the characters of the character string and information that sets the length of each character along the direction in which the character string runs.

[0018] When a character string expressing the voice volume is generated, the parameter expressing the voice volume is applied to each character individually, based on the volume information obtained for each character of the character string.

[0019] Also, when a character string expressing the voice volume is generated, the parameter expressing the voice volume may be applied to each linguistic unit, based on the volume information obtained for each linguistic unit making up the character string.

[0020] The parameter expressing the voice volume, generated from the volume information, is information that sets the thickness of the characters of the character string.

[0021] The parameter expressing the voice volume, generated from the volume information, may instead be information that sets the density (darkness) of the characters of the character string.

[0022] When a character string expressing the voice pitch is generated, the parameter expressing the voice pitch is applied to each character individually, based on the pitch information obtained for each character of the character string.

[0023] Also, when a character string expressing the voice pitch is generated, the parameter expressing the voice pitch may be applied to each linguistic unit, based on the pitch information obtained for each linguistic unit making up the character string.

[0024] The parameter expressing the voice pitch, generated from the pitch information, is information that sets the density (darkness) of the characters of the character string.

[0025] The parameter expressing the voice pitch, generated from the pitch information, may instead be information that sets the thickness of the characters of the character string.

[0026] Also, speaker identification information for identifying individual speakers is obtained from the data produced by the voice analysis, and that information is used to generate, for each speaker, a character string in which at least one of speech speed, voice volume, and voice pitch is expressed.

[0027] In addition to the speaker identification information, at least one of the speech speed information, volume information, and pitch information specific to each individual speaker is used to normalize the range that the parameter expressing speech speed can take, the range that the parameter expressing voice volume can take, and the range that the parameter expressing voice pitch can take.

[0028] As described above, the present invention generates at least one of a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information, and uses at least one of those parameters to generate a character string in which at least one of speech speed, voice volume, and voice pitch is expressed.

[0029] As a result, the reader can tell in what tone the speaker is speaking just by looking at the character string. In particular, when these parameters are used in combination, the speed of the utterance, the voice volume, and the voice pitch can all be read from the output character string. This reveals the speaker's tone at that moment, from which the speaker's emotion can also be inferred.

[0030] Applying the parameter expressing speech speed to each character individually makes it possible, for example, to read how the speed of utterance changes from character to character within a single spoken word, and thus to read from the output character string subtle changes in the speaker's emotion while that word was spoken.

[0031] Likewise, applying the parameter expressing speech speed to each linguistic unit, such as a word, a phrase (bunsetsu), a sequence of phrases, a sentence, or a group of sentences, makes it possible, for example, when a coherent passage is spoken, to read how the speed of utterance changes from one linguistic unit to the next within that passage. The speed of each linguistic unit then lets the reader pick up subtle changes in the speaker's emotion from the output character string.

[0032] This parameter expressing speech speed is at least one of information indicating the spacing between the characters of the character string and information indicating the length of each character along the direction in which the characters are arranged (called the character width). The faster the speech, the smaller the character spacing or the character width is made.
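
The rule "the faster the speech, the smaller the spacing or width" can be sketched as a clamped linear mapping. This is only an illustration under assumptions: the function name `rate_to_layout`, the rate unit (characters per second), the output unit (multiples of a nominal character size), and every default bound are hypothetical, since the text gives no concrete values.

```python
def rate_to_layout(rate, slow=2.0, fast=10.0,
                   max_spacing=1.0, min_spacing=0.0,
                   max_width=1.5, min_width=0.5):
    """Map a speech speed to (character spacing, character width).

    Speeds at or below `slow` get the largest spacing and width; speeds
    at or above `fast` get the smallest. Values in between are
    interpolated linearly. All bounds are illustrative assumptions.
    """
    if fast <= slow:
        raise ValueError("fast must be greater than slow")
    t = (rate - slow) / (fast - slow)
    t = min(1.0, max(0.0, t))  # clamp to the supported range
    spacing = max_spacing - t * (max_spacing - min_spacing)
    width = max_width - t * (max_width - min_width)
    return spacing, width
```

Under these defaults a slow utterance at rate 2.0 is drawn wide and loosely spaced, `(1.0, 1.5)`, while a fast one at rate 10.0 is drawn narrow and tight, `(0.0, 0.5)`.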

[0033] Because speech speed is expressed by character spacing and character width in this way, the reader can tell intuitively, just by looking at the character string, whether the speech is fast or slow; for example, whether the speaker is drawling or speaking rapidly.

[0034] Applying the parameter expressing voice volume to each character individually makes it possible, for example, to read how the voice volume changes from character to character within a single spoken word, and thus to read from the output character string subtle changes in the speaker's emotion while that word was spoken.

[0035] Likewise, applying the parameter expressing voice volume to each linguistic unit makes it possible, for example, when a coherent passage is spoken, to read how the voice volume changes from one linguistic unit to the next within that passage. The volume of each linguistic unit then lets the reader pick up subtle changes in the speaker's emotion from the output character string.

[0036] The parameter expressing voice volume, generated from the volume information, is information that sets the thickness or the density of the characters of the character string. Because voice volume is expressed by character thickness or character density in this way, the reader can tell intuitively, just by looking at the character string, whether the voice is loud or soft. For example, when part of an utterance is spoken with emphasis, the emphasized part can be recognized intuitively from the character string.
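
A loudness value could be turned into character thickness or density along the following lines. This is a hedged sketch: the function names, the assumption that loudness is already scaled to the range 0 to 1, the weight scale of 100 to 900, and the gray scale where 0 is black are all illustrative choices that do not come from the patent text.

```python
def loudness_to_weight(level):
    """Map loudness in [0, 1] to a stroke weight from 100 (thin) to 900 (bold).

    The 100..900 scale in steps of 100 is an assumed convention.
    """
    level = min(1.0, max(0.0, level))
    return 100 + 100 * round(level * 8)


def loudness_to_gray(level, lightest=200):
    """Map loudness in [0, 1] to a gray level, where 0 is black.

    Louder speech gives darker characters. `lightest` caps how pale the
    quietest characters become so that they stay legible (an assumption).
    """
    level = min(1.0, max(0.0, level))
    return round(lightest * (1.0 - level))
```

An emphasized word with loudness 1.0 would come out at weight 900 and gray level 0, while a quiet word with loudness 0.0 would come out at weight 100 and gray level 200.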

[0037] Applying the parameter expressing voice pitch to each character individually makes it possible, for example, to read how the voice pitch changes from character to character within a single spoken word, and thus to read from the output character string subtle changes in the speaker's emotion while that word was spoken.

[0038] Likewise, applying the parameter expressing voice pitch to each linguistic unit makes it possible, for example, when a coherent passage is spoken, to read how the voice pitch changes from one linguistic unit to the next within that passage. The pitch of each linguistic unit then lets the reader pick up subtle changes in the speaker's emotion from the output character string.

[0039] The parameter expressing voice pitch, generated from the pitch information, is information that sets the density or the thickness of the characters of the character string. Because voice pitch is expressed by character density or character thickness in this way, the reader can tell intuitively, just by looking at the character string, whether the voice is high or low, so that surprise, disappointment, and the like can be expressed in the character string.

[0040] Also, speaker identification information for identifying individual speakers is obtained from the result of the voice analysis, and that information is used to generate, for each speaker, a character string in which at least one of speech speed, voice volume, and voice pitch is expressed.

[0041] For example, the character string for each speaker is output in a different color or a different typeface, so that the reader can see at a glance whose utterance it is. In addition, each speaker's utterance is output as a character string in which at least one of speech speed, voice volume, and voice pitch is expressed. Just by looking at the output character string, the reader can therefore see at a glance which speaker is speaking in what tone, and can also infer the emotion of each speaker.
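
One way to picture per-speaker coloring combined with the tone parameters is to render each character as styled markup. The sketch below uses HTML with inline CSS purely as an assumed output format; the patent text does not name any markup. The speaker labels, the color table `SPEAKER_COLORS`, and the function `render_char` are hypothetical.

```python
from html import escape

# Hypothetical speaker-to-color table; real labels would come from
# whatever speaker identification step is used.
SPEAKER_COLORS = {"A": "blue", "B": "green"}


def render_char(ch, speaker, weight, spacing_em):
    """Render one character as an HTML span.

    The color identifies the speaker, font-weight stands in for voice
    volume, and letter-spacing stands in for speech speed.
    """
    color = SPEAKER_COLORS.get(speaker, "black")
    style = (f"color:{color};font-weight:{weight};"
             f"letter-spacing:{spacing_em}em")
    return f'<span style="{style}">{escape(ch)}</span>'
```

Joining the spans for consecutive characters would give a transcript in which color shows who spoke and the other attributes show how.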

[0042] In addition to the speaker identification information, at least one of the speech speed information, volume information, and pitch information specific to each individual speaker can be used to normalize, for each speaker's character string, the range that the parameter expressing speech speed can take, the range that the parameter expressing voice volume can take, and the range that the parameter expressing voice pitch can take.

[0043] As a result, regardless of each speaker's inherent speech speed, voice volume, and voice pitch, the ranges that the parameters expressing speech speed, voice volume, and voice pitch can take are set to be common to all speakers. The full extent of these ranges can then be used to determine the character spacing and character width corresponding to speech speed, the character thickness corresponding to voice volume, the character density corresponding to voice pitch, and so on.
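
Per-speaker normalization of this kind can be sketched as min-max scaling against each speaker's own range. This is an assumption: the text does not say which normalization formula is used, and the function name `normalize` and the example ranges are hypothetical.

```python
def normalize(value, speaker_min, speaker_max):
    """Scale a measurement into [0, 1] using the speaker's own range.

    speaker_min and speaker_max describe that speaker's habitual range
    for the feature (speech speed, volume, or pitch). Values outside
    the range are clamped.
    """
    if speaker_max <= speaker_min:
        raise ValueError("speaker_max must exceed speaker_min")
    t = (value - speaker_min) / (speaker_max - speaker_min)
    return min(1.0, max(0.0, t))
```

For a speaker whose pitch usually lies between 100 and 200 Hz, a pitch of 150 Hz maps to 0.5; for a speaker whose pitch lies between 200 and 400 Hz, 300 Hz also maps to 0.5. Both are therefore drawn with the same character density even though the raw pitches differ, which is what allows speakers to be compared on a common scale.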

[0044] Consequently, when the minutes of a meeting among several speakers are output, the speech speed, voice volume, voice pitch, and so on of each speaker can be compared on the same basis, which makes it easier to infer the atmosphere of the meeting.

[0045]

[Embodiments of the Invention] Embodiments of the present invention are described below. The description covers the voice recognition result output method and the voice recognition result output device of the present invention, and also the specific processing performed by the processing program on the recording medium that records the voice recognition result output processing program of the present invention.

[0046] [First Embodiment] FIG. 1 is a block diagram illustrating a first embodiment of the voice information output device of the present invention. The device comprises a voice input unit 1, a voice analysis unit 2, a voice recognition unit 3, a character parameter output unit 4, a character generation unit 5, and a character output unit 6.

[0047] The voice input unit 1 passes the voice uttered by the speaker to the voice analysis unit 2. The voice input to the voice input unit 1 may be voice spoken live into a microphone or the like, or voice recorded on a voice recorder or the like.

[0048] The voice analysis unit 2 analyzes the voice passed from the voice input unit 1 and extracts its features. For example, it analyzes voice such as that shown in FIG. 2(a), input to the voice input unit 1, in frames of short duration (for example, about 20 msec) and outputs the results labeled with frame numbers t0, t1, t2, ..., tn.
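
Frame-by-frame analysis of this kind can be sketched for the volume feature alone. The code below is a minimal illustration under assumptions: it uses the root mean square of the samples as the volume measure, non-overlapping frames, and a hypothetical function name `frame_volumes`; the patent text does not specify how volume or pitch is computed, and pitch extraction is omitted here.

```python
import math


def frame_volumes(samples, sample_rate, frame_ms=20):
    """Split samples into fixed-length frames and return one volume per frame.

    Volume is taken as the root mean square (RMS) of the frame, which is
    an assumed measure. Frames do not overlap, and a trailing partial
    frame is dropped. The list index plays the role of the frame number
    (t0, t1, ...).
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    volumes = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        volumes.append(rms)
    return volumes
```

At a sample rate of 8000 Hz, a 20 msec frame holds 160 samples, so 320 samples produce two volume values, one for frame t0 and one for frame t1.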

[0049] As shown in FIG. 2(b), the feature vector, the volume information (hereinafter simply "volume"), and the pitch information (hereinafter simply "pitch") extracted as analysis results are obtained for each of the frames t0, t1, t2, ..., tn. For frame number t0 (hereinafter "frame t0"; the same applies to the other frame numbers t1, t2, ...) they are feature vector c0, volume v0, and pitch p0; for frame t1 they are feature vector c1, volume v1, and pitch p1; and so on.

【００５０】この図２（ｂ）に示すデータのうち、特徴
ベクトルはフレーム番号とともに音声認識部３に渡さ
れ、音量、ピッチはフレーム番号とともに文字パラメー
タ出力部４に渡される。Of the data shown in FIG. 2(b), the feature vectors are passed to the voice recognition unit 3 together with the frame numbers, and the volume and pitch are passed to the character parameter output unit 4 together with the frame numbers.

【００５１】音声認識部３は音声分析部２から渡された
それぞれの時刻ごとの特徴ベクトル列を用いて音声認識
処理を行う。たとえば、図３（ａ）に示すように、音声
分析部２から渡されたフレームｔ０，ｔ１，・・・，ｔ
ｎごとの特徴ベクトルｃ０，ｃ１，・・・，ｃｎに基づ
いて、図３（ｂ）に示すように、それぞれのフレームｔ
０，ｔ１，・・・，ｔｎごとの音素、たとえば、フレー
ムｔ０は音素「ｓ」、フレームｔ１〜ｔ５は音素
「ｏ」、フレームｔ６は音素「ｄ」、フレームｔ７は音
素「ｅ」というように、それぞれのフレーム番号ごとの
音素を音素列として出力するとともに、その音素列から
テキストを生成する。この場合、生成されるテキストと
しては、「そうですか」であり、フレームｔ０，ｔ１が
「そ」、フレームｔ２〜ｔ５が「う」、フレームｔ３，
ｔ４が「で」というように対応付けられることで、「そ
うですか」というテキストが生成される。The voice recognition unit 3 performs voice recognition using the sequence of feature vectors, one per time step, passed from the voice analysis unit 2. For example, based on the feature vectors c0, c1, ..., cn for the frames t0, t1, ..., tn passed from the voice analysis unit 2 as shown in FIG. 3(a), it outputs the phoneme for each frame number as a phoneme string, as shown in FIG. 3(b): frame t0 is the phoneme "s", frames t1 to t5 are the phoneme "o", frame t6 is the phoneme "d", frame t7 is the phoneme "e", and so on. It also generates text from that phoneme string. In this case the generated text is "sou desu ka" ("Is that so?"): frames t0 and t1 are associated with "so", frames t2 to t5 with "u", frames t3 and t4 with "de", and so on, which yields the text "sou desu ka".

【００５２】この図３（ｂ）に示されるフレームｔ０，
ｔ１，・・・，ｔｎと、このそれぞれのフレームｔ０，
ｔ１，・・・，ｔｎに対応する音素と、この音素により
作成されるテキストは認識結果として文字パラメータ出
力部４に渡される。The frames t0, t1, ..., tn shown in FIG. 3(b), the phonemes corresponding to each of these frames, and the text created from these phonemes are passed to the character parameter output unit 4 as the recognition result.

【００５３】文字パラメータ出力部４は、音声認識部３
から渡されたフレーム番号、音素、テキスト（図３(b)
参照）と音声分析部２から渡されたフレーム番号、音
量、ピッチ（図２(b)参照）を、フレーム番号を用いて
統合し、図４のようなフレームｔ０，ｔ１，・・・，ｔ
ｎに対応する音素、テキスト、音量、ピッチの各情報を
文字パラメータとして文字生成部５に渡す。The character parameter output unit 4 uses the frame numbers to merge the frame numbers, phonemes, and text passed from the voice recognition unit 3 (see FIG. 3(b)) with the frame numbers, volume, and pitch passed from the voice analysis unit 2 (see FIG. 2(b)). It then passes the phoneme, text, volume, and pitch information for each of the frames t0, t1, ..., tn, as shown in FIG. 4, to the character generation unit 5 as character parameters.
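The merge performed by the character parameter output unit 4 can be sketched as a join on the frame number. This is only an illustration: the `CharParam` record, the dictionary inputs, and the function name `merge_by_frame` are assumptions made for the example and do not appear in the patent.

```python
from dataclasses import dataclass


@dataclass
class CharParam:
    """One row of a FIG. 4-style table (field names are hypothetical)."""
    frame: int
    phoneme: str
    char: str
    volume: float
    pitch: float


def merge_by_frame(recognized, analyzed):
    """Join recognition output and analysis output on the frame number.

    recognized: {frame: (phoneme, char)}   as in FIG. 3(b)
    analyzed:   {frame: (volume, pitch)}   as in FIG. 2(b)
    Only frames present in both inputs are kept.
    """
    merged = []
    for frame in sorted(recognized.keys() & analyzed.keys()):
        phoneme, char = recognized[frame]
        volume, pitch = analyzed[frame]
        merged.append(CharParam(frame, phoneme, char, volume, pitch))
    return merged


if __name__ == "__main__":
    rec = {0: ("s", "そ"), 1: ("o", "そ")}
    ana = {0: (0.5, 0.0), 1: (0.8, 220.0)}
    for row in merge_by_frame(rec, ana):
        print(row)
```

Keying both inputs by frame number mirrors the text, where the frame number is the only field shared by the two sources.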

【００５４】文字生成部５は文字パラメータ出力部４か
ら渡された各種の文字パラメータ（図４参照）に基づい
て、図５に示すように、文字の属性（文字間隔や文字
幅、文字の太さ、文字の濃さなど）を決定するパラメー
タを生成して、そのパラメータに基づいた文字を生成す
る。この文字生成部５で生成された文字は文字出力部６
へ渡される。以下、この文字生成部５の処理について説
明する。Based on the various character parameters (see FIG. 4) passed from the character parameter output unit 4, the character generation unit 5 generates parameters that determine character attributes (character spacing, character width, character thickness, character density, and so on), as shown in FIG. 5, and generates characters based on those parameters. The characters generated by the character generation unit 5 are passed to the character output unit 6. The processing of the character generation unit 5 is described below.

【００５５】この文字生成部５は、文字パラメータ出力
部４から渡された文字パラメータ（図４参照）を用い
て、図５に示すように、文字属性を決めるためのパラメ
ータを生成する。The character generation unit 5 uses the character parameter (see FIG. 4) passed from the character parameter output unit 4 to generate a parameter for determining the character attribute as shown in FIG.

【００５６】すなわち、図４に示すような文字パラメー
タ出力部４から渡された文字パラメータのうち、フレー
ム番号、テキスト（音素情報を含む）を用いてテキスト
を構成する各文字の発話継続時間を求め、その発話継続
時間から文字属性を決めるパラメータとして、話速を表
現する文字間隔と文字幅の少なくとも一方を決定するた
めのパラメータを生成する。That is, of the character parameters passed from the character parameter output unit 4 as shown in FIG. 4, the frame number and the text (including phoneme information) are used to obtain the utterance duration of each character forming the text. As a parameter for determining a character attribute from the utterance duration, a parameter for determining at least one of a character interval and a character width expressing a speech speed is generated.

【００５７】また、文字パラメータ出力部４から渡され
た文字パラメータのうち、フレーム番号、テキスト（音
素情報を含む）、音量を用い、これらフレーム番号、テ
キスト（音素情報を含む）、音量から、文字属性を決め
るパラメータとして、声の大きさに対応した文字の太さ
を決定するためのパラメータを生成する。Of the character parameters passed from the character parameter output unit 4, the frame number, the text (including phoneme information), and the volume are used to generate, as a parameter determining a character attribute, a parameter for determining the character thickness corresponding to the loudness of the voice.

【００５８】さらに、文字パラメータ出力部４から渡さ
れる文字パラメータのうち、フレーム番号、テキスト
（音素情報を含む）、ピッチ情報を用い、これらフレー
ム番号、テキスト（音素情報を含む）、ピッチ情報か
ら、文字属性を決めるパラメータとして、声の高さに対
応した文字の濃さ（以下、濃度という）を決定するため
のパラメータを生成する。Further, of the character parameters passed from the character parameter output unit 4, the frame number, the text (including phoneme information), and the pitch information are used to generate, as a parameter determining a character attribute, a parameter for determining the darkness of the character (hereinafter "density") corresponding to the pitch of the voice.

【００５９】そして、これらそれぞれの文字属性を決め
るパラメータを用いて、それぞれの文字属性を有した文
字を生成し、それを文字出力部６に渡し、文字出力部６
から文字列として出力する。以下に、具体例を用いて本
発明の実施の形態を詳細に説明する。Then, using these parameters, characters having the respective character attributes are generated and passed to the character output unit 6, which outputs them as a character string. Embodiments of the present invention are described in detail below using specific examples.

【００６０】まず、発話継続時間から文字間隔と文字幅
の少なくとも一方を決定する処理について説明するが、
まず、文字間隔を決定する場合について説明する。ここ
で、発話継続時間は、正確にはそれぞれの文字を形成す
る音素の始端と終端の時刻の差であるが、これは音素を
構成するフレーム数にほぼ比例するので、以下の説明で
は、発話継続時間をフレーム数で表す。また、それぞれ
の音素に対応するフレーム数は実際には多数のフレーム
で構成されるが、ここでは、説明を簡単にするために、
特に、短い子音などについては１つのフレームで構成さ
れているような表記となっている。First, the process of determining at least one of the character spacing and the character width from the utterance duration is described, beginning with the character spacing. Strictly speaking, the utterance duration is the difference between the start and end times of the phonemes forming each character, but because this is roughly proportional to the number of frames making up the phonemes, the following description expresses the utterance duration as a number of frames. In addition, each phoneme actually spans many frames, but to simplify the explanation, short consonants in particular are written here as if they consisted of a single frame.

【００６１】図６（ａ）において、「そうですか」の
「そう」について見ると、フレームｔ０が音素「ｓ」に
対応し、フレームｔ１〜ｔ５が音素「ｏ」に対応してい
る。このフレームｔ１〜ｔ５の音素「ｏ」は、「そ」を
構成する「ｏ」とその後に続く「う」に対応する「ｏ」
でもある。なお、音声認識側では、音素「ｏ」に対応す
るフレームｔ１〜ｔ５のどこまでのフレームが「そ」を
構成する音素「ｏ」であって、どこからのフレームがそ
の後に続く「う」に対応するフレームであるかを区切る
ことができないので、その区切りを何らかの規則によっ
て設定しておく必要がある。Looking at the "sou" of "sou desu ka" in FIG. 6(a), frame t0 corresponds to the phoneme "s" and frames t1 to t5 correspond to the phoneme "o". The phoneme "o" of frames t1 to t5 is both the "o" that forms "so" and the "o" corresponding to the "u" that follows. The voice recognition side cannot tell up to which of the frames t1 to t5 the phoneme "o" belongs to "so" and from which frame it belongs to the following "u", so the boundary must be set by some rule.

【００６２】ここでは、フレームｔ１，ｔ２を「そ」を
構成する「ｏ」に対応するフレームとし、フレームｔ２
〜ｔ５を「う」を構成する「ｏ」に対応するフレームと
している。したがって、この場合、「そ」の発話継続時
間はフレームｔ０〜ｔ２に対応し、「う」の発話継続時
間はフレームｔ３〜ｔ５に対応する。Here, frames t1 and t2 are taken as the frames corresponding to the "o" that forms "so", and frames t2 to t5 as the frames corresponding to the "o" that forms "u". In this case, therefore, the utterance duration of "so" corresponds to frames t0 to t2, and the utterance duration of "u" corresponds to frames t3 to t5.

【００６３】このようにして、それぞれの文字に対する
発話継続時間を求める。この発話継続時間は、発話継続
時間が短ければ話速が速く、発話継続時間が長ければ話
速が遅いので話速を表している。In this way, the utterance duration for each character is obtained. This utterance duration indicates the speech speed because the speech speed is high when the speech duration time is short, and the speech speed is slow when the speech duration time is long.
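As a rough illustration of measuring utterance duration in frames, the sketch below counts how many frames are assigned to each character. It assumes the phoneme boundary rule has already been applied, so that every frame carries the index of the character it belongs to; the function name and input layout are hypothetical.

```python
from collections import Counter


def durations_in_frames(text, frame_to_char_index):
    """Return [(character, number_of_frames)] in text order.

    frame_to_char_index[i] is the position in `text` of the character
    that frame t_i was assigned to (an assumed representation).
    """
    counts = Counter(frame_to_char_index)
    return [(ch, counts.get(i, 0)) for i, ch in enumerate(text)]


if __name__ == "__main__":
    # Frames t0..t2 belong to "そ" and t3..t5 to "う", as in the text.
    print(durations_in_frames("そう", [0, 0, 0, 1, 1, 1]))
```

Indexing by character position rather than by the character itself keeps repeated characters in a text apart.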

【００６４】なお、このそれぞれの文字ごとの発話継続
時間は、それぞれの文字ごとに独立した発話継続時間と
して用いることもできるし、たとえば、ある１つの言語
構成単位（この言語構成単位とは前述したように、単
語、文節、連文節、文、複数文などを指している）ごと
に、これら言語構成単位を構成する個々の文字について
得られた発話継続時間の平均値を求め、その平均値をそ
の言語構成単位の発話継続時間とすることもできる。The utterance duration of each character can be used as an independent duration for that character. Alternatively, for each language constituent unit (as described earlier, a word, phrase, group of phrases, sentence, multiple sentences, and so on), the average of the utterance durations obtained for the individual characters making up that unit can be computed and used as the utterance duration of the unit.

【００６５】たとえば、この実施の形態で用いた「そう
ですか」について考えた場合、文字ごとに独立した発話
継続時間として用いる場合は、「そ」の発話継続時間、
「う」の発話継続時間、「で」の発話継続時間、「す」
の発話継続時間、「か」の発話継続時間というように、
それぞれの文字ごとの発話継続時間を求めることができ
る。この場合、文字ごとに文字間隔（次の文字との間の
文字間隔）が設定されるので、文字ごとに話速が表現さ
れる。For example, for the "sou desu ka" used in this embodiment, when independent per-character durations are used, an utterance duration can be obtained for each character: the duration of "so", of "u", of "de", of "su", and of "ka". In this case the character spacing (the gap to the next character) is set per character, so the speech speed is expressed character by character.

【００６６】また、１つの言語構成単位ごとの発話継続
時間として用いる場合は、「そ」、「う」、「で」、
「す」、「か」の個々の文字について求められた発話継
続時間の平均値を求め、その平均値を「そうですか」と
いう言語構成単位の発話継続時間とみなすことができ
る。この場合、１つの言語構成単位ごとにその言語構成
単位を構成する文字の文字間隔が設定されるので、１つ
の言語構成単位で話速が表現される。When the duration is used per language constituent unit, the average of the utterance durations obtained for the individual characters "so", "u", "de", "su", and "ka" is computed and regarded as the utterance duration of the language constituent unit "sou desu ka". In this case the spacing of the characters making up a language constituent unit is set per unit, so the speech speed is expressed unit by unit.

【００６７】このようにして、個々の文字ごと、あるい
は、１つの言語構成単位ごとに発話継続時間が得られた
ら、その発話継続時間をいくつかの段階に分けて、それ
ぞれの段階ごとに文字間隔を決定する。Once the utterance duration has been obtained in this way for each character or for each language constituent unit, the utterance duration is divided into several stages and a character spacing is determined for each stage.

【００６８】たとえば、発話継続時間を発話継続時間の
短い順に、Ｔ１，Ｔ２，・・・，Ｔ５というように５段
階に設定したとすれば、図６（ｂ）に示すように、これ
ら各段階に応じた文字間隔を取得できるテーブルを用意
しておく。For example, if the utterance duration is divided into five stages T1, T2, ..., T5 in order of increasing duration, a table from which the character spacing for each stage can be obtained is prepared, as shown in FIG. 6(b).

【００６９】この図６（ｂ）のテーブルの例では、発話
継続時間がＴ３に属する場合を標準とし、得られた発話
継続時間がこのＴ３に属する場合には、標準の文字間隔
（これをｄで表す）としている。In the table example of FIG. 6(b), the case where the utterance duration belongs to T3 is taken as the standard: when the obtained utterance duration belongs to T3, the standard character spacing (denoted d) is used.

【００７０】そして、発話継続時間が短いほど（話速が
速いほど）、文字間隔を小さくして、発話継続時間が長
いほど（話速が遅いほど）、文字間隔を大きくするよう
な設定としている。この図６（ｂ）のテーブル例では、
発話継続時間がＴ１に属する場合には文字間隔はｄ／４
とし、発話継続時間がＴ２に属する場合には文字間隔は
ｄ／２とし、発話継続時間がＴ４に属する場合には文字
間隔は1.5ｄとし、発話継続時間がＴ５に属する場合に
は文字間隔は２ｄとしている。The table is set so that the shorter the utterance duration (the faster the speech), the smaller the character spacing, and the longer the utterance duration (the slower the speech), the larger the character spacing. In the table example of FIG. 6(b), the character spacing is d/4 when the utterance duration belongs to T1, d/2 when it belongs to T2, 1.5d when it belongs to T4, and 2d when it belongs to T5.
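The FIG. 6(b) lookup can be sketched as a two-step table: classify the duration into a stage, then read the spacing factor for that stage. The factors d/4, d/2, d, 1.5d, and 2d come from the text; the frame-count boundaries between T1 to T5 are not given in the source, so `STAGE_UPPER_BOUNDS` below is a purely hypothetical choice.

```python
from bisect import bisect_left

# Spacing as a multiple of the standard spacing d, per FIG. 6(b).
SPACING_FACTOR = {1: 0.25, 2: 0.5, 3: 1.0, 4: 1.5, 5: 2.0}

# Hypothetical inclusive upper bounds (in frames) for stages T1..T4;
# anything longer falls into T5. The patent does not state these values.
STAGE_UPPER_BOUNDS = (2, 4, 6, 8)


def stage_of(n_frames):
    """Return the stage number 1..5 (T1..T5) for a duration in frames."""
    return bisect_left(STAGE_UPPER_BOUNDS, n_frames) + 1


def character_spacing(n_frames, d=1.0):
    """Return the spacing to the next character for the given duration."""
    return SPACING_FACTOR[stage_of(n_frames)] * d


if __name__ == "__main__":
    for n in (1, 3, 5, 7, 9):
        print(n, "frames -> T%d" % stage_of(n), character_spacing(n, d=10.0))
```

Keeping the stage classification separate from the factor table means the same `stage_of` result could drive other attribute tables as well.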

【００７１】したがって、発話継続時間を文字ごとに独
立して文字間隔を設定する場合、つまり、「そうです
か」のそれぞれの文字ごとに文字間隔を設定する場合に
は、それぞれの文字の発話継続時間が、Ｔ１からＴ５の
どの段階に属するかを判定するだけで、図６（ｂ）のテ
ーブルにより、その段階に対応する文字間隔（次に続く
文字との間隔）を得ることができる。Therefore, when the character spacing is set independently for each character on the basis of its utterance duration, that is, when a spacing is set for each character of "sou desu ka", it suffices to determine which of the stages T1 to T5 the utterance duration of each character belongs to; the table of FIG. 6(b) then gives the character spacing (the gap to the next character) for that stage.

【００７２】たとえば、話者が「そうですか」を「そ
う」の部分を間延びして発話し、その後に続く「です
か」を短く発話したとする。この場合、「そ」の発話継
続時間がＴ４に属し、「う」の発話継続時間もＴ４に属
し、「で」の発話継続時間がＴ３に属し、「す」と
「か」の発話継続時間がＴ１に属するとすれば、このと
きの「そうですか」の発話に対して生成される文字列
は、ぞれぞれの文字間隔が図７のように設定された文字
列となる。For example, suppose the speaker drawls the "sou" part of "sou desu ka" and then utters the following "desu ka" quickly. If the utterance durations of "so" and "u" both belong to T4, that of "de" belongs to T3, and those of "su" and "ka" belong to T1, then the character string generated for this utterance of "sou desu ka" has its character spacings set as shown in FIG. 7.

【００７３】また、文字間隔を１つの言語構成単位で設
定することもできる。この場合、「そうですか」を１つ
の言語構成単位として、その言語構成単位を構成する文
字全体をある文字間隔とするが、そのときの発話継続時
間は、前述したように、「そうですか」を構成するそれ
ぞれの文字ごとに求められた発話継続時間の平均値を求
め、その平均値が、図６（ｂ）のテーブルのＴ１，Ｔ
２，・・・，Ｔ５のどの発話継続時間に属するかを判定
し、それに対応する文字間隔を取得して、その取得した
文字間隔を用いて「そうですか」という単語を生成す
る。The character spacing can also be set per language constituent unit. In this case, "sou desu ka" is treated as one language constituent unit and all of its characters are given the same spacing. As described above, the utterance duration used is the average of the utterance durations obtained for the individual characters of "sou desu ka"; it is determined which of the stages T1, T2, ..., T5 in the table of FIG. 6(b) this average belongs to, the corresponding character spacing is obtained, and the word "sou desu ka" is generated using that spacing.

【００７４】たとえば、「そうですか」を構成するそれ
ぞれの文字ごとに求められた発話継続時間の平均値が、
Ｔ２に属するとすれば、この「そうですか」は図８に示
すように、「そ」、「う」、「で」、「す」、「か」の
それぞれ文字は、ｄ／２の文字間隔となる。For example, if the average of the utterance durations obtained for the individual characters of "sou desu ka" belongs to T2, then each of the characters "so", "u", "de", "su", and "ka" is given a character spacing of d/2, as shown in FIG. 8.
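The per-unit variant can be sketched by averaging the per-character durations first and then doing a single table lookup for the whole unit. As before, the stage boundaries in `STAGE_UPPER_BOUNDS` and the function names are assumptions for illustration; only the spacing factors are taken from FIG. 6(b).

```python
from bisect import bisect_left
from statistics import mean

SPACING_FACTOR = (0.25, 0.5, 1.0, 1.5, 2.0)  # T1..T5, per FIG. 6(b)
STAGE_UPPER_BOUNDS = (2, 4, 6, 8)            # hypothetical, in frames


def unit_spacing(char_durations, d=1.0):
    """Spacing shared by every character of one language constituent unit."""
    average = mean(char_durations)
    stage_index = bisect_left(STAGE_UPPER_BOUNDS, average)  # 0 -> T1
    return SPACING_FACTOR[stage_index] * d


def layout_unit(text, char_durations, d=1.0):
    """Pair every character of the unit with the same spacing."""
    gap = unit_spacing(char_durations, d)
    return [(ch, gap) for ch in text]


if __name__ == "__main__":
    # Average duration is 3 frames, which falls in T2 under the assumed
    # bounds, so every character gets d/2 as in the FIG. 8 example.
    print(layout_unit("そうですか", [3, 3, 4, 3, 2], d=10.0))
```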

【００７５】また、上述の例では、それぞれの文字の発
話継続時間が短いか長いか、つまり、発話速度が速いか
遅いかによって、文字間隔を変えるようにしたが、文字
間隔ではなく、１つ１つの文字の幅を変えるようにする
こともできる。In the above example, the character interval is changed depending on whether the utterance duration of each character is short or long, that is, whether the utterance speed is fast or slow. It is also possible to change the width of one character.

【００７６】この場合も上述同様、発話継続時間をＴ
１，Ｔ２，・・・，Ｔ５というように５段階に設定した
とすれば、図９に示すように、それぞれの発話継続時間
ごとに、文字幅を取得できるテーブルを用意しておく。
なお、ここでいう文字幅というのは、それぞれの文字に
おいて、文字列の並び方向における個々の文字の長さを
いう。In this case too, as above, if the utterance duration is divided into five stages T1, T2, ..., T5, a table from which the character width for each utterance duration stage can be obtained is prepared, as shown in FIG. 9. The character width here means the length of each individual character along the direction in which the character string runs.

【００７７】この図８のテーブル例でも、発話継続時間
がＴ３に属する場合を標準とし、発話継続時間がｓ３に
属する場合には、標準の文字幅（これをｗで表す）とす
るものとしている。そして、発話継続時間が短いほど
（話速が速いほど）、文字幅を小さくし、発話継続時間
が長いほど（話速が遅いほど）、文字幅を大きくするよ
うな設定となっている。In the table example of FIG. 9 as well, the case where the utterance duration belongs to T3 is taken as the standard: when the utterance duration belongs to T3, the standard character width (denoted w) is used. The table is set so that the shorter the utterance duration (the faster the speech), the smaller the character width, and the longer the utterance duration (the slower the speech), the larger the character width.

【００７８】この図９のテーブル例では、発話継続時間
がＴ１に属する場合には、文字幅はｗ／４とし、発話継
続時間がＴ２に属する場合には、文字幅はｗ／２とし、
発話継続時間がＴ４に属する場合には、文字幅は1.5ｗ
とし、発話継続時間がＴ５に属する場合には、文字幅は
２ｗとしている。In the table example of FIG. 9, the character width is w/4 when the utterance duration belongs to T1, w/2 when it belongs to T2, 1.5w when it belongs to T4, and 2w when it belongs to T5.

【００７９】たとえば、上述した例をそのまま用いて、
話者が「そうですか」を「そう」の部分を間延びして発
話し、その後に続く「ですか」を短く発話したとする。
この場合、「そ」と「う」の発話継続時間がそれぞれＴ
４に属し、「で」の発話継続時間がＴ３に属し、「す」
と「か」の発話継続時間がＴ２に属するとすれば、この
ときの「そうですか」の発話に対して生成される文字列
は、図１０のような文字列となる。For example, using the same example as above, suppose the speaker drawls the "sou" part of "sou desu ka" and then utters the following "desu ka" quickly. If the utterance durations of "so" and "u" each belong to T4, that of "de" belongs to T3, and those of "su" and "ka" belong to T2, then the character string generated for this utterance of "sou desu ka" is as shown in FIG. 10.

【００８０】また、文字幅を１つの言語構成単位で設定
することもできる。この場合、「そうですか」を１つの
言語構成単位として、その言語構成単位を構成する文字
全体をある文字幅とするが、そのときの発話継続時間
は、前述したように、「そうですか」を構成するそれぞ
れの文字ごとに求められた発話継続時間の平均値を求
め、その平均値が、図９のテーブルのＴ１，Ｔ２，・・
・，Ｔ５のどの発話継続時間に属するかを判定し、それ
に対応する文字幅を取得して、その取得した文字幅を用
いて「そうですか」という単語を出力する。The character width can also be set per language constituent unit. In this case, "sou desu ka" is treated as one language constituent unit and all of its characters are given the same width. As described above, the utterance duration used is the average of the utterance durations obtained for the individual characters of "sou desu ka"; it is determined which of the stages T1, T2, ..., T5 in the table of FIG. 9 this average belongs to, the corresponding character width is obtained, and the word "sou desu ka" is output using that width.

【００８１】たとえば、「そうですか」を構成するそれ
ぞれの文字ごとに求められた発話継続時間の平均値が、
Ｔ２に属するとすれば、この「そうですか」は図１１に
示すように、「そ」、「う」、「で」、「す」、「か」
のそれぞれ文字は、ｗ／２の文字幅となる。For example, if the average of the utterance durations obtained for the individual characters of "sou desu ka" belongs to T2, then each of the characters "so", "u", "de", "su", and "ka" is given a character width of w/2, as shown in FIG. 11.

【００８２】なお、上述の例では発話継続時間から文字
間隔または文字幅を決定するようにしたが、たとえば、
発話継続時間が短い場合には、文字間隔を狭め、かつ、
文字幅を小さくするというように、両者を組み合わせる
ようにしてもよい。In the examples above, either the character spacing or the character width is determined from the utterance duration, but the two may also be combined: for example, when the utterance duration is short, the character spacing is narrowed and the character width is reduced as well.
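One way to picture the combined case is to drive both tables from the same stage number. The factors below are the ones listed for FIG. 6(b) and FIG. 9; the function name and the idea of returning the two values as a pair are assumptions made for this sketch.

```python
# Stage number (1..5 for T1..T5) -> multiple of the standard value.
SPACING_FACTOR = {1: 0.25, 2: 0.5, 3: 1.0, 4: 1.5, 5: 2.0}  # FIG. 6(b)
WIDTH_FACTOR = {1: 0.25, 2: 0.5, 3: 1.0, 4: 1.5, 5: 2.0}    # FIG. 9


def spacing_and_width(stage, d=1.0, w=1.0):
    """Return (character spacing, character width) for one duration stage."""
    return SPACING_FACTOR[stage] * d, WIDTH_FACTOR[stage] * w


if __name__ == "__main__":
    # A short utterance (T1) narrows the gap and the glyph together.
    print(spacing_and_width(1, d=8.0, w=4.0))
    # A long utterance (T5) widens both.
    print(spacing_and_width(5, d=8.0, w=4.0))
```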

【００８３】次に、時刻、テキスト、音量から文字の太
さを決定する場合について説明する。ここで用いる音量
とは、それぞれの文字に対応するフレームごとに得られ
た音量の平均値を求めてその平均値をその文字の音量と
している。Next, a case where the thickness of a character is determined from time, text and volume will be described. The volume used here is an average value of volume obtained for each frame corresponding to each character, and the average value is used as the volume of the character.

【００８４】たとえば、「そ」の文字について考える
と、この例では、図１２（ａ）に示すように、「そ」は
フレームｔ０の音素「ｓ」とフレームｔ１の音素「ｏ」
の２つの音素で構成されているので、フレームｔ０に対
応する音素「ｓ」の音量ｖ０とフレームｔ１に対応する
音素「ｏ」の音量ｖ１の平均値を求めそれを「そ」の音
量とする。For example, consider the character "so". In this example, as shown in FIG. 12(a), "so" consists of two phonemes, the phoneme "s" of frame t0 and the phoneme "o" of frame t1, so the average of the volume v0 of the phoneme "s" at frame t0 and the volume v1 of the phoneme "o" at frame t1 is computed and used as the volume of "so".

【００８５】同様に、「う」の文字について考えると、
この「う」は、フレームｔ２の音素「ｏ」からフレーム
ｔ５の音素「ｏ」までの音素で構成され、フレームｔ２
に対応する音素「ｏ」の音量ｖ２、フレームｔ３に対応
する音素「ｏ」の音量ｖ３、フレームｔ４に対応する音
素「ｏ」の音量ｖ４、フレームｔ５に対応する音素
「ｏ」の音量ｖ５の平均値を求め、それを「う」の音量
とする。Similarly, consider the character "u". This "u" consists of the phonemes from the phoneme "o" of frame t2 to the phoneme "o" of frame t5, so the average of the volume v2 of the phoneme "o" at frame t2, the volume v3 at frame t3, the volume v4 at frame t4, and the volume v5 at frame t5 is computed and used as the volume of "u".

【００８６】このようにして、それぞれの文字に対する
音量を求める。なお、このそれぞれの文字ごとの音量
は、それぞれの文字ごとに独立した音量として用いるこ
ともできるし、たとえば、ある１つの言語構成単位の音
量として用いることもできる。In this way, the volume of each character is obtained. The volume of each character can be used as an independent volume for each character, or can be used as the volume of a certain language constituent unit, for example.

【００８７】たとえば、この実施の形態で用いた「そう
ですか」について考えた場合、文字ごとに独立した音量
として用いる場合は、「そ」の音量、「う」の音量、
「で」の音量、「す」の音量、「か」の音量というよう
にそれぞれの文字ごとの音量として求めてそれをそれぞ
れの文字に反映させることができる。この場合、文字ご
とに文字の太さが設定されるので、文字ごとに声の大き
さが表現される。For example, for the "sou desu ka" used in this embodiment, when independent per-character volumes are used, a volume can be obtained for each character (the volume of "so", of "u", of "de", of "su", and of "ka") and reflected in that character. In this case the character thickness is set per character, so the loudness of the voice is expressed character by character.

【００８８】また、１つの言語構成単位の音量として用
いる場合は、「そ」、「う」、「で」、「す」、「か」
のそれぞれについて求められた音量の平均値を求め、そ
の平均値を「そうですか」という言語構成単位全体の音
量としみなし、それを「そうでずか」という言語構成単
位全体に反映させることもできる。この場合、１つの言
語構成単位ごとにその言語構成単位を構成する文字の文
字の太さが設定されるので、１つの言語構成単位で声の
大きさが表現される。When used as the volume of one language constituent unit, "so", "u", "de", "su", "ka"
It is also possible to obtain the average value of the volume obtained for each of the above, and regard that average value as the volume of the entire language building block "Is it?" it can. In this case, since the thickness of the characters forming the language constituent unit is set for each language constituent unit, the volume of the voice is expressed by one language constituent unit.

【００８９】このようにして、１つ１つの文字ごとある
いは１つの言語構成単位ごとに音量が得られたら、その
音量をいくつかの段階に分けて、それぞれの段階ごとに
文字の太さ決定する。Once the volume has been obtained in this way for each character or for each language constituent unit, the volume is divided into several stages and a character thickness is determined for each stage.

【００９０】たとえば、音量を音量の小さい順に、Ｖ
１，Ｖ２，・・・，Ｖ５というように５段階に設定した
とすれば、図１２（ｂ）に示すように、それぞれの段階
に対応した文字の太さを取得できるテーブルを用意して
おく。なお、このようなテーブルを文字ごとあるいは単
語ごとに用意しておいてもよい。For example, if the volume is divided into five stages V1, V2, ..., V5 in order of increasing volume, a table from which the character thickness for each stage can be obtained is prepared, as shown in FIG. 12(b). Such a table may also be prepared for each character or for each word.

【００９１】この図１２（ｂ）のテーブル例では、段階
Ｖ３を標準とし、得られた音量がこのＶ３に属する場合
には、標準の文字の太さ（これをｂで表す）としてい
る。そして、音量が大きいほど、文字の太さを太くし、
音量が小さいほど、文字の太さを細くするような設定と
なっている。In the table example of FIG. 12(b), stage V3 is taken as the standard: when the obtained volume belongs to V3, the standard character thickness (denoted b) is used. The table is set so that the louder the volume, the thicker the character, and the quieter the volume, the thinner the character.

【００９２】この図１２（ｂ）のテーブル例では、求め
られた音量がＶ１に属する場合には、文字太さはｂ／４
とし、求められた音量がＶ２に属する場合には、文字太
さはｂ／２とし、求められた音量がＶ３に属する場合に
は、文字太さは標準の太さｂとし、求められた音量がＶ
４に属する場合には文字太さは1.5ｂとし、求められた
音量がＶ５に属する場合には、文字太さは２ｂとしてい
る。In the table example of FIG. 12(b), the character thickness is b/4 when the obtained volume belongs to V1, b/2 when it belongs to V2, the standard thickness b when it belongs to V3, 1.5b when it belongs to V4, and 2b when it belongs to V5.
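The volume path can be sketched in the same style: average the frame volumes of a character, classify the average into V1 to V5, and read the thickness factor. The factors b/4 to 2b follow FIG. 12(b) as described in the text; the volume scale (normalized to 0..1) and the stage boundaries in `VOLUME_UPPER_BOUNDS` are assumptions, since the source gives neither.

```python
from bisect import bisect_left
from statistics import mean

THICKNESS_FACTOR = (0.25, 0.5, 1.0, 1.5, 2.0)  # V1..V5, per FIG. 12(b)
VOLUME_UPPER_BOUNDS = (0.2, 0.4, 0.6, 0.8)     # hypothetical, volume in 0..1


def char_volume(frame_volumes):
    """Volume of one character: mean of the volumes of its frames."""
    return mean(frame_volumes)


def character_thickness(frame_volumes, b=1.0):
    """Thickness for one character, as a multiple of the standard b."""
    stage_index = bisect_left(VOLUME_UPPER_BOUNDS, char_volume(frame_volumes))
    return THICKNESS_FACTOR[stage_index] * b


if __name__ == "__main__":
    print(character_thickness([0.9, 0.95], b=2.0))  # loud: V5 under these bounds
    print(character_thickness([0.1], b=2.0))        # quiet: V1 under these bounds
```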

【００９３】したがって、文字ごとに独立して文字の太
さを設定する場合、つまり、「そうですか」のそれぞれ
の文字ごとに文字の太さを設定する場合には、それぞれ
の文字の音量がＶ１からＶ５のどれに属するかを判定だ
けで、図１２（ｂ）のテーブルにより、その段階に対応
する文字の太さを得ることができる。Therefore, when the character thickness is set independently for each character, that is, when a thickness is set for each character of "sou desu ka", it suffices to determine which of V1 to V5 the volume of each character belongs to; the table of FIG. 12(b) then gives the character thickness for that stage.

【００９４】たとえば、話者が「そうですか」を「そ
う」の部分を強調して大きな声で発話し、その後に続く
「ですか」を弱い口調で発話したとする。この場合、
「そ」の音量がＶ５に属し、「う」の音量がＶ４に属
し、「で」の音量がＶ３に属し、「す」と「か」の音量
がＶ１に属するとすれば、このときの「そうですか」の
発話に対して生成される文字列は、図１３のよう文字列
となる。For example, suppose the speaker emphasizes the "sou" part of "sou desu ka" in a loud voice and then utters the following "desu ka" in a weak tone. If the volume of "so" belongs to V5, that of "u" belongs to V4, that of "de" belongs to V3, and those of "su" and "ka" belong to V1, then the character string generated for this utterance of "sou desu ka" is as shown in FIG. 13.

【００９５】この図１３によれば、「そうですか」とい
う文字列は、「そう」が強調され（特に「そ」の部分が
強調されている）た文字列となっている。According to FIG. 13, the character string "sou desu ka" is rendered with "sou" emphasized (the "so" part in particular).

【００９６】また、文字の太さを１つの言語構成単位で
設定することもできる。この場合、「そうですか」を１
つの言語構成単位として、その言語構成単位を構成する
文字全体をある文字の太さとするが、そのときの音量
は、前述したように、「そうですか」を構成するそれぞ
れの文字ごとに求められた音量の平均値を求め、その平
均値が、図１２（ｂ）のテーブルのＶ１，Ｖ２，・・
・，Ｖ５のどの音量に属するかを判定し、それに対応す
る文字太さを取得して、その取得した文字太さを用いて
「そうですか」という言語構成単位を生成する。The character thickness can also be set per language constituent unit. In this case, "sou desu ka" is treated as one language constituent unit and all of its characters are given the same thickness. As described above, the volume used is the average of the volumes obtained for the individual characters of "sou desu ka"; it is determined which of the stages V1, V2, ..., V5 in the table of FIG. 12(b) this average belongs to, the corresponding character thickness is obtained, and the language constituent unit "sou desu ka" is generated using that thickness.

【００９７】たとえば、「そうですか」を構成するそれ
ぞれの文字ごとに求められた音量の平均値が、Ｖ２であ
ったとすると、この「そうですか」は図１４に示すよう
に、「そ」、「う」、「で」、「す」、「か」のそれぞ
れ文字がｂ／２の太さで表された文字列となる。For example, if the average of the volumes obtained for the individual characters of "sou desu ka" belongs to V2, then "sou desu ka" becomes a character string in which each of the characters "so", "u", "de", "su", and "ka" is rendered with thickness b/2, as shown in FIG. 14.

【００９８】次に、時刻、テキスト、ピッチから文字の
濃度を決定する場合について説明する。ここで用いるピ
ッチとは、それぞれの文字に対応する時刻ごとに得られ
たピッチの平均値を求めてその平均値をその文字のピッ
チとしている。なお、ここでは、便宜上、全ての区間で
ピッチを求めるような記述となっているが、実際は、無
声子音などのピッチの求められない部分は除外して計算
する場合もある。このようなピッチの求められない部分
については、たとえば、その前後あるいは後続の母音の
部分を用いてピッチを求めるようにすることもできる。Next, the case of determining the character density from the time, the text, and the pitch is described. The pitch used here is the average of the pitch values obtained at each time step corresponding to a character, and that average is taken as the pitch of the character. For convenience, the description here is written as if the pitch were obtained in every section, but in practice portions where no pitch can be obtained, such as unvoiced consonants, may be excluded from the calculation. For such portions, the pitch can instead be obtained from, for example, the vowel portions before and after them or the vowel that follows.
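A minimal sketch of the per-character pitch average with unvoiced frames excluded is shown below. Representing an unvoiced frame as `None`, and passing the neighbouring vowel's pitch in as `fallback`, are assumptions made for the example; the source only says such portions may be excluded or estimated from adjacent vowels.

```python
from statistics import mean


def char_pitch(frame_pitches, fallback=None):
    """Mean pitch over the voiced frames of one character.

    frame_pitches: one entry per frame, None where no pitch was found
                   (for example an unvoiced consonant).
    fallback:      value to use when no frame is voiced, such as the
                   pitch of an adjacent vowel (hypothetical parameter).
    """
    voiced = [p for p in frame_pitches if p is not None]
    if voiced:
        return mean(voiced)
    return fallback


if __name__ == "__main__":
    # "そ": unvoiced "s" at t0, voiced "o" at t1 and t2.
    print(char_pitch([None, 200.0, 220.0]))
    # A character with no voiced frame borrows a neighbouring pitch.
    print(char_pitch([None], fallback=180.0))
```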

【００９９】たとえば、「そ」の文字について考える
と、この例では、図１５（ａ）に示すように、「そ」は
フレームｔ０の音素「ｓ」とフレームｔ１の音素「ｏ」
の２つの音素で構成されているので、フレームｔ０にお
ける音素「ｓ」のピッチｐ０とフレームｔ１における音
素「ｏ」のピッチｐ１の平均値を求めそれを「そ」のピ
ッチとする。For example, consider the character "so". In this example, as shown in FIG. 15(a), "so" consists of two phonemes, the phoneme "s" of frame t0 and the phoneme "o" of frame t1, so the average of the pitch p0 of the phoneme "s" at frame t0 and the pitch p1 of the phoneme "o" at frame t1 is computed and used as the pitch of "so".

【０１００】同様に、「う」の文字について考えると、
この「う」は、フレームｔ２の音素「ｏ」からフレーム
ｔ５の音素「ｏ」までの音素で構成され、フレームｔ２
における音素「ｏ」のピッチｐ２、フレームｔ３におけ
る音素「ｏ」のピッチｐ３、フレームｔ４における音素
「ｏ」のピッチｐ４、フレームｔ５における音素「ｏ」
のピッチｐ５の平均値を求め、それを「う」のピッチと
する。Similarly, consider the character "u". This "u" consists of the phonemes from the phoneme "o" of frame t2 to the phoneme "o" of frame t5, so the average of the pitch p2 of the phoneme "o" at frame t2, the pitch p3 at frame t3, the pitch p4 at frame t4, and the pitch p5 at frame t5 is computed and used as the pitch of "u".

【０１０１】このようにして、それぞれの文字に対する
ピッチを求める。なお、このそれぞれの文字ごとのピッ
チは、それぞれの文字ごとに独立したピッチとして用い
ることもできるし、たとえば、ある１つの言語構成単位
のピッチとして用いることもできる。In this way, the pitch for each character is obtained. The pitch for each character can be used as an independent pitch for each character, or can be used, for example, as the pitch of a certain language constituent unit.

【０１０２】たとえば、この実施の形態で用いた「そう
ですか」について考えた場合、文字ごとに独立したピッ
チとして用いる場合は、「そ」のピッチ、「う」のピッ
チ、「で」のピッチ、「す」のピッチ、「か」のピッチ
というようにそれぞれの文字ごとのピッチとして求めて
それをそれぞれの文字に反映させることができる。この
場合、文字ごとに濃度が設定されるので、文字ごとに声
の高さが表現される。For example, in the case of "Is that so?" Used in this embodiment, when used as an independent pitch for each character, the pitch of "So", the pitch of "U", the pitch of "de" , “Su” pitch, “ka” pitch, etc., and the pitch can be obtained for each character and reflected in each character. In this case, since the density is set for each character, the pitch of the voice is expressed for each character.

【０１０３】また、ある１つの言語構成単位の音量とし
て用いる場合は、「そ」、「う」、「で」、「す」、
「か」のそれぞれについて求められたピッチの平均値を
求め、その平均値を「そうですか」という１つの言語構
成単位のピッチとみなし、それを「そうですか」という
１つの言語構成単位全体に反映させることもできる。こ
の場合、１つの言語構成単位ごとに濃度が設定されるの
で、１つの言語構成単位で声の高さが表現されたものと
なる。When the pitch is used per language constituent unit, the average of the pitches obtained for "so", "u", "de", "su", and "ka" can be computed, regarded as the pitch of the single language constituent unit "sou desu ka", and reflected in the whole unit. In this case the density is set per language constituent unit, so the pitch of the voice is expressed unit by unit.

【０１０４】たとえば、文字の濃度を図１５（ｂ）に示
すように、Ｄ０からＤｎまでの範囲内で変化させること
が可能であるとすれば、上述の文字ごとに求められたピ
ッチがこの図１５（ｂ）に示す濃度の変化のどこに対応
させるか決めておき、求められたピッチに対応した濃度
を取得して、その取得した濃度を用いてその文字を出力
する。For example, if the character density can be varied within the range from D0 to Dn as shown in FIG. 15(b), it is decided in advance which point of the density range shown in FIG. 15(b) each per-character pitch obtained as above corresponds to; the density corresponding to the obtained pitch is then looked up and the character is output using that density.

【０１０５】ここでは、ピッチが高ければ高いほど文字
の濃度（Ｄ０側）を高くし、ピッチが小さければ低けれ
ば低いほど文字の濃度を低く（Ｄｎ側）するような設定
とし、それぞれの文字ごとまたは単語やセンテンスのピ
ッチの取り得る範囲を図１５の濃度の変化の範囲Ｄ０〜
Ｄｎに対応させ、それぞれ求められるピッチごとに濃度
を取得するものとする。Here, the setting is such that the higher the pitch, the higher the character density (toward D0), and the lower the pitch, the lower the character density (toward Dn). The range of pitch values that can occur for each character, or for a word or sentence, is mapped onto the density range D0 to Dn of FIG. 15, and a density is obtained for each pitch value.
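The pitch-to-density mapping can be sketched as follows. The source only states that the possible pitch range is mapped onto D0 (highest density, highest pitch) to Dn (lowest density, lowest pitch); the linear interpolation, the clamping of out-of-range values, and the parameter names used here are assumptions.

```python
def density_index(pitch, pitch_min, pitch_max, n):
    """Return i such that the character is drawn with density D_i.

    i = 0 is the darkest level (highest pitch) and i = n the lightest
    (lowest pitch). A linear mapping is assumed for illustration.
    """
    if pitch_max <= pitch_min:
        raise ValueError("pitch_max must be greater than pitch_min")
    clamped = min(max(pitch, pitch_min), pitch_max)
    ratio = (pitch_max - clamped) / (pitch_max - pitch_min)
    return round(ratio * n)


if __name__ == "__main__":
    # Hypothetical pitch range of 100..300 Hz mapped onto D0..D10.
    for hz in (300, 200, 100):
        print(hz, "Hz -> D%d" % density_index(hz, 100, 300, 10))
```

Passing a different `pitch_min`/`pitch_max` per character, word, or sentence would correspond to the per-unit pitch ranges mentioned in the text.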

【０１０６】たとえば、「そ」の文字に対して求められ
たピッチがＰ１であったとし、そのＰＩが図１５（ｂ）
において、Ｄ０に対応付けられているとすれば、濃度Ｄ
０がパラメータとして出力され、それによって「そ」は
濃度Ｄ０（この例では、濃度が最大）で作成される。For example, if the pitch obtained for the character "so" is P1 and P1 is associated with D0 in FIG. 15(b), the density D0 is output as a parameter, and "so" is rendered with density D0 (the maximum density in this example).

【０１０７】 Likewise, suppose the average pitch obtained for the character "u" is P5, and P5 is associated with D2 in FIG. 15(b). The density D2 is then output as the parameter, and "u" is rendered at density D2.

【０１０８】 For example, suppose the speaker utters "sou desu ka" ("Is that so?") with the "sou" part in a very high voice and the following "desu ka" at a medium pitch.

【０１０９】 In this case, if the pitches of "so" and "u" are each associated with density D0 in FIG. 15(b), and the pitches of "de", "su", and "ka" are each associated with density Di in FIG. 15(b), the output from the character output unit for this utterance of "sou desu ka" is as shown in FIG. 16.

【０１１０】 In FIG. 16, the character string "sou desu ka" shows that "sou" was uttered in a high voice and "desu ka" in a lower voice.

【０１１１】 The character density can also be determined per language constituent unit. In this case, "sou desu ka" ("Is that so?") is treated as one language constituent unit and all of its characters are rendered at a single density. As described above, the pitch used is obtained by further averaging the average pitches obtained for the individual characters of "sou desu ka". The density in FIG. 15(b) that corresponds to this average is looked up, and the language constituent unit "sou desu ka" is output at that density.

【０１１２】 For example, if the average over the whole language constituent unit, obtained by further averaging the per-character average pitches of "sou desu ka", corresponds to D0 in FIG. 15(b), then "sou desu ka" is rendered as shown in FIG. 17, with the entire language constituent unit at the maximum density.

【０１１３】 In the example above, a density is obtained for each value of the pitch (average pitch). Alternatively, the density range shown in FIG. 15(b) may be divided into several stages. The stage to which the obtained average pitch belongs is then determined, and the density assigned to that stage is used.

【０１１４】 For example, although not illustrated, the density range D0 to Dn shown in FIG. 15(b) may be divided into five stages Q1, Q2, ..., Q5. If the obtained pitch falls in stage Q1, a representative value such as density D0 is output as the parameter; if the obtained average pitch falls in stage Q2, density D2 is output as its representative value; and so on, so that the density defined for each stage is obtained.

【０１１５】 In this case, a table holding the density for each stage is prepared in advance. Once the interval containing the obtained average pitch has been determined, a table lookup readily yields the density corresponding to that pitch.
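
The staged table lookup can be sketched as below. Only the pairings Q1 with D0 and Q2 with D2 come from the text; the boundary values in Hz, the remaining representative densities, and the name `density_for_average_pitch` are hypothetical. The sketch also assumes that Q1 is the highest-pitch stage, which the pairing with D0 suggests but the text does not state outright.

```python
from bisect import bisect_right

# Hypothetical stage boundaries in Hz, ascending. Four boundaries split
# the pitch axis into five stages.
STAGE_BOUNDARIES_HZ = [120.0, 160.0, 200.0, 240.0]

# Representative density per stage, lowest-pitch stage first
# (Q5, Q4, Q3, Q2, Q1). Q1 -> D0 and Q2 -> D2 follow the text;
# the other three entries are placeholders.
STAGE_DENSITY = ["D8", "D6", "D4", "D2", "D0"]


def density_for_average_pitch(avg_pitch_hz):
    """Find the stage containing the pitch, then look up its density."""
    stage = bisect_right(STAGE_BOUNDARIES_HZ, avg_pitch_hz)
    return STAGE_DENSITY[stage]
```

Keeping the boundaries and the representative densities in two parallel lists means the stage layout can be changed without touching the lookup function.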

【０１１６】 When the density is determined per character from the pitch, the description above averages the pitch over each character and sets the character's density from that average. The method is not limited to this: the density can also be determined from the pitch of each phoneme making up the character (the pitch at each point in time).

【０１１７】 This makes it possible to vary the density within a single character, that is, along the direction of time (the direction of the character string).

【０１１８】 For example, FIG. 18 shows the character string output when "sou desu ka" ("Is that so?") is uttered with gradually rising pitch. The density increases smoothly over the whole string, and a change in density is visible within each individual character.

【０１１９】 Consider the character "su". Its density is obtained from the pitch values of the phonemes at the individual points in time that make up "su". Because the pitch rises gradually in this utterance, "su" is output as a character whose density increases gradually even within the single character, as shown in FIG. 18. The same applies to the other characters in FIG. 18.
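
A density that changes inside one character can be sketched by computing one density value per analysis frame instead of one per character. The sketch assumes a linear pitch-to-density scale with 0.0 as the lightest and 1.0 as the darkest value; that scale, the function name `densities_within_character`, and the Hz values are hypothetical.

```python
def densities_within_character(frame_pitches_hz, pitch_min_hz, pitch_max_hz):
    """Return one density per frame, ordered along the character string.

    Each density lies in [0.0, 1.0], where 1.0 is the darkest. A renderer
    could paint the character as a left-to-right gradient over these values.
    """
    span = pitch_max_hz - pitch_min_hz
    if span <= 0:
        raise ValueError("pitch_max_hz must exceed pitch_min_hz")
    densities = []
    for pitch in frame_pitches_hz:
        # Clamp so that out-of-range pitches stay inside the density scale.
        clamped = min(max(pitch, pitch_min_hz), pitch_max_hz)
        densities.append((clamped - pitch_min_hz) / span)
    return densities
```

For a character whose frames rise from 100 Hz to 200 Hz on a 100 Hz to 200 Hz scale, the densities rise from 0.0 to 1.0, which corresponds to the gradually darkening "su" described for FIG. 18.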

【０１２０】 As described above, the first embodiment uses the frame number, text, volume, and pitch obtained by voice analysis. From the utterance duration (speech speed), which is based on the number of frames, it obtains a parameter that determines the character spacing or character width, and outputs the character string with the spacing, width, or combination of the two that matches the speech speed. From the volume it obtains a parameter that determines the character thickness, and outputs the character string with a thickness that matches the loudness of the voice. From the pitch it obtains a parameter that determines the density of each character, and outputs the character string with a density that matches the pitch of the voice.

【０１２１】 As a result, the character string output from the character output unit 6 is not merely a sequence of characters representing the recognition result. The character spacing, character width, character thickness, and character density differ for each character, or for each language constituent unit such as a word, phrase, linked phrases, sentence, or group of sentences. The character spacing and width show whether the speaker spoke quickly or slowly, the character thickness shows whether the speaker spoke loudly or softly, and the character density shows whether the speaker spoke in a high or low voice. The speaker's tone at that moment can thus be read, and the speaker's emotions can be inferred from it.

【０１２２】 Needless to say, the utterance duration, volume, and pitch described above can also be used in combination.

【０１２３】 For example, fast, loud, high-pitched speech yields a character string of thick, dark characters at narrow spacing. Conversely, slow, soft, low-pitched speech yields thin, light characters at wide spacing. Fast, loud, low-pitched speech yields thick, light characters at narrow spacing. The character string thus expresses the characteristics of the speaker's tone well, and conveys the speaker's emotions all the more.
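
The three combinations listed in this paragraph can be reproduced with a small sketch that derives each character attribute independently. The two-way split against reference values is a simplification made for illustration (the embodiment uses graded parameters), and the names `CharStyle` and `combine_style` and the reference values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharStyle:
    spacing: str    # "narrow" or "wide", driven by speech speed
    thickness: str  # "thick" or "thin", driven by volume
    density: str    # "dark" or "light", driven by pitch


def combine_style(speed, volume, pitch, ref_speed, ref_volume, ref_pitch):
    """Derive the three attributes independently and combine them."""
    return CharStyle(
        spacing="narrow" if speed > ref_speed else "wide",
        thickness="thick" if volume > ref_volume else "thin",
        density="dark" if pitch > ref_pitch else "light",
    )
```

Because each attribute depends on only one measured quantity, any subset of the three can be applied without changing the others.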

【０１２４】 FIG. 19 shows examples of character strings output by combining the character-attribute parameters obtained from the utterance duration, volume, and pitch. FIG. 19(a) is the character string "sou desu ka" ("Is that so?") output without applying the present invention, and is treated here as the standard form.

【０１２５】 FIG. 19(b) shows an example of the character string generated by applying the present invention when "sou desu ka" is uttered in a strong tone and somewhat as a question. The leading "sou" is rendered thick and somewhat dark, showing that it was uttered loudly and at a fairly high pitch. The final "ka" is rendered dark and thick, showing that its pitch was high and the voice was loud.

【０１２６】 The "desu" part has a low density and a thickness close to the standard form, showing a somewhat low pitch and normal loudness. As for the speech speed, the spacing between "de" and "su" is narrow, so that part was spoken somewhat quickly, while the other parts were spoken more slowly than the standard form.

【０１２７】 From these observations, it can be inferred that this "sou desu ka" was uttered as a fairly forceful question, since the tone is somewhat strong overall and the ending is high-pitched.

【０１２８】 FIG. 19(c) shows an example of the character string generated by applying the present invention when "sou desu ka" is uttered in a somewhat dejected tone. Here the speech speed is slow overall; in particular, the stretch from "u" to "de" within "sou de" is drawn out. In addition, the "desu ka" part has a low character density and somewhat thin characters, showing a low and somewhat soft voice.

【０１２９】 From these observations, it can be inferred that this "sou desu ka" was uttered in a somewhat dejected tone overall.

【０１３０】 As described above, by combining the parameters obtained from the utterance duration, volume, and pitch and expressing them in the character string, the speaker's tone at that moment can be understood simply by looking at the string, and the speaker's emotions can be inferred from it.

【０１３１】 Pitch information also indicates how high or low a voice is and, in particular, clearly reflects the difference between male and female voices. It can therefore also be used to express the speaker's gender in the character string.

【０１３２】 For example, a reference pitch value is set in advance. If the obtained pitch is at or above the reference value, the speaker is judged to be female and the characters are rendered in a reddish color; if it is below the reference value, the speaker is judged to be male and the characters are rendered in a blackish color. Color coding by gender is thus possible, so that whether the speaker is male or female can be known simply by looking at the character string output as the recognition result. Where color cannot be rendered, the typeface or the like can instead be varied between male and female speakers.
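
The threshold rule in this paragraph amounts to a single comparison, sketched below. The text gives no reference value, so the 165 Hz default is purely a placeholder, and the name `color_for_pitch` is hypothetical.

```python
# Placeholder reference pitch; the text only says a reference value is
# set in advance and does not give a number.
REFERENCE_PITCH_HZ = 165.0


def color_for_pitch(pitch_hz, reference_hz=REFERENCE_PITCH_HZ):
    """Return a reddish color at or above the reference, else a blackish one."""
    return "red" if pitch_hz >= reference_hz else "black"
```

A pitch exactly equal to the reference value falls on the reddish side, matching the "at or above" wording of the paragraph.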

【０１３３】 When the utterance duration, volume, and pitch described above are combined in this way, and gender is additionally expressed through pitch, the character string output as the recognition result makes it all the easier to see what kind of speaker is speaking, in what tone, and with what emotion.

【０１３４】 [Second Embodiment] FIG. 20 is a block diagram illustrating the second embodiment of the present invention. It differs from the block diagram of FIG. 1 in that a speaker identification unit 7 is added. The rest is the same as in FIG. 1, so identical parts carry identical reference numerals.

【０１３５】 The speaker identification unit 7 identifies the speaker using the frame number, feature vector, volume, and pitch obtained by the voice analysis unit 2. It passes to the character parameter output unit 4, as speaker information, the resulting speaker identification information (information that identifies the speaker, such as speaker A, speaker B, or speaker C) together with values derived from each speaker's characteristic speech speed, volume, and pitch: the average speech speed as standard speech-speed information for that speaker, the average volume as standard volume information, and the average pitch as standard pitch information.

【０１３６】 The average speech speed, average volume, and average pitch of each speaker may be obtained in advance for each speaker, or may be obtained by learning from the results analyzed by the voice analysis unit 2.

【０１３７】 Thus, in the second embodiment, the information input to the character parameter output unit 4 includes, in addition to the recognition result from the voice recognition unit 3 and the pitch, volume, and frame number from the voice analysis unit 2, the speaker information from the speaker identification unit 7.

【０１３８】 This speaker information consists of information identifying the individual speaker (for example, speaker A, speaker B, or speaker C), obtained from the frame number, feature vector, volume, and pitch supplied by the voice analysis unit 2 as shown in FIG. 21(a), and the average speech speed, average volume, average pitch, and so on for each of these speakers as shown in FIG. 21(b). This information is passed to the character parameter output unit 4.

【０１３９】 The character parameter output unit 4 then integrates the recognition result from the voice recognition unit 3, the pitch, volume, and frame number from the voice analysis unit 2, and the speaker information from the speaker identification unit 7, and passes the integrated information (frame number, text of the recognition result, volume, pitch, and speaker information) to the character generation unit 5.
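
One way to picture the integrated information is as a record per frame that carries the five listed items. The sketch below is an assumption about the data layout, which the text does not define; the class names `SpeakerInfo` and `FrameRecord`, the function `integrate`, and the input shapes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpeakerInfo:
    speaker_id: str        # e.g. "A", "B", "C"
    average_speed: float   # standard speech speed for this speaker
    average_volume: float  # standard volume for this speaker
    average_pitch: float   # standard pitch for this speaker


@dataclass(frozen=True)
class FrameRecord:
    frame_number: int
    text: str              # recognized text aligned to this frame
    volume: float
    pitch: float
    speaker: SpeakerInfo


def integrate(analysis_frames, text_by_frame, speaker):
    """Merge analysis output, recognition output, and speaker information.

    analysis_frames: iterable of (frame_number, volume, pitch) tuples.
    text_by_frame:   dict mapping frame_number to recognized text.
    """
    return [
        FrameRecord(number, text_by_frame.get(number, ""), volume, pitch, speaker)
        for number, volume, pitch in analysis_frames
    ]
```

Frames without recognized text receive an empty string, so the record list stays aligned with the frame numbers.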

【０１４０】 Thus, in the second embodiment, the speaker information is passed to the character generation unit 5 together with the frame number, the text of the recognition result, the volume, and the pitch, so the output can indicate which speaker each recognized character string belongs to.

【０１４１】 For example, when several speakers (speaker A, speaker B, speaker C, ...) are holding a meeting, the recognition result for each utterance can be associated with the information identifying its speaker (speaker A, speaker B, speaker C, ...).

【０１４２】 This makes it possible, for example, to add speaker-identifying information to the character string generated by the character generation unit 5 described in the first embodiment. For example, the color of the character string can be changed for each speaker (red for speaker A's utterances, black for speaker B's, blue for speaker C's, and so on), so that it is clear at a glance whose utterance it is.

【０１４３】 In a system that can display only black and white, the speakers may instead be distinguished by changing the typeface of the character string for each speaker (for example, Mincho for speaker A, Gothic for speaker B, and italic for speaker C).
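
The choice between color and typeface can be sketched as a lookup with a fallback. The color and typeface assignments follow the examples in the text; the function name `speaker_markup`, the `supports_color` flag, and the dictionary-shaped result are hypothetical.

```python
# Assignments taken from the examples in the text.
SPEAKER_COLORS = {"A": "red", "B": "black", "C": "blue"}
SPEAKER_TYPEFACES = {"A": "Mincho", "B": "Gothic", "C": "italic"}


def speaker_markup(speaker_id, supports_color):
    """Return the attribute used to mark a speaker's character strings.

    Color is used when the output system can show it; otherwise the
    typeface distinguishes the speakers.
    """
    if supports_color:
        return {"color": SPEAKER_COLORS[speaker_id]}
    return {"typeface": SPEAKER_TYPEFACES[speaker_id]}
```

An unknown speaker identifier raises `KeyError` in this sketch; a real system would need a policy for speakers outside the table.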

【０１４４】 When speaker-identifying information is added to the character string generated by the character generation unit 5 in this way, and the content of a meeting among several speakers is recognized and output as minutes, the minutes show at a glance which speaker said what.

【０１４５】 As described in the first embodiment, the character string generated for each speaker can express the speech speed through character spacing and width, the loudness of the voice through character thickness, and the pitch of the voice through character density. Looking at the character strings therefore shows at a glance which speaker spoke in what tone, and also makes it possible to infer the atmosphere of the meeting.

【０１４６】 As described above, the speaker information includes not only the information identifying the speaker but also each speaker's average speech speed, average volume, average pitch, and so on. The character generation unit 5 can therefore use these per-speaker averages to normalize the possible range of the parameter expressing speech speed, the possible range of the parameter expressing loudness, and the possible range of the parameter expressing voice pitch, and then determine for each speaker the parameter expressing speech speed, the character thickness as the parameter expressing loudness, and the density as the parameter expressing voice pitch.

【０１４７】 This is because speakers differ in their inherent voice quality: some naturally speak loudly, some naturally speak fast, and some naturally have a high voice. It is also necessary to be able to express speech speed, loudness, and voice pitch on a common basis, using the full possible range of each parameter, without being influenced by each speaker's inherent voice quality.

【０１４８】 For example, suppose a naturally loud speaker speaks at what is, for that speaker, a normal volume, and the character string for the recognition result is generated by applying the loudness parameter to the information obtained from analyzing that speaker's voice as it is. The character string is then rendered in thick characters throughout. Even when that speaker intends to speak softly, the result may be rendered with the same character thickness as another speaker's rather loud voice.

【０１４９】 For example, even when this loud speaker speaks in what is, for that speaker, a soft voice, the table described earlier (see FIG. 12(b)) may cause the characters to be output with a thickness of 1.5t, a thickness that would generally indicate a somewhat loud voice.

【０１５０】 Conversely, even when a naturally soft-spoken speaker speaks in what is, for that speaker, a loud voice, the character string may be rendered in thinner characters than when an average person speaks at a normal volume. For example, even when this soft-spoken speaker speaks loudly by his or her own standard, the table described earlier (see FIG. 12(b)) may output the characters only at a thickness of t/2, a thickness that would generally indicate a somewhat soft voice.

【０１５１】 This does have an advantage that can be exploited: the speaker can be identified from the output character string itself, for example speaker A if the string is thick throughout, or speaker B if it is dark throughout. However, when character strings for several speakers are output, as in the minutes of a meeting, the differences in the speakers' voice quality prevent speech speed, loudness, and voice pitch from being compared on a common basis, and the emotions of the individual speakers in the meeting become hard to read.

【０１５２】 Therefore, each speaker's average speech speed, average volume, and average pitch are placed at or near the center of, respectively, the possible range of the parameter representing character spacing or width (the expressible range of speech speed), the possible range of the parameter representing character thickness (the expressible range of loudness), and the possible range of the parameter representing character density (the expressible range of voice pitch). Normalizing each speaker's expressible ranges of speech speed, loudness, and voice pitch in this way allows each speaker's speech speed, loudness, and voice pitch to be expressed while making the fullest use of each expressible range.
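
The centering described here can be sketched as follows. The text states only that each speaker's average is placed at or near the center of the parameter range; the linear scaling, the `half_span` argument (the deviation from the average that reaches the edge of the range), the function name `normalize_to_range`, and the numeric values are assumptions made for illustration.

```python
def normalize_to_range(value, speaker_average, half_span, out_min, out_max):
    """Map a measured value so the speaker's average lands on the range center.

    A value equal to speaker_average maps to the midpoint of
    [out_min, out_max]; a value half_span above or below the average maps
    to out_max or out_min, and anything further out is clamped.
    """
    if half_span <= 0 or out_max <= out_min:
        raise ValueError("half_span must be positive and out_max > out_min")
    center = (out_min + out_max) / 2.0
    half_range = (out_max - out_min) / 2.0
    # Deviation from this speaker's own average, scaled to [-1.0, 1.0].
    offset = (value - speaker_average) / half_span
    offset = min(max(offset, -1.0), 1.0)
    return center + offset * half_range
```

With a hypothetical thickness range of 0.5 to 2.0, a loud speaker averaging 70 and a soft speaker averaging 50 both map their own average to 1.25, so the same thickness means "normal for this speaker" in both cases.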

【０１５３】 As a result, speech speed, loudness, and voice pitch can be expressed on a common basis, making the fullest use of the possible range of each parameter, without being influenced by each speaker's inherent voice quality, and each speaker's tone and emotions can be expressed more richly.

【０１５４】 The present invention is not limited to the embodiments described above and can be modified in various ways without departing from its gist. For example, each of the embodiments described above acquires the utterance duration, volume, and pitch; generates a parameter expressing speech speed from the utterance duration, a parameter expressing loudness from the volume, and a parameter expressing voice pitch from the pitch; and uses these parameters to generate a character string expressing speech speed, loudness, and voice pitch. Not all of them need be used, however. What to express can be chosen as needed: for example, a character string expressing only speech speed, only loudness, or only voice pitch.

【０１５５】 In the description above, character spacing or width is used as the parameter expressing utterance duration (speech speed), character thickness as the parameter expressing loudness, and character density as the parameter expressing voice pitch. These are only examples; utterance duration (speech speed), loudness, and voice pitch can also be expressed by other parameters. For example, character density may be used to express loudness, and character thickness may be used to express voice pitch.

【０１５６】 A processing program describing the procedure for implementing the present invention described above can be created and recorded on a recording medium such as a floppy disk, an optical disk, or a hard disk, and the present invention also covers a recording medium on which that processing program is recorded. The processing program may also be obtained from a network.

【０１５７】[0157]

【Effects of the Invention】 As described above, the present invention generates at least one of a parameter expressing speech speed from the duration of the voice, a parameter expressing loudness from the volume, and a parameter expressing voice pitch from the pitch, and uses at least one of these parameters to generate a character string expressing at least one of speech speed, loudness, and voice pitch.

【０１５８】 As a result, the tone in which the speaker is speaking can be understood simply by looking at the character string output as the recognition result. In particular, if all of these parameters are used, the speaker's speech speed, loudness, and voice pitch can be read from the output character string, and the speaker's emotions at that moment can also be inferred.

【０１５９】 By applying the parameters expressing speech speed, loudness, and voice pitch to each individual character, changes in speech speed, loudness, and voice pitch can be read character by character when, for example, a single word is uttered. Subtle changes in the speaker's emotions while uttering that word can thus be inferred from the output character string.

【０１６０】 The character-attribute parameters expressing speech speed, loudness, and voice pitch can also be applied to each language constituent unit. For example, when a coherent passage is uttered, changes in speech speed can be read for each language constituent unit contained in the passage. Subtle changes in the speaker's emotions can thus be inferred from the output character string, based on the speech speed, loudness, and voice pitch of each language constituent unit in the passage.

[0161] In addition, speaker identification information for identifying individual speakers is obtained from the analysis result produced by the voice analysis, and this speaker identification information is used to generate, for each speaker, a character string in which at least one of speech speed, voice volume, and voice pitch is expressed.

[0162] For example, the character strings for each speaker's utterances are output in a different color or a different font so that each speaker can be identified, which makes it clear at a glance whose utterance a given character string represents. In addition, each speaker's utterance is output as a character string in which at least one of speech speed, voice volume, and voice pitch is expressed. Thus, just by looking at the output character string, one can tell at a glance which speaker is speaking and in what situation.
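The color coding by speaker could look like the following sketch. The HTML output format, the palette, and the function names are all assumptions made for illustration; the source only states that strings are output in different colors or fonts per speaker.

```python
from html import escape
from itertools import cycle

# Hypothetical palette; colors repeat if there are more speakers than entries.
PALETTE = ["#c0392b", "#2980b9", "#27ae60"]


def assign_colors(speaker_ids):
    """Give each distinct speaker ID a color, in order of first appearance."""
    colors = {}
    palette = cycle(PALETTE)
    for speaker_id in speaker_ids:
        if speaker_id not in colors:
            colors[speaker_id] = next(palette)
    return colors


def render(utterances):
    """utterances: list of (speaker_id, text); returns one HTML span per utterance."""
    colors = assign_colors(speaker_id for speaker_id, _ in utterances)
    return [
        f'<span style="color:{colors[speaker_id]}">{escape(text)}</span>'
        for speaker_id, text in utterances
    ]
```

Per-character spacing, thickness, and density would be layered on top of this, so that color identifies the speaker while the other attributes carry the tone.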

[0163] Further, in addition to the speaker identification information for identifying the speaker, at least one of the speech speed information, volume information, and pitch information specific to each individual speaker can be used to normalize, for each speaker, the possible range of the parameter expressing speech speed, the possible range of the parameter expressing voice volume, and the possible range of the parameter expressing voice pitch. As a result, speech speed, voice volume, and voice pitch can be expressed on the same basis, using the full possible range of each parameter, without being affected by the voice quality each speaker inherently has, so that each speaker's tone and emotion can be expressed more richly.
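The normalization idea can be illustrated with a linear rescaling against each speaker's own range. This is a sketch under stated assumptions: the source describes speaker-specific speed, volume, and pitch information but does not give a formula, so the min/max form and the name `normalize` are hypothetical.

```python
def normalize(value, speaker_min, speaker_max, out_min=0.0, out_max=1.0):
    """Rescale value from a speaker's own range onto the parameter's full range.

    Values outside the speaker's range are clamped to its ends.
    """
    if speaker_max <= speaker_min:
        raise ValueError("speaker_max must be greater than speaker_min")
    clamped = min(max(value, speaker_min), speaker_max)
    ratio = (clamped - speaker_min) / (speaker_max - speaker_min)
    return out_min + ratio * (out_max - out_min)


# Two speakers with different natural pitch ranges (hypothetical values in Hz).
low_voice = normalize(150.0, 100.0, 200.0)   # middle of a low-pitched range
high_voice = normalize(300.0, 200.0, 400.0)  # middle of a high-pitched range
```

Both calls yield the same normalized value, so a mid-range utterance is drawn with the same density for either speaker even though the raw pitches differ.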

[Brief description of drawings]

FIG. 1 is a configuration diagram of a voice recognition result output device according to a first embodiment of the present invention.

FIG. 2(a) is a diagram showing an example of voice input to the voice input unit 1 of FIG. 1, and FIG. 2(b) is a diagram showing the feature vector sequence, volume, and pitch information obtained by analyzing that voice, each associated with its time.

FIG. 3 is a diagram illustrating an example of voice recognition processing in the voice recognition unit 3 using the voice analysis result shown in FIG. 2.

FIG. 4 is a diagram showing an example in which the frame numbers, volume, and pitch obtained from the voice analysis result shown in FIG. 2 are integrated with the recognition result obtained by the voice recognition unit, so that frame number, text, volume, and pitch are associated with one another.

FIG. 5 is a diagram illustrating how the character generation unit 5 acquires, from the time, text, volume, and pitch shown in FIG. 4, the character parameters needed to set character spacing, character width, character thickness, and character density.

FIG. 6 is a diagram illustrating the setting of character spacing and character width for expressing speech speed from the utterance duration; FIG. 6(a) illustrates the process of obtaining the utterance duration, and FIG. 6(b) shows an example of a table from which the character spacing can be looked up using the utterance duration as a key.

FIG. 7 is a diagram showing an example of a character string in which the character spacing is set for each individual character according to the utterance duration obtained for that character.

FIG. 8 is a diagram showing an example of a character string in which the character spacing is set for a whole word according to the utterance duration obtained for that word.

FIG. 9 is a diagram showing an example of a table from which the character width can be looked up using the utterance duration as a key.

FIG. 10 is a diagram showing an example of a character string in which the character width is set for each individual character by using the table of FIG. 9.

FIG. 11 is a diagram showing an example of a character string in which the character width is set for each word by using the table of FIG. 9.

FIG. 12 is a diagram illustrating the setting of character thickness for expressing voice volume from the volume; FIG. 12(a) illustrates the process of obtaining the volume, and FIG. 12(b) shows an example of a table from which the character thickness can be looked up using the volume as a key.

FIG. 13 is a diagram showing an example of a character string in which the character thickness is set for each individual character according to the volume obtained for that character.

FIG. 14 is a diagram showing an example of a character string in which the character thickness is set for a whole word according to the volume obtained for that word.

FIG. 15 is a diagram illustrating the setting of character density for expressing voice pitch from the pitch; FIG. 15(a) illustrates the process of obtaining the pitch, and FIG. 15(b) illustrates the correspondence between the possible range of density and the pitch.

FIG. 16 is a diagram showing an example of a character string in which the character density is set for each individual character according to the pitch obtained for that character.

FIG. 17 is a diagram showing an example of a character string in which the character density is set for a whole word according to the pitch obtained for that word.

FIG. 18 is a diagram showing an example of a character string in which the density is set within each character according to the pitch obtained at each time within that character.

FIG. 19 is a diagram illustrating an example of a character string that expresses utterance duration, volume, and pitch by combining the parameters obtained from them.

FIG. 20 is a configuration diagram of a voice recognition result output device according to a second embodiment of the present invention.

FIG. 21 is a diagram showing an example of the data from the voice analysis unit 2 needed for speaker identification, together with the average speech speed, average volume, and average pitch of each speaker as speaker-specific information.

[Explanation of symbols]

1 Voice input unit
2 Voice analysis unit
3 Voice recognition unit
4 Character parameter output unit
5 Character generation unit
6 Character output unit
7 Speaker identification unit
t0, t1, ..., tn Frame numbers
v0, v1, ..., vn Volume information
p0, p1, ..., pn Pitch information
T1, T2, ..., T5 Stages of the obtained utterance duration
d Character spacing
w Character width
V1, V2, ..., V5 Stages of the obtained volume information
b Character thickness
P1, P2, ... Obtained pitch information
D0, D1, ..., Dn Density

Continued from front page. (72) Inventor: Hiroshi Hasegawa, c/o Seiko Epson Corporation, 3-3-5 Owa, Suwa-shi, Nagano. F-terms (reference): 5D015 AA06 CC12 CC14 CC15 FF03 HH00 LL05

Claims

[Claims]

1. A voice recognition result output method for analyzing input voice, performing voice recognition using the voice analysis result, and outputting the recognition result, wherein at least one of utterance duration information, volume information, and pitch information is acquired from the voice analysis result; at least one of a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information is generated; and a character string in which at least one of the speech speed, the voice volume, and the voice pitch is expressed is generated by the at least one parameter.

2. The voice recognition result output method according to claim 1, wherein, when the character string expressing the speech speed is generated, the parameter expressing the speech speed is reflected on each character based on utterance duration information obtained for each individual character constituting the character string.

3. The voice recognition result output method according to claim 1, wherein, when the character string expressing the speech speed is generated, the parameter expressing the speech speed is reflected on each language constituent unit based on utterance duration information obtained for each language constituent unit making up the character string, such as a word, a clause, consecutive clauses, a sentence, or a plurality of sentences.

4. The voice recognition result output method according to claim 1, wherein the parameter expressing the speech speed generated from the utterance duration information is at least one of information for setting the character spacing of the characters forming the character string and information for setting the length of each character in the arrangement direction of the character string.

5. The voice recognition result output method according to claim 1, wherein, when the character string expressing the voice volume is generated, the parameter expressing the voice volume is reflected on each character based on volume information obtained for each individual character constituting the character string.

6. The voice recognition result output method according to claim 1, wherein, when the character string expressing the voice volume is generated, the parameter expressing the voice volume is reflected on each language constituent unit based on volume information obtained for each language constituent unit making up the character string.

7. The voice recognition result output method according to claim 5 or 6, wherein the parameter expressing the voice volume generated from the volume information is information for setting the thickness of the characters forming the character string.

8. The voice recognition result output method according to claim 5 or 6, wherein the parameter expressing the voice volume generated from the volume information is information for setting the density of the characters forming the character string.

9. The voice recognition result output method according to claim 1, wherein, when the character string expressing the voice pitch is generated, the parameter expressing the voice pitch is reflected on each character based on pitch information obtained for each individual character constituting the character string.

10. The voice recognition result output method according to claim 1, wherein, when the character string expressing the voice pitch is generated, the parameter expressing the voice pitch is reflected on each language constituent unit based on pitch information obtained for each language constituent unit making up the character string.

11. The parameter representing the pitch of a voice generated from the pitch information is information for setting the density of characters forming the character string. The voice recognition result output method according to any one of 1.

12. The parameter expressing the pitch of a voice generated from the pitch information is information for setting the thickness of characters forming the character string. The voice recognition result output method according to any one of 1.

13. The voice recognition result output method according to claim 1, wherein speaker identification information for identifying individual speakers is obtained from data obtained from the voice analysis result, and the speaker identification information is used to generate, for each speaker, a character string in which at least one of the speech speed, the voice volume, and the voice pitch is expressed.

14. The voice recognition result output method wherein, in addition to the speaker identification information for identifying the speaker, at least one of speaker-specific speech speed information, volume information, and pitch information held by each speaker is used to normalize the possible range of the parameter expressing speech speed, the possible range of the parameter expressing voice volume, and the possible range of the parameter expressing voice pitch.

15. A voice recognition result output device that analyzes input voice, performs voice recognition using the voice analysis result, and outputs the recognition result, comprising: character parameter output means for associating, by time information, at least one of volume information and pitch information, each associated with the time information obtained from the voice analysis result, with text obtained as the recognition result, and outputting the associated information as character parameters; and character generation means for obtaining utterance duration information from the time information obtained from the character parameter output means, generating at least one of a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information, and generating, by the at least one parameter, a character string in which at least one of the speech speed, the voice volume, and the voice pitch is expressed.

16. The voice recognition result output device according to claim 15, wherein, when generating the character string expressing the speech speed, the character generation means reflects the parameter expressing the speech speed on each character based on utterance duration information obtained for each individual character constituting the character string.

17. The voice recognition result output device according to claim 15, wherein, when generating the character string expressing the speech speed, the character generation means reflects the parameter expressing the speech speed on each language constituent unit based on utterance duration information obtained for each language constituent unit making up the character string.

18. The voice recognition result output device according to claim 15, wherein the parameter expressing the speech speed generated from the utterance duration information is at least one of information for setting the character spacing of the characters forming the character string and information for setting the length of each character in the arrangement direction of the character string.

19. The voice recognition result output device according to claim 15, wherein, when generating the character string expressing the voice volume, the character generation means reflects the parameter expressing the voice volume on each character based on volume information obtained for each individual character constituting the character string.

20. The voice recognition result output device according to claim 15, wherein, when generating the character string expressing the voice volume, the character generation means reflects the parameter expressing the voice volume on each language constituent unit based on volume information obtained for each language constituent unit making up the character string.

21. The voice recognition result output device according to any one of claims 15, 19, and 20, wherein the parameter expressing the voice volume generated from the volume information is information for setting the thickness of the characters forming the character string.

22. The voice recognition result output device according to any one of claims 15, 19, and 20, wherein the parameter expressing the voice volume generated from the volume information is information for setting the density of the characters forming the character string.

23. The voice recognition result output device according to claim 15, wherein, when generating the character string expressing the voice pitch, the character generation means reflects the parameter expressing the voice pitch on each character based on pitch information obtained for each individual character constituting the character string.

24. The voice recognition result output device according to claim 15, wherein, when generating the character string expressing the voice pitch, the character generation means reflects the parameter expressing the voice pitch on each language constituent unit based on pitch information obtained for each language constituent unit making up the character string.

25. The voice recognition result output device according to any one of claims 15, 23, and 24, wherein the parameter expressing the voice pitch generated from the pitch information is information for setting the density of the characters forming the character string.

26. The voice recognition result output device according to any one of claims 15, 23, and 24, wherein the parameter expressing the voice pitch generated from the pitch information is information for setting the thickness of the characters forming the character string.

27. The voice recognition result output device, having a speaker identification unit that identifies individual speakers from data obtained from the voice analysis result, wherein the speaker identification information obtained by the speaker identification unit is used to generate, for each speaker, a character string in which at least one of the speech speed, the voice volume, and the voice pitch is expressed.

28. The voice recognition result output device wherein, in addition to the speaker identification information for identifying the speaker, at least one of speaker-specific speech speed information, volume information, and pitch information held by each speaker is used to normalize the possible range of the parameter expressing speech speed, the possible range of the parameter expressing voice volume, and the possible range of the parameter expressing voice pitch.

29. A recording medium recording a voice recognition result output processing program for analyzing input voice, performing voice recognition using the voice analysis result, and outputting the recognition result, wherein the processing program includes: a procedure for acquiring at least one of utterance duration information, volume information, and pitch information from the voice analysis result; a procedure for generating at least one of a parameter expressing speech speed from the utterance duration information, a parameter expressing voice volume from the volume information, and a parameter expressing voice pitch from the pitch information; and a procedure for generating, by the at least one parameter, a character string in which at least one of the speech speed, the voice volume, and the voice pitch is expressed.

30. The recording medium recording the voice recognition result output processing program according to claim 29, wherein, when the character string expressing the speech speed is generated, the parameter expressing the speech speed is reflected on each character based on utterance duration information obtained for each individual character constituting the character string.

31. The recording medium recording the voice recognition result output processing program according to claim 29, wherein, when the character string expressing the speech speed is generated, the parameter expressing the speech speed is reflected on each language constituent unit based on utterance duration information obtained for each language constituent unit making up the character string.

32. The character attribute parameter expressing the speech speed generated from the utterance duration information is information for setting a character interval between characters forming the character string and individual character lengths in a character string arrangement direction. 32. The information is at least one of information for setting
A recording medium recording the voice recognition result output processing program according to any one of 1.

33. The recording medium recording the voice recognition result output processing program according to claim 29, wherein, when the character string expressing the voice volume is generated, the parameter expressing the voice volume is reflected on each character based on volume information obtained for each individual character constituting the character string.

34. The recording medium recording the voice recognition result output processing program according to claim 29, wherein, when the character string expressing the voice volume is generated, the parameter expressing the voice volume is reflected on each language constituent unit based on volume information obtained for each language constituent unit making up the character string.

35. The recording medium recording the voice recognition result output processing program according to any one of claims 29, 33, and 34, wherein the parameter expressing the voice volume generated from the volume information is information for setting the thickness of the characters forming the character string.

36. The recording medium recording the voice recognition result output processing program according to any one of claims 29, 33, and 34, wherein the parameter expressing the voice volume generated from the volume information is information for setting the density of the characters forming the character string.

37. The recording medium recording the voice recognition result output processing program according to claim 29, wherein, when the character string expressing the voice pitch is generated, the parameter expressing the voice pitch is reflected on each character based on pitch information obtained for each individual character constituting the character string.

38. The recording medium recording the voice recognition result output processing program according to claim 29, wherein, when the character string expressing the voice pitch is generated, the parameter expressing the voice pitch is reflected on each language constituent unit based on pitch information obtained for each language constituent unit making up the character string.

39. The recording medium recording the voice recognition result output processing program according to any one of claims 29, 37, and 38, wherein the parameter expressing the voice pitch generated from the pitch information is information for setting the density of the characters forming the character string.

40. The recording medium recording the voice recognition result output processing program according to any one of claims 29, 37, and 38, wherein the parameter expressing the voice pitch generated from the pitch information is information for setting the thickness of the characters forming the character string.

41. The recording medium recording the voice recognition result output processing program according to claim 29, wherein speaker identification information for identifying individual speakers is obtained from data obtained from the voice analysis result, and the speaker identification information is used to generate, for each speaker, a character string in which at least one of the speech speed, the voice volume, and the voice pitch is expressed.

42. The recording medium recording the voice recognition result output processing program wherein, in addition to the speaker identification information for identifying the speaker, at least one of speaker-specific speech speed information, volume information, and pitch information held by each speaker is used to normalize the possible range of the parameter expressing speech speed, the possible range of the parameter expressing voice volume, and the possible range of the parameter expressing voice pitch.