JP6995907B2

JP6995907B2 - Speech processing device, speech processing method, and program

Info

Publication number: JP6995907B2
Application number: JP2020039595A
Authority: JP
Inventors: 雅裕山本
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2020-03-09
Filing date: 2020-03-09
Publication date: 2022-01-17
Anticipated expiration: 2037-03-22
Also published as: JP2020098367A

Description

Embodiments of the present invention relate to a speech processing device, a speech processing method, and a program.

Conveying messages appropriately in everyday environments is very important. In particular, alerts and danger notifications in car navigation, and messages in emergency disaster broadcasts that must be heard without being buried in ambient sound, need to be delivered reliably, given the actions that must follow.

Commonly used methods for giving alerts and danger notifications in car navigation include light stimuli and the addition of a buzzer sound.

Japanese Unexamined Patent Application Publication No. 2007-019980

However, the conventional techniques call attention by adding stimuli to the normal voice guidance, which startles users such as drivers at the moment of the alert. A startled user tends to act more slowly, so a stimulus that was meant to prompt smooth danger-avoidance behavior may instead hinder that behavior.

The speech processing device of the embodiment includes an identification unit and a modulation unit. The identification unit identifies, based on an attribute of the speech, one or more of the one or more speech segments contained in the speech to be output as an emphasized portion. The modulation unit modulates the emphasized portion of at least one of a first speech and a second speech so that at least one of pitch and phase differs between the emphasized portion of the first speech, which is output by a first output unit, and the emphasized portion of the second speech, which is output by a second output unit.

FIG. 1 is a block diagram of the speech processing device according to the first embodiment.
FIG. 2 is a diagram showing an example of the speaker arrangement of the embodiment.
FIG. 3 is a diagram showing an example of measurement results.
FIG. 4 is a diagram showing another example of the speaker arrangement of the embodiment.
FIG. 5 is a diagram showing another example of the speaker arrangement of the embodiment.
FIG. 6 is a diagram for explaining pitch modulation and phase modulation.
FIG. 7 is a diagram showing the relationship between the phase difference (degrees) and the sound pressure of the background sound (dB).
FIG. 8 is a diagram showing the relationship between the frequency difference (Hz) and the sound pressure of the background sound (dB).
FIG. 9 is a flowchart of speech output processing in the first embodiment.
FIG. 10 is a block diagram of the speech processing device according to the second embodiment.
FIG. 11 is a flowchart of speech output processing in the second embodiment.
FIG. 12 is a block diagram of the speech processing device according to the third embodiment.
FIG. 13 is a flowchart of speech output processing in the third embodiment.
FIG. 14 is a block diagram of the speech processing device according to the fourth embodiment.
FIG. 15 is a diagram showing an example of the structure of the data stored in the storage unit.
FIG. 16 is a flowchart of speech output processing in the fourth embodiment.
FIG. 17 is a diagram showing an example of a designation screen for specifying the part to be learned.
FIG. 18 is a diagram showing an example of a learning screen.
FIG. 19 is a diagram showing another example of a learning screen.
FIG. 20 is a diagram showing another example of a learning screen.
FIG. 21 is a diagram showing another example of a learning screen.
FIG. 22 is a hardware configuration diagram of the speech processing device according to the embodiments.

Preferred embodiments of the speech processing device according to the present invention are described in detail below with reference to the accompanying drawings.

In the inventor's experiments, it was confirmed that when a listener hears, from each of a plurality of audio output devices (speakers, headphones, etc.), speech that differs in at least one of pitch and phase, perceived clarity increases regardless of the physical loudness of the speech, and the level of attention rises. Almost no sense of surprise is observed in this case.

Conventionally, it has been believed that hearing speech that differs in pitch or phase from each of a plurality of audio output devices reduces clarity and therefore degrades intelligibility. However, as noted above, the inventor's experiments confirmed that clarity and the level of attention both increase when speech differing in at least one of pitch and phase is heard by the left and right ears.

This indicates that the auditory system uses both ears to try to perceive speech more clearly, and it is a new finding. The following embodiments are based on this finding: they enable alerts and danger notifications by exploiting the heightened perception produced by speech that differs in at least one of pitch and phase between the left and right ears.

(First Embodiment)
The speech processing device according to the first embodiment modulates at least one of the pitch and the phase of the speech corresponding to the emphasized portion and outputs the modulated speech. This makes it possible to increase the user's attention, and to have the user carry out the next action smoothly, without changing the intensity of the speech signal.

FIG. 1 is a block diagram showing an example of the configuration of a speech processing device 100 according to the first embodiment. As shown in FIG. 1, the speech processing device 100 includes a storage unit 121, a reception unit 101, an identification unit 102, a modulation unit 103, an output control unit 104, and speakers 105-1 to 105-n (where n is an integer of 2 or more).

The storage unit 121 stores various data used by the speech processing device 100. For example, the storage unit 121 stores input text data and data indicating the emphasized portions identified in the text data. The storage unit 121 can be implemented with any commonly used storage medium, such as an HDD (Hard Disk Drive), an SSD (Solid State Drive), an optical disc, a memory card, or RAM (Random Access Memory).

The speakers 105-1 to 105-n are output units that output speech in accordance with instructions from the output control unit 104. Because the speakers 105-1 to 105-n have the same configuration, they are referred to simply as speakers 105 when there is no need to distinguish them. The following description takes as an example the case in which at least one of pitch and phase is modulated between the speech output to a pair of two speakers, the speaker 105-1 (first output unit) and the speaker 105-2 (second output unit). The same processing may be applied to two or more pairs.

The reception unit 101 receives various data to be processed. For example, the reception unit 101 receives input of text data that is to be converted into speech and output.

The identification unit 102 identifies the emphasized portion, that is, the portion of the speech to be output that is to be output with emphasis. The emphasized portion is the portion that is output with at least one of pitch and phase modulated in order to give an alert, a danger notification, or the like. For example, the identification unit 102 identifies the emphasized portion from the input text data. If information for identifying the emphasized portion has been added to the input text data in advance, the identification unit 102 can identify the emphasized portion by referring to that added information (additional information). The identification unit 102 may instead identify the emphasized portion by matching the text data against predetermined data indicating emphasized portions. The identification unit 102 may also perform both identification by additional information and identification by data matching. The data indicating emphasized portions may be stored in the storage unit 121 or in a storage device external to the speech processing device 100.

The identification unit 102 may perform an encoding process that adds to the text data information (additional information) indicating that the identified emphasized portion is to be emphasized. The downstream modulation unit 103 can then refer to this additional information to determine which portion to modulate. The additional information may take any form as long as it allows the emphasized portion to be determined. The identification unit 102 may also save the encoded text data to a storage medium such as the storage unit 121. This allows subsequent speech output processing to use text data to which the additional information has already been added.
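The matching and encoding steps described above can be sketched as follows. This is only an illustration: the phrase list, the tag strings, and the function name tag_emphasis are hypothetical, since the description leaves the form of the additional information open.

```python
import re

# Hypothetical values: the description does not fix a phrase list or a tag format.
EMPHASIS_PHRASES = ["danger", "stop immediately"]
START_TAG, END_TAG = "<emph>", "</emph>"

def tag_emphasis(text, phrases=EMPHASIS_PHRASES):
    """Wrap each occurrence of a listed phrase in start/end markers."""
    if START_TAG in text or not phrases:
        # Additional information is already present (or there is nothing to match).
        return text
    # Try longer phrases first so overlapping entries prefer the longer match.
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(p) for p in ordered), re.IGNORECASE)
    return pattern.sub(lambda m: START_TAG + m.group(0) + END_TAG, text)

tagged = tag_emphasis("Danger ahead, stop immediately.")
```

Text that already carries markers is returned unchanged, which mirrors the case in which the additional information was added in advance.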

The modulation unit 103 modulates the modulation target, which is at least one of the pitch and the phase of the speech to be output. For example, the modulation unit 103 modulates the modulation target of the emphasized portion of at least one of the two speech signals so that the modulation target differs between the emphasized portion of the speech output to the speaker 105-1 (first speech) and the emphasized portion of the speech output to the speaker 105-2 (second speech).

In this embodiment, when generating speech from the text data, the modulation unit 103 successively determines whether each piece of text data is an emphasized portion and performs modulation on the emphasized portions. That is, when converting the text data to generate the speech to be output to the speaker 105-1 (first speech) and the speech to be output to the speaker 105-2 (second speech), the modulation unit 103 generates, for the text data of an emphasized portion, a first speech and a second speech in which the modulation target of at least one has been modulated so that the modulation targets differ from each other.

The process of converting text data into speech (speech synthesis) can use any conventional method, such as formant synthesis or corpus-based speech synthesis.

When modulating phase, the modulation unit 103 may invert the polarity of the signal input to one of the speakers 105-1 and 105-2. One speaker 105 then operates in antiphase relative to the other, which achieves the same function as modulating the phase of the speech data.
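As a minimal sketch of the polarity inversion described above, the following compares a sign-flipped test tone with the same tone generated with a 180° phase offset. The helper names, the sample rate, and the use of a pure sine tone are assumptions made for illustration; real speech contains many frequency components, and inverting polarity shifts all of them by 180° at once.

```python
import math

def tone(freq_hz, n, rate=8000, phase_deg=0.0):
    """Generate n samples of a sine tone (rate is an assumed sample rate)."""
    phase = math.radians(phase_deg)
    return [math.sin(2 * math.pi * freq_hz * i / rate + phase) for i in range(n)]

def invert_polarity(samples):
    """Flip the sign of every sample, as if the speaker leads were swapped."""
    return [-s for s in samples]

first = tone(440.0, 64)                      # signal for speaker 105-1
second = invert_polarity(first)              # signal for speaker 105-2
shifted = tone(440.0, 64, phase_deg=180.0)   # same tone with a 180 degree offset
```

For this tone, second and shifted agree to within floating-point error.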

The modulation unit 103 may check the integrity of the data to be processed and perform modulation only when integrity is confirmed. For example, if the additional information added to the text data takes the form of information indicating the start of an emphasized portion and information indicating its end, the modulation unit 103 may perform modulation only when it confirms that each start indication has a corresponding end indication.
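One possible form of this integrity check is sketched below, assuming the additional information is a pair of hypothetical start and end tags and that emphasized portions do not nest. Neither assumption comes from the description.

```python
import re

# Hypothetical tag strings; the description does not fix a format.
START_TAG, END_TAG = "<emph>", "</emph>"

def tags_are_balanced(text):
    """Return True if every start tag is closed by an end tag, without nesting."""
    is_open = False
    for tag in re.findall(r"</?emph>", text):
        if tag == START_TAG:
            if is_open:          # a second start before the previous one closed
                return False
            is_open = True
        else:
            if not is_open:      # an end tag with no matching start
                return False
            is_open = False
    return not is_open           # nothing may be left open at the end
```

A caller would run the modulation step only when tags_are_balanced returns True.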

The output control unit 104 controls the output of speech from the speakers 105. For example, the output control unit 104 causes the speaker 105-1 to output the first speech, whose modulation target has been modulated, and causes the speaker 105-2 to output the second speech. If speakers 105 other than the speakers 105-1 and 105-2 are provided, the output control unit 104 assigns the most suitable speech to each speaker 105 and causes it to be output. Each speaker 105 outputs speech based on the output data from the output control unit 104.

The output control unit 104 calculates the output to each speaker 105 (amplifier output) using parameters such as the position and characteristics of the speakers 105. These parameters are stored, for example, in the storage unit 121.

For example, to equalize the required sound pressure from two speakers 105, the amplifier outputs W1 and W2 to the speakers are calculated as follows. Let L1 and L2 be the distances of the two speakers. L1 (L2) is, for example, the distance between the speaker 105-1 (speaker 105-2) and the center of the head. The distance from each speaker 105 to the nearest ear may be used instead. Let Gs1 (Gs2) be the gain of the speaker 105-1 (speaker 105-2) in the audible range of the speech used. Assume that doubling the distance lowers the sound pressure by 6 dB and that raising the sound pressure by 3 dB requires doubling the amplifier output. To equalize the sound pressure at both ears, the output control unit 104 calculates and determines the amplifier outputs W1 and W2 so that the following equation holds.
-6 × (L1 / L2) × (1/2) + (2/3) × Gs1 × W1 =
-6 × (L2 / L1) × (1/2) + (2/3) × Gs2 × W2
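The balance equation above can be rearranged to give W2 once W1 is chosen. The sketch below does only that rearrangement, taking the equation exactly as written; the function names and the example values are hypothetical, and units are left as in the description.

```python
def balance_side(l_self, l_other, gs, w):
    """One side of the balance equation: -6 * (l_self / l_other) * (1/2) + (2/3) * gs * w."""
    return -6.0 * (l_self / l_other) * 0.5 + (2.0 / 3.0) * gs * w

def solve_w2(l1, l2, gs1, gs2, w1):
    """Solve the balance equation for W2, given W1 and the speaker parameters."""
    left = balance_side(l1, l2, gs1, w1)
    # left = -6 * (l2 / l1) * (1/2) + (2/3) * gs2 * w2, solved for w2
    return (left + 6.0 * (l2 / l1) * 0.5) / ((2.0 / 3.0) * gs2)

# Example values (hypothetical): speaker 2 is twice as far away as speaker 1.
w2 = solve_w2(l1=1.0, l2=2.0, gs1=1.0, gs2=1.0, w1=10.0)
```

When both speakers are at the same distance and have the same gain, the function returns W2 equal to W1, as expected from the symmetry of the equation.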

The reception unit 101, the identification unit 102, the modulation unit 103, and the output control unit 104 may be implemented in software, that is, by having one or more processors such as a CPU (Central Processing Unit) execute a program; in hardware, that is, by one or more processors such as an IC (Integrated Circuit); or by a combination of software and hardware.

FIG. 2 is a diagram showing an example of the arrangement of the speakers 105 in this embodiment. FIG. 2 shows an example arrangement as seen looking down from vertically above the user 205. Speech modulated by the modulation unit 103 is played from the speakers 105-1 and 105-2. The speaker 105-1 is placed on the line extending from the right ear of the user 205. The speaker 105-2 can be placed at an angle to the line passing through the speaker 105-1 and the right ear.

The inventor moved the speaker 105-2 along curve 203 or curve 204, measured attention while pitch- and phase-modulated speech was output, and confirmed an increase in attention in both cases. Attention was measured using evaluation criteria such as EEG (Electroencephalogram), NIRS (Near-Infrared Spectroscopy), and subjective evaluation.

FIG. 3 is a diagram showing an example of the measurement results. The horizontal axis of the graph in FIG. 3 represents the arrangement angle of the speakers 105. The arrangement angle is, for example, the angle between the line connecting the speaker 105-1 and the user 205 and the line connecting the speaker 105-2 and the user 205. As shown in FIG. 3, the increase in attention is larger when the arrangement angle is between 90° and 180°. It is therefore desirable to arrange the speakers 105-1 and 105-2 so that the arrangement angle is between 90° and 180°. Because attention is still detected at smaller angles, the arrangement angle may be less than 90° as long as it is greater than 0°.
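The arrangement angle defined above can be computed from positions in the top-down view of FIG. 2. The following is a sketch under the assumption that the user and both speakers are given as (x, y) coordinates in a common plane; the function names are hypothetical.

```python
import math

def arrangement_angle(user, spk1, spk2):
    """Angle in degrees (0 to 180) between the user-to-spk1 and user-to-spk2 lines."""
    a1 = math.atan2(spk1[1] - user[1], spk1[0] - user[0])
    a2 = math.atan2(spk2[1] - user[1], spk2[0] - user[0])
    diff = abs(math.degrees(a1 - a2)) % 360.0
    return min(diff, 360.0 - diff)

def in_preferred_range(angle_deg):
    """True for the 90 to 180 degree range that the description calls desirable."""
    return 90.0 <= angle_deg <= 180.0

# Speakers on opposite sides of the user, as with headphones.
angle = arrangement_angle(user=(0.0, 0.0), spk1=(1.0, 0.0), spk2=(-1.0, 0.0))
```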

The pitch or phase could be modulated over the entire duration of the speech, but in that case attention may decline because of habituation or similar effects. The modulation unit 103 therefore modulates only the emphasized portions identified by the additional information or the like. This makes it possible to raise attention to the emphasized portions more effectively.

FIG. 4 is a diagram showing another example of the arrangement of the speakers 105 in this embodiment. FIG. 4 shows an example arrangement of speakers 105 installed, for example, to output outdoor public-address broadcasts. As shown in FIG. 3, it is desirable to use a pair of speakers 105 with an arrangement angle between 90° and 180°. In the example of FIG. 4, therefore, speech modulation is performed for the pair of speakers 105-1 and 105-2, which are arranged at an angle of 180°.

FIG. 5 is a diagram showing another example of the arrangement of the speakers 105 in this embodiment. FIG. 5 shows an example in which the speakers 105-1 and 105-2 are configured as headphones.

The arrangement of the speakers 105 is not limited to the examples of FIGS. 2, 4, and 5. Any combination of speakers may be used as long as they are arranged at an angle at which the attention effect shown in FIG. 3 is obtained. For example, this embodiment may be applied to the multiple speakers used for car navigation.

Next, pitch modulation and phase modulation are described. FIG. 6 is a diagram for explaining pitch modulation and phase modulation. Phase modulation outputs a signal 603 that, relative to the original signal 601, keeps the same speech envelope 604 and the same number of waves per unit time but shifts the time positions of the peaks. Pitch modulation outputs a signal 602 in which the number of waves is changed.
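The difference between the two kinds of modulation can be illustrated with a toy signal: a sine carrier shaped by a fixed envelope. This is a simplified sketch, not the synthesis method of the embodiment; the sample rate, the half-sine envelope, the frequencies, and all function names are assumptions.

```python
import math

RATE = 8000  # samples per second (assumed for illustration)

def envelope(i, n):
    """Half-sine amplitude envelope, standing in for envelope 604."""
    return math.sin(math.pi * i / n)

def render(freq_hz, phase_deg=0.0, n=800):
    """Sine carrier at freq_hz with a phase offset, shaped by the shared envelope."""
    phase = math.radians(phase_deg)
    return [
        envelope(i, n) * math.sin(2 * math.pi * freq_hz * i / RATE + phase)
        for i in range(n)
    ]

def peak_shift_seconds(freq_hz, phase_deg):
    """How much earlier each carrier peak occurs after a phase advance."""
    return (phase_deg / 360.0) / freq_hz

original = render(200.0)                         # like signal 601
pitch_modulated = render(300.0)                  # like signal 602: more waves per unit time
phase_modulated = render(200.0, phase_deg=90.0)  # like signal 603: same frequency, shifted peaks
```

All three signals stay inside the same envelope. Only the carrier frequency (pitch modulation) or the peak positions (phase modulation) change.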

Next, the relationship between pitch or phase modulation and the intelligibility of speech is described. FIG. 7 is a diagram showing the relationship between the phase difference (degrees) and the sound pressure of the background sound (dB). The phase difference is the difference in phase between the speech output from two speakers 105 (for example, between the phase of the speech output from the speaker 105-1 and the phase of the speech output from the speaker 105-2). The sound pressure of the background sound is the maximum background sound pressure at which the user can still hear the output speech (the limit sound pressure).

The background sound is any sound other than the speech output from the speakers 105. For example, ambient noise and sounds such as music that are output in addition to the speech correspond to background sound. The points shown as rectangles in FIG. 7 represent the mean of the obtained values. The range shown by the lines above and below each point represents the standard deviation of the obtained values.

As shown in FIG. 7, even when background sound of 0.5 dB or more is present, the user can hear the speech output from the speakers 105 if the phase difference is between 60° and 180° inclusive. The modulation unit 103 may therefore perform modulation so that the phase difference is between 60° and 180° inclusive. The modulation unit 103 may also perform modulation so that the phase difference is between 90° and 180° inclusive, or between 120° and 180° inclusive, where the limit sound pressure is higher.

FIG. 8 is a diagram showing the relationship between the frequency difference (Hz) and the sound pressure of the background sound (dB). The frequency difference is the difference in frequency between the speech output from two speakers 105 (for example, between the frequency of the speech output from the speaker 105-1 and the frequency of the speech output from the speaker 105-2). The points shown as rectangles in FIG. 8 represent the mean of the obtained values. Of the values "A, B" shown beside each point, A is the frequency difference and B is the sound pressure of the background sound.

As shown in FIG. 8, even when background sound is present, the user can hear the speech output from the speakers 105 if the frequency difference is 100 Hz or more. The modulation unit 103 may therefore perform modulation so that the frequency difference is 100 Hz or more, within the audible range.
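The ranges stated for FIG. 7 and FIG. 8 can be expressed as simple parameter checks. In the sketch below the thresholds (60° to 180°, 100 Hz or more) come from the description, while the function names and the nominal 20 Hz to 20 kHz audible range are assumptions.

```python
# Nominal range of human hearing; an assumption, not a value from the description.
AUDIBLE_HZ = (20.0, 20000.0)

def phase_difference_ok(phase_diff_deg, minimum_deg=60.0):
    """True if the phase difference lies between minimum_deg and 180 degrees inclusive."""
    return minimum_deg <= phase_diff_deg <= 180.0

def frequency_pair_ok(f1_hz, f2_hz, min_diff_hz=100.0):
    """True if both frequencies are audible and differ by at least min_diff_hz."""
    low, high = AUDIBLE_HZ
    audible = low <= f1_hz <= high and low <= f2_hz <= high
    return audible and abs(f1_hz - f2_hz) >= min_diff_hz
```

Passing minimum_deg=90.0 or minimum_deg=120.0 selects the narrower ranges that the description associates with a higher limit sound pressure.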

Next, speech output processing by the speech processing device 100 according to the first embodiment, configured as described above, is described with reference to FIG. 9. FIG. 9 is a flowchart showing an example of speech output processing in the first embodiment.

The reception unit 101 receives input of text data (step S101). The identification unit 102 determines whether additional information has been added to the text data (step S102). If not (step S102: No), the identification unit 102 identifies the emphasized portion from the text data (step S103). For example, the identification unit 102 identifies the emphasized portion by matching the input text data against predetermined data indicating emphasized portions. The identification unit 102 adds additional information indicating the emphasized portion to the corresponding emphasized portion of the text data (step S104). Any method of adding the additional information may be used as long as it allows the modulation unit 103 to identify the emphasized portion.

After the additional information has been added (step S104), or if additional information was already added to the text data (step S102: Yes), the modulation unit 103 generates speech corresponding to the text data (first speech and second speech) in which, for the text data of the emphasized portion, the modulation target is modulated so that it differs between the two (step S105).

The output control unit 104 determines the speech to be output from each speaker 105 and causes the determined speech to be output (step S106). Each speaker 105 outputs speech in accordance with the instructions of the output control unit 104.
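The flow of steps S101 to S106 can be sketched as a small pipeline. The sketch stops short of actual speech synthesis: it only decides, segment by segment, which phase offset each of the two outputs would be synthesized with. The tag format, the phrase list, the 180° offset, and all function names are hypothetical.

```python
import re

START_TAG, END_TAG = "<emph>", "</emph>"   # hypothetical additional-information format
PHRASES = ["danger"]                        # hypothetical data indicating emphasized portions

def identify_and_tag(text):
    """Steps S102 to S104: tag emphasized portions unless tags are already present."""
    if START_TAG in text:
        return text
    for phrase in PHRASES:
        text = re.sub(re.escape(phrase),
                      lambda m: START_TAG + m.group(0) + END_TAG,
                      text, flags=re.IGNORECASE)
    return text

def split_segments(tagged):
    """Split tagged text into (segment, is_emphasized) pairs, in order."""
    segments = []
    for part in re.split(r"(<emph>.*?</emph>)", tagged):
        if not part:
            continue
        if part.startswith(START_TAG):
            segments.append((part[len(START_TAG):-len(END_TAG)], True))
        else:
            segments.append((part, False))
    return segments

def plan_outputs(text, phase_deg=180.0):
    """Step S105: per segment, choose synthesis settings for the first and second speech."""
    plan = []
    for segment, emphasized in split_segments(identify_and_tag(text)):
        first = {"text": segment, "phase_deg": 0.0}
        second = {"text": segment, "phase_deg": phase_deg if emphasized else 0.0}
        plan.append((first, second))
    return plan

plan = plan_outputs("Danger ahead")   # step S106 would send each pair to speakers 105-1 and 105-2
```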

In this way, the speech processing device according to the first embodiment generates speech corresponding to the text data while modulating at least one of the pitch and the phase of the speech for the text data corresponding to the emphasized portion, and outputs the modulated speech. This makes it possible to increase the user's attention without changing the intensity of the speech signal.

(Second Embodiment)
In the first embodiment, modulation was applied to the text data of the emphasized portion as the text data was successively converted into speech. The speech processing device according to the second embodiment first generates speech for the text data and then applies modulation to the part of the generated speech that corresponds to the emphasized portion.

FIG. 10 is a block diagram showing an example of the configuration of a speech processing device 100-2 according to the second embodiment. As shown in FIG. 10, the speech processing device 100-2 includes a storage unit 121, a reception unit 101, an identification unit 102, a modulation unit 103-2, an output control unit 104, speakers 105-1 to 105-n, and a generation unit 106-2.

The second embodiment differs from the first in the function of the modulation unit 103-2 and in the addition of the generation unit 106-2. The other components and functions are the same as in FIG. 1, the block diagram of the speech processing device 100 according to the first embodiment, so they are given the same reference numerals and their description is omitted here.

生成部１０６－２は、テキストデータに対応する音声を生成する。例えば生成部１０６－２は、入力されたテキストデータを、スピーカ１０５－１に出力する音声（第１音声）およびスピーカ１０５－２に出力する音声（第２音声）に変換する。 The generation unit 106-2 generates the voice corresponding to the text data. For example, the generation unit 106-2 converts the input text data into a voice output to the speaker 105-1 (first voice) and a voice output to the speaker 105-2 (second voice).

変調部１０３－２は、生成部１０６－２により生成された音声のうち、強調部分の音声に対して変調処理を行う。例えば変調部１０３－２は、生成された第１音声の強調部分と生成された第２音声の強調部分との間で変調対象が異なるように、第１音声および第２音声の少なくとも一方の強調部分の変調対象を変調する。 The modulation unit 103-2 performs modulation processing on the voice of the emphasized portion among the voices generated by the generation unit 106-2. For example, the modulation unit 103-2 emphasizes at least one of the first voice and the second voice so that the modulation target differs between the generated first voice enhancement portion and the generated second voice enhancement portion. Modulate the modulation target of the part.

次に、このように構成された第２の実施形態にかかる音声処理装置１００－２による音声出力処理について図１１を用いて説明する。図１１は、第２の実施形態における音声出力処理の一例を示すフローチャートである。 Next, the voice output processing by the voice processing device 100-2 according to the second embodiment configured in this way will be described with reference to FIG. FIG. 11 is a flowchart showing an example of audio output processing according to the second embodiment.

ステップＳ２０１からステップＳ２０４までは、第１の実施形態にかかる音声処理装置１００におけるステップＳ１０１からステップＳ１０４までと同様の処理なので、その説明を省略する。 Since steps S201 to S204 are the same processes as steps S101 to S104 in the voice processing device 100 according to the first embodiment, the description thereof will be omitted.

本実施形態では、テキストデータが入力されると、生成部１０６－２による音声生成処理（音声合成処理）が実行される。すなわち、生成部１０６－２は、テキストデータに対応する音声を生成する（ステップＳ２０５）。 In the present embodiment, when the text data is input, the voice generation process (speech synthesis process) by the generation unit 106-2 is executed. That is, the generation unit 106-2 generates the voice corresponding to the text data (step S205).

After the voice is generated (step S205), after the additional information is added (step S204), and when additional information is already attached to the text data (step S202: Yes), the modulation unit 103-2 extracts the emphasized portion from the generated voice (step S206). For example, the modulation unit 103-2 refers to the additional information to specify the emphasized portion of the text data and, from the correspondence between the text data and the generated voice, extracts the emphasized portion of the voice that corresponds to the specified emphasized portion of the text data. The modulation unit 103-2 executes modulation processing on the extracted emphasized portion of the voice (step S207). The modulation unit 103-2 does not perform modulation processing on voice other than the emphasized portion.
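The extraction and modulation in steps S206 and S207 could be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: the function name `modulate_emphasis`, the representation of the emphasized portion as a range of sample indices, and the use of polarity inversion (a 180-degree phase shift) as the modulation are all hypothetical choices made for the example.

```python
from typing import List, Tuple


def modulate_emphasis(
    samples: List[float], span: Tuple[int, int]
) -> Tuple[List[float], List[float]]:
    """Return (first_voice, second_voice) for two speakers.

    Only the samples inside `span` (start inclusive, end exclusive) are
    modulated, and only in the second voice. Polarity inversion stands in
    for phase modulation; it is an assumed, simplified choice.
    """
    start, end = span
    first_voice = list(samples)   # left unmodulated
    second_voice = list(samples)  # modulated only inside the span
    for i in range(max(start, 0), min(end, len(samples))):
        second_voice[i] = -samples[i]  # 180-degree phase shift
    return first_voice, second_voice


first, second = modulate_emphasis([0.1, 0.2, 0.3, 0.4, 0.5], (1, 3))
```

Samples outside the span are identical in both voices, which mirrors the statement that voice other than the emphasized portion is not modulated.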

ステップＳ２０８は、第１の実施形態にかかる音声処理装置１００におけるステップＳ１０６と同様の処理なので、その説明を省略する。 Since step S208 is the same process as step S106 in the voice processing device 100 according to the first embodiment, the description thereof will be omitted.

このように、第２の実施形態にかかる音声処理装置では、テキストデータに対応する音声を生成した後に、音声の強調部分のピッチおよび位相の少なくとも一方を変調し、変調した音声を出力する。これにより、音声信号の強度を変えることなく、利用者の注意力を増大させることが可能となる。 As described above, in the voice processing apparatus according to the second embodiment, after generating the voice corresponding to the text data, at least one of the pitch and the phase of the emphasized portion of the voice is modulated, and the modulated voice is output. This makes it possible to increase the user's attention without changing the strength of the audio signal.

(Third embodiment)
In the first and second embodiments, text data is input, converted into voice, and output. Such embodiments are applicable, for example, to outputting predetermined text data for emergency disaster broadcasting. On the other hand, a situation is also conceivable in which voice uttered by a user is output for emergency disaster broadcasting. The voice processing device according to the third embodiment receives voice from a voice input device such as a microphone and performs modulation processing on the emphasized portion of the input voice.

FIG. 12 is a block diagram showing an example of the configuration of the voice processing device 100-3 according to the third embodiment. As shown in FIG. 12, the voice processing device 100-3 includes a storage unit 121, a reception unit 101-3, a specifying unit 102-3, a modulation unit 103-3, an output control unit 104, speakers 105-1 to 105-n, and a generation unit 106-2.

The third embodiment differs from the second embodiment in the functions of the reception unit 101-3, the specifying unit 102-3, and the modulation unit 103-3. The other configurations and functions are the same as in FIG. 10, the block diagram of the voice processing device 100-2 according to the second embodiment, so they are given the same reference numerals and their description is omitted here.

受付部１０１－３は、テキストデータのみでなく、マイクなどの音声入力装置から入力される音声を受け付ける。また、受付部１０１－３は、入力される音声のうち強調する部分の指定を受け付ける。例えば受付部１０１－３は、利用者による所定のボタンの押下を、押下後に入力される音声が強調する部分であることを示す指定として受け付ける。受付部１０１－３は、強調部分の開始および終了の指定を、開始から終了までに入力された音声が強調する部分であることを示す指定として受け付けてもよい。指定方法はこれらに限られるものではなく、音声のうち強調する部分を決定可能であればどのような方法であってもよい。以下では、音声のうち強調する部分の指定をトリガーという場合がある。 The reception unit 101-3 receives not only text data but also voice input from a voice input device such as a microphone. Further, the reception unit 101-3 receives the designation of the portion to be emphasized in the input voice. For example, the reception unit 101-3 accepts the user's pressing of a predetermined button as a designation indicating that the voice input after the pressing is the emphasized portion. The reception unit 101-3 may accept the designation of the start and end of the emphasized portion as a designation indicating that the voice input from the start to the end is the portion to be emphasized. The designation method is not limited to these, and any method may be used as long as it is possible to determine the part to be emphasized in the voice. In the following, the designation of the emphasized part of the voice may be referred to as a trigger.

The specifying unit 102-3 additionally has a function of specifying the emphasized portion of the voice based on the received designation (trigger).
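As one way to picture how start and end designations translate into emphasized portions, the sketch below converts a list of trigger events into time spans. The event format `(time, kind)`, the function name `spans_from_triggers`, and the rule that an unmatched start runs to the end of the input are assumptions made for the example; the embodiment does not prescribe them.

```python
from typing import List, Optional, Tuple


def spans_from_triggers(
    events: List[Tuple[float, str]], total_duration: float
) -> List[Tuple[float, float]]:
    """Convert (time, kind) trigger events into emphasized time spans.

    kind is "start" or "end". A "start" without a matching "end" is
    assumed to run until the end of the input voice.
    """
    spans: List[Tuple[float, float]] = []
    open_start: Optional[float] = None
    for time, kind in sorted(events):
        if kind == "start" and open_start is None:
            open_start = time
        elif kind == "end" and open_start is not None:
            spans.append((open_start, time))
            open_start = None
    if open_start is not None:
        spans.append((open_start, total_duration))
    return spans


spans = spans_from_triggers([(1.0, "start"), (2.5, "end"), (4.0, "start")], 6.0)
```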

変調部１０３－３は、生成部１０６－２により生成された音声、または、入力された音声のうち、強調部分の音声に対して変調処理を行う。 The modulation unit 103-3 performs modulation processing on the voice of the emphasized portion of the voice generated by the generation unit 106-2 or the input voice.

Next, voice output processing by the voice processing device 100-3 according to the third embodiment configured in this way will be described with reference to FIG. 13. FIG. 13 is a flowchart showing an example of the voice output processing in the third embodiment.

The reception unit 101-3 determines whether voice input is prioritized (step S301). Voice input priority is a designation indicating that voice, rather than text data, is to be input and output. For example, when a button for designating voice input priority is pressed, the reception unit 101-3 determines that voice input is prioritized.

音声入力優先であるかの判定方法はこれに限られるものではない。例えば、音声入力優先であるかを示す事前に保存された情報を参照して判定してもよい。また、テキストデータは入力せず、音声入力のみとする場合は、音声入力優先の指定や判定（ステップＳ３０１）を実行しなくてもよい。この場合、後述するテキストデータに基づく付加処理（ステップＳ３０６）も実行しなくてもよい。 The method for determining whether the voice input is prioritized is not limited to this. For example, it may be determined by referring to the information stored in advance indicating whether the voice input is prioritized. Further, when the text data is not input and only the voice input is performed, it is not necessary to execute the voice input priority designation or determination (step S301). In this case, it is not necessary to execute the additional process (step S306) based on the text data described later.

When voice input is prioritized (step S301: Yes), the reception unit 101-3 accepts voice input (step S302). The specifying unit 102-3 determines whether a designation (trigger) of the portion of the voice to be emphasized has been input (step S303).

When no trigger has been input (step S303: No), the specifying unit 102-3 specifies the emphasized portion of the voice (step S304). For example, the specifying unit 102-3 collates the input voice with voice data registered in advance, and specifies voice that matches or is similar to the registered voice data as the emphasized portion. The specifying unit 102-3 may also specify the emphasized portion by collating text data obtained by speech recognition of the input voice with predetermined data indicating emphasized portions.
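The text-based collation mentioned for step S304 might look like the sketch below. It assumes that speech recognition has already produced a list of words; the function name `find_emphasized_words` and the case-insensitive exact match are illustrative assumptions (the embodiment also allows matching on similarity, which is not shown here).

```python
from typing import List, Set


def find_emphasized_words(recognized: List[str], registered: Set[str]) -> List[int]:
    """Return indices of recognized words that match a registered emphasis word.

    Matching is case-insensitive and exact; this is an assumed rule.
    """
    lowered = {word.lower() for word in registered}
    return [i for i, word in enumerate(recognized) if word.lower() in lowered]


indices = find_emphasized_words(["Please", "Evacuate", "now"], {"evacuate"})
```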

When it is determined in step S303 that a trigger has been input (step S303: Yes), and after the emphasized portion is specified in step S304, the specifying unit 102-3 adds additional information indicating the emphasized portion to the data of the input voice (step S305). Any method of adding the additional information may be used as long as it allows the voice to be identified as an emphasized portion.

ステップＳ３０１で音声入力優先でないと判定された場合（ステップＳ３０１：Ｎｏ）、テキストに基づく付加処理が実行される（ステップＳ３０６）。この処理は、例えば図１１のステップＳ２０１からステップＳ２０５までと同様の処理で実現できる。 When it is determined in step S301 that the voice input priority is not given (step S301: No), additional processing based on the text is executed (step S306). This process can be realized, for example, by the same process as in steps S201 to S205 of FIG.
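The branching among steps S301 to S306 can be summarized in a few lines. The sketch below only traces which steps run before the extraction step; the function name `steps_before_extraction` and the string labels are hypothetical and exist purely to make the control flow explicit.

```python
from typing import List


def steps_before_extraction(voice_priority: bool, trigger_input: bool) -> List[str]:
    """Return the step labels executed before the emphasized portion is extracted."""
    if not voice_priority:
        # Text-based additional processing replaces the voice input path.
        return ["S301: No", "S306"]
    steps = ["S301: Yes", "S302"]
    if trigger_input:
        steps += ["S303: Yes", "S305"]
    else:
        # Without a trigger, the emphasized portion is specified first.
        steps += ["S303: No", "S304", "S305"]
    return steps


path = steps_before_extraction(voice_priority=True, trigger_input=False)
```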

変調部１０３－３は、生成された音声から強調部分を抽出する（ステップＳ３０７）。例えば変調部１０３－３は、付加情報を参照して音声の強調部分を抽出する。ステップＳ３０６を実行した場合は、変調部１０３－３は、図１１のステップＳ２０６と同様の処理により強調部分を抽出する。 The modulation unit 103-3 extracts the emphasized portion from the generated voice (step S307). For example, the modulation unit 103-3 extracts the emphasized portion of the voice by referring to the additional information. When step S306 is executed, the modulation unit 103-3 extracts the emphasized portion by the same processing as in step S206 of FIG.

ステップＳ３０８からステップＳ３０９までは、第２の実施形態にかかる音声処理装置１００－２におけるステップＳ２０７からステップＳ２０８までと同様の処理なので、その説明を省略する。 Since steps S308 to S309 are the same processes as steps S207 to S208 in the voice processing device 100-2 according to the second embodiment, the description thereof will be omitted.

As described above, the voice processing device according to the third embodiment specifies the emphasized portion of the input voice by a trigger or the like, modulates at least one of the pitch and the phase of the emphasized portion, and outputs the modulated voice. This makes it possible to increase the user's attention without changing the intensity of the audio signal.

(Fourth embodiment)
In the above embodiments, the emphasized portion is specified by referring to, for example, additional information and a trigger. The method of specifying the emphasized portion is not limited to these. The voice processing device of the fourth embodiment specifies, as the emphasized portion, one or more of the voices (partial voices) included in the voice to be output, based on the attributes of the partial voices.

以下では、音声による学習のためのアプリケーション、または、テキストデータを音声として出力するアプリケーションとして音声処理装置を実現した例を説明する。音声による学習は、例えば、音声による外国語の学習、および、教科の内容を音声により出力する学習など、音声を用いた任意の学習を含む。テキストデータを音声として出力するアプリケーションは、例えば、書籍の内容を読み上げて音声により出力する朗読アプリケーションを含む。適用可能なアプリケーションはこれらに限られるものではない。 In the following, an example of realizing a voice processing device as an application for learning by voice or an application for outputting text data as voice will be described. The learning by voice includes arbitrary learning using voice, for example, learning of a foreign language by voice and learning to output the contents of a subject by voice. An application that outputs text data as voice includes, for example, a reading application that reads out the contents of a book and outputs it by voice. Applicable applications are not limited to these.

音声による学習のためのアプリケーションに適用することにより、例えば、学習の対象となる部分を適切に強調し、学習効果をより増大させることが可能となる。また、テキストデータを音声として出力するアプリケーションに適用することにより、例えば、音声の特定の部分に注意を向けさせることが可能となる。また、朗読アプリケーションに適用することにより、例えば、物語の臨場感をより増大させることが可能となる。 By applying it to an application for learning by voice, for example, it is possible to appropriately emphasize the part to be learned and further increase the learning effect. Further, by applying it to an application that outputs text data as voice, it is possible to draw attention to a specific part of voice, for example. Further, by applying it to a reading application, for example, it becomes possible to further increase the presence of the story.

FIG. 14 is a block diagram showing an example of the configuration of the voice processing device 100-4 according to the fourth embodiment. As shown in FIG. 14, the voice processing device 100-4 includes a storage unit 121-4, a display unit 122-4, a reception unit 101-4, a specifying unit 102-4, a modulation unit 103-4, an output control unit 104-4, and speakers 105-1 to 105-n. The speakers 105-1 to 105-n are the same as in FIG. 1, the block diagram of the voice processing device 100 according to the first embodiment, so they are given the same reference numerals and their description is omitted here.

記憶部１２１－４は、出力させる音声に含まれる部分音声の属性の一例として出力回数をさらに記憶する点が、第１の実施形態の記憶部１２１と異なっている。図１５は、記憶部１２１－４に記憶されるデータの構造の一例を示す図である。図１５は、学習の対象とする部分音声を示すデータのデータ構造の一例を示す。図１５に示すように、このデータは、音声ＩＤと、単語と、時間と、出力回数と、を含む。 The storage unit 121-4 is different from the storage unit 121 of the first embodiment in that the number of output times is further stored as an example of the attribute of the partial voice included in the voice to be output. FIG. 15 is a diagram showing an example of the structure of data stored in the storage unit 121-4. FIG. 15 shows an example of a data structure of data showing a partial voice to be learned. As shown in FIG. 15, this data includes a voice ID, a word, a time, and the number of outputs.

音声ＩＤは、出力対象となる音声を識別する識別情報である。例えば、数値、および、音声を記憶するファイルのファイル名などを音声ＩＤとすることができる。 The voice ID is identification information that identifies the voice to be output. For example, a numerical value, a file name of a file for storing voice, or the like can be used as a voice ID.

A word is one example of a learning target; other information may also be used as a learning target. For example, a target other than a word, such as a sentence or chapter containing multiple words, may be used together with or instead of words. The words stored in the storage unit 121-4 may be a subset of the words included in the voice, selected by the user or the like, or may be all the words included in the voice. Examples of how words are selected will be described later.

時間は、単語に対応する部分音声の音声内での位置を示す。部分音声の位置を特定できる情報であれば、時間以外の情報を記憶してもよい。 Time indicates the position of the partial speech corresponding to the word in the speech. Information other than time may be stored as long as the information can specify the position of the partial voice.

単語および時間は、例えば、学習に用いる音声を音声認識することにより得られる。音声処理装置１００－４は、他の装置で予め生成された図１５のようなデータを取得して記憶部１２１－４に記憶してもよい。音声処理装置１００－４が、入力された音声を音声認識して得られたデータを記憶部１２１－４に記憶してもよい。 Words and times are obtained, for example, by speech recognition of speech used for learning. The voice processing device 100-4 may acquire data as shown in FIG. 15 generated in advance by another device and store it in the storage unit 121-4. The voice processing device 100-4 may store the data obtained by voice recognition of the input voice in the storage unit 121-4.

出力回数は、単語に対応する部分音声を出力した回数を示す。例えば、学習が開始されてからの部分音声を出力した回数の累積値が、出力回数として記憶部１２１－４に記憶される。なお出力回数は部分音声の属性の一例であり、出力回数以外の情報を部分音声の属性として用いてもよい。他の属性の例については後述する。 The number of outputs indicates the number of times that the partial voice corresponding to the word is output. For example, the cumulative value of the number of times the partial voice is output since the learning is started is stored in the storage unit 121-4 as the number of times of output. The number of outputs is an example of the attribute of the partial voice, and information other than the number of outputs may be used as the attribute of the partial voice. Examples of other attributes will be described later.
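A record with the four fields described for FIG. 15 could be modeled as below. The class name `LearningTarget`, the field names, the use of seconds for the time field, and the sample values are all assumptions for illustration; they are not taken from the figure.

```python
from dataclasses import dataclass


@dataclass
class LearningTarget:
    voice_id: str          # identifies the voice, e.g. a file name
    word: str              # the learning target
    time: float            # position of the partial voice (seconds, assumed unit)
    output_count: int = 0  # cumulative number of outputs; 0 at registration


# Made-up sample rows, not the contents of FIG. 15.
targets = [
    LearningTarget("lesson1.wav", "evacuate", 3.2),
    LearningTarget("lesson1.wav", "shelter", 7.8, output_count=4),
]
```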

図１４に戻り、表示部１２２－４は、各種処理で用いられるデータを表示する表示装置である。表示部１２２－４は、例えば液晶ディスプレイなどにより構成することができる。 Returning to FIG. 14, the display unit 122-4 is a display device that displays data used in various processes. The display unit 122-4 can be configured by, for example, a liquid crystal display.

受付部１０１－４は、学習の対象となる単語の指定などをさらに受け付ける点が第１の実施形態の受付部１０１と異なっている。 The reception unit 101-4 is different from the reception unit 101 of the first embodiment in that it further accepts the designation of a word to be learned.

The specifying unit 102-4 specifies, as the emphasized portion, one or more of the partial voices included in the voice, based on the attributes of the partial voices. For example, when the number of outputs is used as the attribute, the specifying unit 102-4 specifies partial voices whose number of outputs is equal to or less than a threshold as the emphasized portion. As a result, for example, words whose learning is considered insufficient because they have been output only a few times are preferentially emphasized, which can further improve the learning effect. The same effect can be obtained when the voice output time (for example, the cumulative output time since the start of learning) is used as the attribute instead of the number of outputs.
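The threshold rule can be written in a few lines. In this sketch the mapping from word to output count and the function name `specify_emphasized` are assumptions; only the comparison "equal to or less than the threshold" comes from the description.

```python
from typing import Dict, List


def specify_emphasized(output_counts: Dict[str, int], threshold: int) -> List[str]:
    """Return the words whose output count is at or below the threshold."""
    return [word for word, count in output_counts.items() if count <= threshold]


emphasized = specify_emphasized({"evacuate": 0, "shelter": 4, "route": 2}, threshold=2)
```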

変調部１０３－４は、属性に基づいて強調部分を変調する度合い（変調強度）を変更する点が、第１の実施形態の変調部１０３と異なっている。例えば変調部１０３－４は、出力回数が小さい部分音声は、変調強度がより大きくなるように、第１音声および第２音声の少なくとも一方を変調する。変調強度は、出力回数に応じて線形に変更してもよいし、非線形となるように変更してもよい。変調部１０３－４は、強調部分に含まれる各部分の変調強度を相互に異ならせてもよい。例えば、単語のアクセント部分のみを強調するように変調強度を制御してもよい。なお、属性に基づいて変調強度を変更しないように構成してもよい。この場合は第１の実施形態と同様の変調部１０３を備えればよい。 The modulation unit 103-4 is different from the modulation unit 103 of the first embodiment in that the degree of modulation (modulation intensity) of the emphasized portion is changed based on the attribute. For example, the modulation unit 103-4 modulates at least one of the first voice and the second voice so that the partial voice having a small number of outputs has a higher modulation intensity. The modulation intensity may be changed linearly or non-linearly depending on the number of outputs. The modulation unit 103-4 may have different modulation intensities of each portion included in the emphasized portion. For example, the modulation intensity may be controlled so as to emphasize only the accent portion of the word. It should be noted that the modulation intensity may not be changed based on the attribute. In this case, the same modulation unit 103 as in the first embodiment may be provided.
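To illustrate the difference between a linear and a non-linear dependence of modulation intensity on the number of outputs, the sketch below defines one mapping of each kind. Both formulas, the parameter names `zero_at` and `decay`, and the normalization to a maximum of 1.0 are assumptions chosen for the example; the description only requires that fewer outputs give a higher intensity.

```python
import math


def linear_intensity(count: int, zero_at: int, max_intensity: float = 1.0) -> float:
    """Fall linearly from max_intensity at count 0 to 0 at count == zero_at."""
    if zero_at <= 0:
        return 0.0
    return max_intensity * max(0.0, 1.0 - count / zero_at)


def exponential_intensity(
    count: int, decay: float = 0.5, max_intensity: float = 1.0
) -> float:
    """Fall off exponentially as the output count grows (a non-linear mapping)."""
    return max_intensity * math.exp(-decay * count)


half = linear_intensity(2, zero_at=4)
```

With either mapping, a word that has never been output receives the maximum intensity, and the intensity decreases as the word is output more often.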

出力制御部１０４－４は、表示部１２２－４に対する各種データの出力（表示）を制御する機能をさらに備える点が、第１の実施形態の出力制御部１０４と異なっている。 The output control unit 104-4 is different from the output control unit 104 of the first embodiment in that it further includes a function of controlling the output (display) of various data to the display unit 122-4.

Next, voice output processing by the voice processing device 100-4 according to the fourth embodiment configured in this way will be described with reference to FIG. 16. FIG. 16 is a flowchart showing an example of the voice output processing in the fourth embodiment.

受付部１０１－４は、テキストデータの入力を受け付ける（ステップＳ４０１）。特定部１０２－４は、テキストデータから、属性を参照して強調部分を特定する（ステップＳ４０２）。例えば出力回数を属性とする場合、特定部１０２－４は、記憶部１２１－４に記憶された出力回数が閾値以下である単語を、強調部分として特定する。 The reception unit 101-4 accepts the input of text data (step S401). The specifying unit 102-4 identifies the emphasized part from the text data by referring to the attribute (step S402). For example, when the number of outputs is used as an attribute, the specifying unit 102-4 specifies a word whose output number stored in the storage unit 121-4 is equal to or less than a threshold value as an emphasized portion.

The modulation unit 103-4 generates voice in which the specified emphasized portion is modulated (step S403). For example, the modulation unit 103-4 generates voices (a first voice and a second voice) that correspond to the specified emphasized portion (a word or the like) and in which the modulation target is modulated so that it differs between the two voices in the emphasized portion. At this time, the modulation unit 103-4 may generate the first voice and the second voice so that the modulation intensity corresponds to the attribute.

出力制御部１０４－４は、スピーカ１０５ごとに出力する音声を決定し、決定した音声を出力させる（ステップＳ４０４）。各スピーカ１０５は、出力制御部１０４－４の指示に従い音声を出力する。 The output control unit 104-4 determines the sound to be output for each speaker 105, and outputs the determined sound (step S404). Each speaker 105 outputs sound according to the instruction of the output control unit 104-4.

Next, an example of realizing the voice processing device 100-4 as an application for language learning will be described. The learning application has, for example, the following functions.
(1) A function to specify the part to be learned, that is, the emphasized portion, in the voice to be output.
(2) A function to play the voice. Functions such as pause, rewind, and fast forward may also be provided.
(3) A function to check whether the emphasized portion has been understood.
(4) A function to change attributes according to learning results and the like.

FIG. 17 is a diagram showing an example of a designation screen for designating the part to be learned. As shown in FIG. 17, the designation screen 1700 displays text data corresponding to the voice to be output. The designation screen 1700 is displayed on the display unit 122-4 by, for example, the output control unit 104-4. The designation screen 1700 is an example of a screen that realizes function (1) above.

The user selects the part to be learned (a word, a sentence, or the like) from the text data displayed on the designation screen 1700 using a mouse, a touch panel, or the like. Word 1701 is an example of a part selected in this way.

登録ボタン１７１１が押下されると、選択された単語が、学習の対象として記憶部１２１－４に記憶される。図１５は、このようにして記憶されたデータの一例を示す。図１５の出力回数は、登録時点では例えば「０」に設定される。キャンセルボタン１７１２が押下された場合は、例えば、選択が解除され、前の画面が表示される。 When the registration button 1711 is pressed, the selected word is stored in the storage unit 121-4 as a learning target. FIG. 15 shows an example of the data stored in this way. The number of outputs in FIG. 15 is set to, for example, "0" at the time of registration. When the cancel button 1712 is pressed, for example, the selection is canceled and the previous screen is displayed.

The method of designating the learning target is not limited to the method shown in FIG. 17. For example, when registration is instructed (by pressing a button or the like) while the voice is being output, the part (a word or the like) being output at the time of the instruction may be registered as a learning target. Alternatively, data such as that shown in FIG. 15 may be generated by selecting one or more words to be learned independently of the voice and extracting the selected words from the voice (or from the text data corresponding to the voice).

学習を開始する前までに、図１７に示す方法などにより学習の対象とする箇所が指定され、図１５に示すようなデータが生成されていればよい。学習する際に用いられる画面の例について以下に説明する。 By the time the learning is started, it is sufficient that the part to be learned is specified by the method shown in FIG. 17 and the data as shown in FIG. 15 is generated. An example of a screen used for learning will be described below.

図１８は、学習画面の一例を示す図である。図１８に示すように、学習画面１８００は、カーソル１８０１と、出力制御ボタン１８０２と、ＯＫボタン１８１１と、キャンセルボタン１８１２と、を含む。 FIG. 18 is a diagram showing an example of a learning screen. As shown in FIG. 18, the learning screen 1800 includes a cursor 1801, an output control button 1802, an OK button 1811, and a cancel button 1812.

出力制御ボタン１８０２は、音声の再生開始、一時停止、再生の停止、巻き戻し、および、早送りなどのために用いられる。カーソル１８０１は、現在再生されている音声に対応する箇所を示すための情報である。図１８では矩形のカーソル１８０１の例が示されているが、カーソル１８０１の表示態様はこれに限られない。 The output control button 1802 is used for audio reproduction start, pause, reproduction stop, rewind, fast forward, and the like. The cursor 1801 is information for indicating a portion corresponding to the currently reproduced voice. Although FIG. 18 shows an example of a rectangular cursor 1801, the display mode of the cursor 1801 is not limited to this.

ＯＫボタン１８１１が押下されると、学習処理が終了する。ＯＫボタン１８１１が押下された場合に、それまでに再生された各単語の出力回数に１加算して記憶部１２１－４のデータを更新してもよい。例えば巻き戻し機能により、ある単語の再生が繰り返されると、この単語の出力回数が増加する。特定部１０２－４は、例えば繰り返し再生された単語の出力回数が閾値を超えた場合、この単語を強調部分として特定せず、出力回数が閾値以下の単語のみを強調部分として特定する。これにより、学習の対象とする単語を適切に特定して学習効果を高めることが可能となる。 When the OK button 1811 is pressed, the learning process ends. When the OK button 1811 is pressed, the data of the storage unit 121-4 may be updated by adding 1 to the number of outputs of each word reproduced so far. For example, the rewind function increases the number of times a word is output when the word is repeatedly played. For example, when the number of times of output of a repeatedly reproduced word exceeds the threshold value, the specifying unit 102-4 does not specify this word as an emphasized part, and specifies only a word whose number of times of output is less than or equal to the threshold value as an emphasized part. This makes it possible to appropriately identify the word to be learned and enhance the learning effect.
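One possible reading of the update performed when the OK button is pressed is sketched below: each playback of a registered word adds one to its output count, so a word replayed through rewind is counted more than once. The function name `commit_session`, the playback log as a list of words, and the per-playback increment are assumptions; the description leaves the exact bookkeeping open.

```python
from collections import Counter
from typing import Dict, List


def commit_session(stored: Dict[str, int], played: List[str]) -> Dict[str, int]:
    """Return updated output counts after OK is pressed.

    `played` lists every word in the order it was reproduced, so a word
    replayed via rewind appears more than once. Unregistered words are ignored.
    """
    updated = dict(stored)  # leave the stored counts untouched
    for word, times in Counter(played).items():
        if word in updated:
            updated[word] += times
    return updated


before = {"evacuate": 0, "shelter": 4}
after = commit_session(before, ["evacuate", "shelter", "evacuate", "the"])
```

If the cancel button is pressed instead, the caller would simply discard the playback log and keep the stored counts.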

キャンセルボタン１８１２が押下された場合は、例えば、前の画面が表示される。キャンセルボタン１８１２が押下された場合には出力回数を更新しないように構成してもよい。 When the cancel button 1812 is pressed, for example, the previous screen is displayed. When the cancel button 1812 is pressed, the number of outputs may not be updated.

FIG. 19 is a diagram showing another example of the learning screen. The learning screen 1900 of FIG. 19 is an example of a screen on which a learning result can be specified for each word. A cursor 1901 is displayed on the word corresponding to the voice being played, and a designation window 1910 corresponding to the cursor 1901 is displayed. As playback of the voice progresses, the cursor 1901 moves and the corresponding designation window 1910 moves with it.

The designation window 1910 includes an OK button and a cancel button. For example, when the OK button is pressed, 1 is added to the number of outputs of the corresponding word and the data in the storage unit 121-4 is updated. When the cancel button is pressed, the number of outputs is not updated. The designation window 1910 may instead include only the OK button, in which case the number of outputs is not updated unless the OK button is pressed.

図２０は、学習画面の他の例を示す図である。図２０の学習画面２０００では、学習する対象（単語など）が非表示とされ、正解を選択させる選択ウインドウ２０１０が表示される。選択ウインドウ２０１０では、対応する単語の正しい表記と、その他の表記とが、選択可能に表示される。例えば正しい表記が選択された場合に、対応する単語の出力回数に１加算して記憶部１２１－４のデータが更新される。正しい表記が選択されなかった場合には、出力回数は更新されない。このような構成の場合、出力回数の代わりに、正解回数を属性として記憶してもよい。 FIG. 20 is a diagram showing another example of the learning screen. In the learning screen 2000 of FIG. 20, the object to be learned (words and the like) is hidden, and the selection window 2010 for selecting the correct answer is displayed. In the selection window 2010, the correct notation of the corresponding word and other notations are displayed selectably. For example, when the correct notation is selected, the data in the storage unit 121-4 is updated by adding 1 to the number of outputs of the corresponding word. If the correct notation is not selected, the output count will not be updated. In such a configuration, the number of correct answers may be stored as an attribute instead of the number of outputs.

図２１は、学習画面の他の例を示す図である。図２１の学習画面２１００は、選択肢を下部に表示する画面の例である。学習する対象（単語など）の表記は非表示とされ、代わりに「Ｑ１」、「Ｑ２」、および、「Ｑ３」などのように、下部の選択肢とを対応づける情報が表示される。ユーザは、音声が再生されているとき、または、音声の再生が完了したときに、選択肢から表記を選択することができる。 FIG. 21 is a diagram showing another example of the learning screen. The learning screen 2100 of FIG. 21 is an example of a screen displaying options at the bottom. The notation of the object to be learned (words, etc.) is hidden, and instead, information associated with the lower option is displayed, such as "Q1", "Q2", and "Q3". The user can select a notation from the choices when the audio is playing or when the audio playback is complete.

次に、属性の他の例について説明する。 Next, another example of the attribute will be described.

学校などでは、予め定められた計画に従い学習を進めるために、計画の進行に応じて学習の対象が変更される場合がある。そこで、学習の開始、例えば、音声出力の開始からの経過時間を属性としてもよい。この場合、特定部１０２－４は、経過時間に応じて異なる強調部分を特定する。例えば記憶部１２１－４は、図１７の出力回数の代わりに、経過時間の範囲を単語ごとに記憶する。特定部１０２－４は、実際の音声出力の開始からの経過時間が、記憶された経過時間の範囲に含まれる単語を、強調部分として特定する。さらに、音声等の繰り返し利用回数、例えば、ファイルの再生回数を属性として加味してもよい。 In schools and the like, in order to proceed with learning according to a predetermined plan, the target of learning may be changed according to the progress of the plan. Therefore, the elapsed time from the start of learning, for example, the start of voice output may be used as an attribute. In this case, the specifying unit 102-4 identifies different emphasized parts according to the elapsed time. For example, the storage unit 121-4 stores the range of elapsed time for each word instead of the number of outputs in FIG. The specifying unit 102-4 identifies a word whose elapsed time from the start of the actual voice output is included in the range of the stored elapsed time as an emphasized portion. Further, the number of times of repeated use of voice or the like, for example, the number of times of playing a file may be added as an attribute.
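When the attribute is an elapsed-time range, specifying the emphasized portion reduces to a range check per word. In the sketch below, the mapping from word to `(start, end)` range, the use of seconds, the inclusive bounds, and the function name `specify_by_elapsed` are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def specify_by_elapsed(
    ranges: Dict[str, Tuple[float, float]], elapsed: float
) -> List[str]:
    """Return the words whose stored range contains the elapsed time (inclusive)."""
    return [word for word, (start, end) in ranges.items() if start <= elapsed <= end]


ranges = {"evacuate": (0.0, 600.0), "shelter": (600.0, 1200.0), "route": (1200.0, 1800.0)}
current = specify_by_elapsed(ranges, elapsed=900.0)
```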

A unit of learning, such as a learning period or a study unit, may be used as the attribute. For example, the storage unit 121-4 stores, for each word, information identifying one of a plurality of learning periods (learning period 1, learning period 2, learning period 3, and so on) instead of the number of outputs in FIG. 17. The specifying unit 102-4 specifies, as the emphasized portion, words corresponding to the learning period designated by the user or the like, or to the learning period determined based on a predetermined plan and the date and time.

The type of learning target may be used as the attribute. When applied to the study of history, for example, the storage unit 121-4 stores, as the attribute in place of the number of outputs in FIG. 17, which type (such as an era or a keyword) each learning target (a word, a sentence, or the like) represents. The specifying unit 102-4 specifies, as the emphasized portion, a word corresponding to a type designated by the user or the like, or to a type determined based on a predetermined plan and the date and time. When applied to language learning or the like, the storage unit 121-4 may store the part of speech of each word as the type (attribute).

The place where the voice is output may be used as the attribute. When applied to a reading-aloud application, for example, a different emphasized portion may be specified according to at least one of the place where the application is executed and the number of times the voice has been output. This makes it possible to output the voice in a way that keeps the user from getting bored even when the content of the same book is read again.

A priority determined for each learning target may be used as the attribute. The priority indicates the degree to which the target (the partial voice corresponding to the target) is prioritized. The priority may be determined by any method. For example, the user may select a word and designate its priority. The importance (or difficulty) of a word determined in advance in word dictionary data or the like may also be used as the priority. The priority need not be fixed and may be changed dynamically.

For example, the specifying unit 102-4 specifies, as the emphasized portion, the partial voice corresponding to a word whose priority is equal to or higher than a threshold. The specifying unit 102-4 may instead specify, as the emphasized portion, the partial voice corresponding to a word whose priority equals a designated value or falls within a designated range. The threshold, the designated value, and the designated range may be fixed values or may be designated by the user or the like.

For example, the storage unit 121-4 stores a priority for each word in place of the number of outputs in FIG. 17. For example, a priority of "1" is set for the words "mission" and "knowledge", and a priority of "2" is set for the word "aspiration". When the threshold is then set to "1", for example, the specifying unit 102-4 specifies the partial voices corresponding to "mission" and "knowledge" as the emphasized portions. If a range of priorities can be designated, the emphasized portion can be changed according to, for example, the importance (difficulty) of the words.
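
A minimal sketch of this selection is given below. It rests on one assumption that the text does not state: for the example above to agree with the "equal to or higher than the threshold" rule, a smaller number is taken to denote a higher priority (rank 1 is the highest). The function names are hypothetical.

```python
from typing import Dict, List

# Priorities from the example above. ASSUMPTION: 1 is the highest rank.
PRIORITIES: Dict[str, int] = {"mission": 1, "knowledge": 1, "aspiration": 2}


def select_by_threshold(priorities: Dict[str, int], threshold: int) -> List[str]:
    """Words whose rank is at or above the threshold (numerically <= it)."""
    return [word for word, rank in priorities.items() if rank <= threshold]


def select_by_range(priorities: Dict[str, int], low: int, high: int) -> List[str]:
    """Words whose priority lies in the designated range [low, high]."""
    return [word for word, rank in priorities.items() if low <= rank <= high]
```

With a threshold of 1, `select_by_threshold(PRIORITIES, 1)` yields "mission" and "knowledge", matching the example; `select_by_range(PRIORITIES, 2, 2)` picks out "aspiration" alone, which illustrates how a designated range shifts the emphasis.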

The priority may be changed according to other information. For example, the priority may be changed according to the elapsed time from the start of voice output. If the priority of words to be learned is raised and the priority of words to be excluded is lowered as time elapses, learning according to a plan as described above becomes possible.

Alternatively, the user may be asked to select the correct answer on a screen such as those of FIG. 20 and FIG. 21, and the priority may be lowered when the answer is correct and raised when it is not. This makes it possible to appropriately emphasize targets that have not yet been learned sufficiently. A similar function can also be realized by using the number of correct answers or the like as the attribute.
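
The answer-driven adjustment can be sketched as follows. The step size of one rank, the bound `lowest_rank`, and the convention that rank 1 is the highest priority are assumptions made for illustration; the text only says that the priority is lowered on a correct answer and raised on an incorrect one.

```python
from typing import Dict


def update_priority(priorities: Dict[str, int], word: str, correct: bool,
                    lowest_rank: int = 3) -> int:
    """Adjust one word's rank after an answer and return the new rank.

    ASSUMPTION: rank 1 is the highest priority, lowest_rank the lowest.
    """
    rank = priorities[word]
    if correct:
        rank = min(rank + 1, lowest_rank)  # correct answer: lower the priority
    else:
        rank = max(rank - 1, 1)            # wrong answer: raise the priority
    priorities[word] = rank
    return rank
```

A word that is answered incorrectly drifts toward rank 1 and is therefore more likely to be selected as an emphasized portion the next time the voice is output.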

The description so far has covered an example in which, as in the first embodiment, the emphasized portion is modulated while the voice corresponding to the text data is being generated. The modulation method is not limited to this. For example, as in the second embodiment, modulation processing may be applied to the part of an already generated voice that corresponds to the emphasized portion. Furthermore, the modulation method is not limited to modulating at least one of the pitch and the phase; other modulation methods may be applied.

As described above, the voice processing device according to the fourth embodiment modulates and outputs an emphasized portion that is changed according to the attribute. This makes it possible, for example, to improve the learning effect when the device is applied to a learning application and to improve the sense of presence when it is applied to a reading-aloud application.

As described above, according to the first to fourth embodiments, modulating at least one of the pitch and the phase of the voice before output makes it possible to increase the user's attention without changing the intensity of the voice signal.
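
The point that phase can be changed without changing signal intensity can be checked on a toy signal. The sketch below uses a pure sine tone as a stand-in for speech, which is a simplification and not the modulation processing of the embodiments; the sample rate, tone frequency, and function names are assumptions. It shows that inverting the polarity of a signal is equivalent to a 180° phase shift of the tone and leaves its RMS level unchanged.

```python
import math
from typing import List


def sine(freq_hz: float, n_samples: int, rate_hz: int = 16000,
         phase_deg: float = 0.0) -> List[float]:
    """Generate a unit-amplitude sine tone with an initial phase in degrees."""
    phase = math.radians(phase_deg)
    return [math.sin(2.0 * math.pi * freq_hz * i / rate_hz + phase)
            for i in range(n_samples)]


def invert_polarity(samples: List[float]) -> List[float]:
    """Flip the sign of every sample (polarity inversion)."""
    return [-s for s in samples]


def rms(samples: List[float]) -> float:
    """Root-mean-square level, used here as a simple intensity measure."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))
```

For example, `invert_polarity(sine(200.0, 1600))` matches `sine(200.0, 1600, phase_deg=180.0)` up to floating-point rounding, and both have the same `rms` as the original tone.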

Next, the hardware configuration of the voice processing device according to the first to fourth embodiments will be described with reference to FIG. 22. FIG. 22 is an explanatory diagram showing an example of the hardware configuration of the voice processing device according to the first to fourth embodiments.

The voice processing device according to the first to fourth embodiments includes a control device such as a CPU (Central Processing Unit) 51, storage devices such as a ROM (Read Only Memory) 52 and a RAM (Random Access Memory) 53, a communication I/F 54 that connects to a network and performs communication, and a bus 61 that connects these units.

The voice processing device in the first to fourth embodiments is a computer or an embedded system, and may have any configuration, such as a single device (a personal computer, a microcomputer, or the like) or a system in which a plurality of devices are connected via a network. The term "computer" in the present embodiments is not limited to a personal computer; it also covers arithmetic processing units and microcomputers included in information processing equipment, and collectively refers to any equipment or device capable of realizing the functions of the present embodiments by means of a program.

The program executed by the voice processing device according to the first to fourth embodiments is provided by being incorporated in advance in the ROM 52 or the like.

The program executed by the voice processing device according to the first to fourth embodiments may instead be recorded, as a file in an installable or executable format, on a computer-readable recording medium such as a CD-ROM (Compact Disk Read Only Memory), a flexible disk (FD), a CD-R (Compact Disk Recordable), a DVD (Digital Versatile Disk), a USB flash memory, an SD card, or an EEPROM (Electrically Erasable Programmable Read-Only Memory), and provided as a computer program product.

Furthermore, the program executed by the voice processing device according to the first to fourth embodiments may be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program executed by the voice processing device according to the first to fourth embodiments may also be provided or distributed via a network such as the Internet.

The program executed by the voice processing device according to the first to fourth embodiments can cause a computer to function as each unit of the voice processing device described above. In this computer, the CPU 51 can read the program from a computer-readable storage medium onto the main storage device and execute it.

Although several embodiments of the present invention have been described, these embodiments are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

100, 100-2, 100-3, 100-4 Voice processing device
101, 101-3, 101-4 Reception unit
102, 102-3, 102-4 Specifying unit
103, 103-2, 103-3, 103-4 Modulation unit
104, 104-4 Output control unit
105 Speaker
106-2 Generation unit
121, 121-4 Storage unit
122-4 Display unit

Claims

A voice processing device comprising:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the attribute is at least one of: the number of times the one or more voices included in the voice to be output have been output; the time during which the one or more voices included in the voice to be output are output; the elapsed time since output of the first voice and the second voice started; the place where the voice to be output is output; the type of a learning target that uses the voice to be output; and a learning period, which is a period of learning that uses the voice to be output and is determined based on a predetermined plan and the date and time.

A voice processing device comprising:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the specifying unit specifies the emphasized portion from input text data, and
the modulation unit generates the first voice and the second voice corresponding to the text data, in which the pitch of the emphasized portion of at least one of the first voice and the second voice is modulated so that the difference is 100 hertz or more.

A voice processing device comprising:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice;
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more; and
a generation unit that generates the first voice and the second voice corresponding to input text data,
wherein the specifying unit specifies the emphasized portion from the text data, and
the modulation unit modulates the pitch of the emphasized portion of at least one of the first voice and the second voice so that the frequency difference between the emphasized portion of the generated first voice and the emphasized portion of the generated second voice is 100 hertz or more.

A voice processing device comprising:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the modulation unit further modulates the phase of the emphasized portion of at least one of the first voice and the second voice so that the difference between the phase of the emphasized portion of the first voice and the phase of the emphasized portion of the second voice is 60° or more and 180° or less.

A voice processing device comprising:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the modulation unit further modulates the phase of at least one of the first voice and the second voice by inverting the polarity of the signal input to the first output unit or the second output unit.

The voice processing device according to any one of claims 1 to 5, wherein the modulation unit changes the degree of modulation of the emphasized portion based on the attribute.

The voice processing device according to any one of claims 2 to 5, wherein the attribute is a priority determined for the one or more voices included in the voice to be output.

A voice processing method comprising:
a specifying step of specifying, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation step of modulating the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the attribute is at least one of: the number of times the one or more voices included in the voice to be output have been output; the time during which the one or more voices included in the voice to be output are output; the elapsed time since output of the first voice and the second voice started; the place where the voice to be output is output; the type of a learning target that uses the voice to be output; and a learning period, which is a period of learning that uses the voice to be output and is determined based on a predetermined plan and the date and time.

A voice processing method comprising:
a specifying step of specifying, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation step of modulating the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the specifying step specifies the emphasized portion from input text data, and
the modulation step generates the first voice and the second voice corresponding to the text data, in which the pitch of the emphasized portion of at least one of the first voice and the second voice is modulated so that the difference is 100 hertz or more.

A voice processing method comprising:
a specifying step of specifying, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice;
a modulation step of modulating the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more; and
a generation step of generating the first voice and the second voice corresponding to input text data,
wherein the specifying step specifies the emphasized portion from the text data, and
the modulation step modulates the pitch of the emphasized portion of at least one of the first voice and the second voice so that the frequency difference between the emphasized portion of the generated first voice and the emphasized portion of the generated second voice is 100 hertz or more.

A voice processing method comprising:
a specifying step of specifying, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation step of modulating the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the modulation step further modulates the phase of the emphasized portion of at least one of the first voice and the second voice so that the difference between the phase of the emphasized portion of the first voice and the phase of the emphasized portion of the second voice is 60° or more and 180° or less.

A voice processing method comprising:
a specifying step of specifying, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation step of modulating the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the modulation step further modulates the phase of at least one of the first voice and the second voice by inverting the polarity of the signal input to the first output unit or the second output unit.

A program for causing a computer to function as:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the attribute is at least one of: the number of times the one or more voices included in the voice to be output have been output; the time during which the one or more voices included in the voice to be output are output; the elapsed time since output of the first voice and the second voice started; the place where the voice to be output is output; the type of a learning target that uses the voice to be output; and a learning period, which is a period of learning that uses the voice to be output and is determined based on a predetermined plan and the date and time.

A program for causing a computer to function as:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the specifying unit specifies the emphasized portion from input text data, and
the modulation unit generates the first voice and the second voice corresponding to the text data, in which the pitch of the emphasized portion of at least one of the first voice and the second voice is modulated so that the difference is 100 hertz or more.

A program for causing a computer to function as:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice;
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more; and
a generation unit that generates the first voice and the second voice corresponding to input text data,
wherein the specifying unit specifies the emphasized portion from the text data, and
the modulation unit modulates the pitch of the emphasized portion of at least one of the first voice and the second voice so that the frequency difference between the emphasized portion of the generated first voice and the emphasized portion of the generated second voice is 100 hertz or more.

A program for causing a computer to function as:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the modulation unit further modulates the phase of the emphasized portion of at least one of the first voice and the second voice so that the difference between the phase of the emphasized portion of the first voice and the phase of the emphasized portion of the second voice is 60° or more and 180° or less.

A program for causing a computer to function as:
a specifying unit that specifies, as an emphasized portion, any one or more of one or more voices included in a voice to be output, based on an attribute of the voice; and
a modulation unit that modulates the pitch of the emphasized portion of at least one of a first voice and a second voice so that the difference between the frequency of the emphasized portion of the first voice output to a first output unit and the frequency of the emphasized portion of the second voice output to a second output unit is 100 hertz or more,
wherein the modulation unit further modulates the phase of at least one of the first voice and the second voice by inverting the polarity of the signal input to the first output unit or the second output unit.