JP2018116096A - Voice processing program, voice processing method and voice processor

Info

Publication number: JP2018116096A
Application number: JP2017005278A
Authority: JP
Inventors: 太郎外川; Taro Togawa; 紗友梨香村; Sayuri Komura; 猛大谷; Takeshi Otani
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2017-01-16
Filing date: 2017-01-16
Publication date: 2018-07-26
Anticipated expiration: 2037-01-16
Also published as: JP6790851B2

Abstract

PROBLEM TO BE SOLVED: To correctly calculate the speech speed in voice data. SOLUTION: A voice processing program causes a computer to perform the following first to fifth processes. The first process detects a plurality of vowel sections included in an utterance section of input voice data. The second process calculates the time length of each of the detected vowel sections. The third process calculates a frequency distribution of the time lengths of the vowel sections and specifies the minimum value among those time lengths. The fourth process calculates the number of morae corresponding to the time length of each vowel section on the basis of the specified minimum time length and the frequency distribution. The fifth process controls an output signal corresponding to the utterance section of the input voice data in accordance with the calculated number of morae. SELECTED DRAWING: Figure 2

Description

The present invention relates to a voice processing program, a voice processing method, and a voice processing apparatus.

One known type of processing for voice data picked up by a microphone is calculation of the speech speed in the voice data. To calculate the speech speed, changes of vowels in the voice data are detected from changes in vocal tract characteristics such as formant frequencies, and the number of vowels (number of morae) per unit time is calculated (see, for example, Patent Documents 1 and 2).

Patent Document 1: JP 7-295588 A
Patent Document 2: JP 10-70790 A

Human speech may include long vowels, in which the same vowel continues. When the same vowel is uttered continuously, the vocal tract characteristics hardly change at the boundary between the vowels. Consequently, when vowel changes in voice data are detected from changes in vocal tract characteristics, a long vowel is detected as a single vowel. That is, a vowel section that is actually two morae long is treated as a one-mora vowel section, which makes it difficult to calculate the correct speech speed in the voice data.

In one aspect, an object of the present invention is to correctly calculate the speech speed in voice data.

A voice processing program according to one aspect causes a computer to execute the following first to fifth processes. The first process detects a plurality of vowel sections included in an utterance section of input voice data. The second process calculates the time length of each of the detected vowel sections. The third process calculates a frequency distribution of the time lengths of the vowel sections and specifies the minimum value among those time lengths. The fourth process calculates the number of morae corresponding to the time length of each vowel section on the basis of the specified minimum time length and the frequency distribution. The fifth process controls an output signal corresponding to the utterance section of the input voice data in accordance with the calculated number of morae.

According to the above aspect, the speech speed in voice data can be calculated correctly.

A diagram illustrating the functional configuration of the speech speed estimation apparatus according to the first embodiment.
A flowchart illustrating processing performed by the speech speed estimation apparatus according to the first embodiment.
A flowchart illustrating the process of detecting vowel sections.
A diagram illustrating the relationship between single vowels and long vowels in voice data.
A diagram illustrating an example of the frequency distribution of vowel sections.
A diagram illustrating the utterance content of an utterance section and the calculated number of morae in each vowel section.
A diagram illustrating an output example of the calculated speech speed.
A diagram illustrating another example of the method of calculating the number of morae.
A diagram illustrating a first application example of the speech speed estimation apparatus.
A diagram illustrating a second application example of the speech speed estimation apparatus.
A diagram illustrating the functional configuration of the speech speed estimation apparatus according to the second embodiment.
A flowchart illustrating the process of detecting vowel sections according to the second embodiment.
A flowchart illustrating a modification of the process of detecting vowel sections according to the second embodiment.
A diagram illustrating the system configuration of the call system according to the third embodiment.
A diagram illustrating the functional configuration of the speech speed adjustment unit according to the third embodiment.
A diagram illustrating the functional configuration of the speech speed control unit.
A flowchart illustrating processing performed by the speech speed adjustment unit according to the third embodiment.
A flowchart illustrating the process of controlling the speech speed of voice data.
A diagram illustrating the method of detecting the fundamental period.
A diagram illustrating the method of superimposing voice waveforms.
A diagram illustrating the weighting method used when superimposing voice waveforms.
A diagram illustrating the hardware configuration of a computer.

[First Embodiment]
FIG. 1 is a diagram illustrating the functional configuration of the speech speed estimation apparatus according to the first embodiment.

As illustrated in FIG. 1, the speech speed estimation apparatus 1 according to the present embodiment includes a voice acquisition unit 110, an utterance section detection unit 120, a vowel section detection unit 130, a mora number determination unit 140, a speech speed calculation unit 150, and an output unit 160. Although omitted from FIG. 1, the speech speed estimation apparatus 1 also includes a storage unit that stores various kinds of information including voice data.

The voice acquisition unit 110 acquires a voice signal (voice data) from the sound collection device 2.
The utterance section detection unit 120 detects utterance sections in the voice data. An utterance section is a section containing speech uttered by the person whose speech speed is to be estimated.

The vowel section detection unit 130 detects the vowel sections included in an utterance section of the voice data.
The mora number determination unit 140 calculates the number of morae in each vowel section on the basis of the length (time length) of each of the vowel sections detected from the voice data. The mora number determination unit 140 of the speech speed estimation apparatus 1 according to the present embodiment first calculates the time length of each vowel section, calculates a frequency distribution of those time lengths, and specifies the minimum value among the time lengths. The mora number determination unit 140 then calculates (determines) the number of morae corresponding to the time length of each vowel section on the basis of the specified minimum time length and the frequency distribution.

The speech speed calculation unit 150 calculates the speech speed (utterance speed) of an utterance section in the voice data on the basis of the time length of the utterance section and the number of morae in the vowel sections included in that utterance section.

The output unit 160 generates display data that visualizes the calculated speech speed and outputs the display data to the display device 3.

As illustrated in FIG. 1, the mora number determination unit 140 of the speech speed estimation apparatus 1 according to the present embodiment includes a frequency distribution calculation unit 141, a peak detection unit 142, and a mora number calculation unit 143.

The frequency distribution calculation unit 141 calculates, from the time lengths of the vowel sections, a frequency distribution indicating how often each time length of vowel section appears in the voice data.

The peak detection unit 142 detects the time lengths at which the appearance frequency reaches a peak (local maximum) in the frequency distribution of the vowel section time lengths.

The mora number calculation unit 143 calculates the number of morae in each vowel section on the basis of the time lengths at which the appearance frequency is a local maximum. The mora number calculation unit 143 according to the present embodiment takes the shortest of those time lengths as a reference time length (the time length corresponding to one mora) and calculates the number of morae in each vowel section from the ratio between the time length of the vowel section and the reference time length.

When it starts operating, the speech speed estimation apparatus 1 of the present embodiment performs a process of acquiring voice data from the sound collection device 2 and a process of estimating the speech speed in the acquired voice data and outputting the estimation result to the display device 3. The process of acquiring the voice data is performed by the voice acquisition unit 110 of the speech speed estimation apparatus 1.

The process of estimating the speech speed in the voice data and outputting the estimation result to the display device 3, on the other hand, is performed by the utterance section detection unit 120, the vowel section detection unit 130, the mora number determination unit 140, the speech speed calculation unit 150, and the output unit 160 of the speech speed estimation apparatus 1. For this process, the speech speed estimation apparatus 1 performs the processing illustrated in FIG. 2.

FIG. 2 is a flowchart illustrating processing performed by the speech speed estimation apparatus according to the first embodiment.

In the process of estimating the speech speed in the voice data and outputting the estimation result to the display device 3, the speech speed estimation apparatus 1 first detects an utterance section included in the voice data (step S1). Step S1 is performed by the utterance section detection unit 120. The utterance section detection unit 120 detects the utterance section (in other words, a section containing speech uttered by the person whose speech speed is to be estimated) according to a known detection method. For example, the utterance section detection unit 120 detects the utterance section by Voice Activity Detection (VAD).

Next, the speech speed estimation apparatus 1 detects the vowel sections included in the utterance section (step S2). Step S2 is performed by the vowel section detection unit 130. The vowel section detection unit 130 detects the vowel sections according to a known detection method. For example, on the basis of the time variation of the signal-to-noise ratio within the utterance section, the vowel section detection unit 130 detects each continuous section in which the signal-to-noise ratio stays at or above a predetermined threshold as one vowel section.

Next, the speech speed estimation apparatus 1 performs processing for determining the number of morae (number of vowels) in the detected vowel sections (steps S3 to S5). Steps S3 to S5 are performed by the mora number determination unit 140 of the speech speed estimation apparatus 1.

The mora number determination unit 140 first calculates a frequency distribution of the vowel section time lengths on the basis of the time length of each detected vowel section (step S3). Step S3 is performed by the frequency distribution calculation unit 141 of the mora number determination unit 140. The frequency distribution calculation unit 141 calculates the time length of each vowel section from the start time and end time of that section. The frequency distribution calculation unit 141 then counts the number of vowel sections for each time length to obtain the frequency distribution. In doing so, the frequency distribution calculation unit 141 excludes, for example, the last vowel section of each utterance section. The frequency distribution calculation unit 141 may also, for example, use only those vowel sections in an utterance section whose time length falls within a predetermined range. The predetermined range is set on the basis of, for example, the time lengths of single vowels and long vowels at a typical speech speed.
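As an illustration only, the frequency distribution of step S3 could be computed as in the following Python sketch. The function name `duration_histogram`, the millisecond units, the 20 ms bin width, and the parameter names are assumptions introduced here; the embodiment does not specify them.

```python
from collections import Counter

def duration_histogram(sections_ms, bin_ms=20, skip_last=True, valid_range_ms=None):
    """Count vowel sections per duration bin (hypothetical helper).

    sections_ms: list of (start, end) times in ms, in utterance order.
    bin_ms: histogram bin width in ms (assumed value).
    skip_last: drop the last vowel section of the utterance section.
    valid_range_ms: optional (low, high) duration filter in ms.
    Returns a Counter mapping bin index (duration // bin_ms) to count.
    """
    if skip_last:
        sections_ms = sections_ms[:-1]
    hist = Counter()
    for start, end in sections_ms:
        length = end - start  # time length of the vowel section
        if valid_range_ms is not None:
            low, high = valid_range_ms
            if not low <= length <= high:
                continue  # outside the predetermined range
        hist[length // bin_ms] += 1
    return hist

sections = [(0, 100), (150, 260), (300, 500), (600, 705), (800, 1200)]
hist = duration_histogram(sections)
```

The bin width trades resolution against noise: bins that are too narrow scatter sections of nearly equal length over several bins and blur the peaks used in the next step.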

Next, the mora number determination unit 140 detects the time lengths at which the appearance frequency reaches a peak (local maximum) in the frequency distribution calculated in step S3 (step S4). Step S4 is performed by the peak detection unit 142 of the mora number determination unit 140. For example, starting from the shortest time length in the frequency distribution, the peak detection unit 142 compares the appearance frequency of the time length under evaluation with the appearance frequencies of the time lengths immediately before and after it, and detects the time lengths at which the appearance frequency is a local maximum.
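The neighbor comparison of step S4 can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: `find_peaks` is a hypothetical name, the distribution is assumed to be a mapping from duration bin to count, and a strict comparison with both neighbors is assumed, so flat plateaus are not reported.

```python
def find_peaks(hist):
    """Return the duration bins whose count is a strict local maximum.

    hist: dict mapping duration bin -> count; missing bins count as 0.
    Bins are scanned from the shortest duration to the longest.
    """
    if not hist:
        return []
    peaks = []
    for k in range(min(hist), max(hist) + 1):
        here = hist.get(k, 0)
        # Compare with the bins immediately before and after.
        if here > hist.get(k - 1, 0) and here > hist.get(k + 1, 0):
            peaks.append(k)
    return peaks

peaks = find_peaks({4: 1, 5: 6, 6: 2, 9: 1, 10: 3, 11: 1})
```

Because the scan runs from short to long durations, the first element of the returned list is the shortest peak, which is the value the next step uses as the one-mora reference.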

Next, the mora number determination unit 140 calculates the number of morae in each vowel section on the basis of the minimum of the time lengths detected from the frequency distribution and the time length of the vowel section (step S5). Step S5 is performed by the mora number calculation unit 143 of the mora number determination unit 140. The mora number calculation unit 143 takes the minimum of the time lengths that are peaks in the frequency distribution as the reference time length and divides the time length of each vowel section by the reference time length. The reference time length corresponds to the time length of one mora (a single vowel) in the utterance section of the voice data. The mora number calculation unit 143 therefore takes the integer closest to the quotient as the number of morae in the vowel section. For example, when the time length of a vowel section is about twice the reference time length, the mora number calculation unit 143 sets the number of morae in that vowel section to two.
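The rounding rule of step S5 can be written in a few lines. The sketch below is an assumption-laden illustration: `mora_counts` is a hypothetical name, ties at exactly .5 are rounded up, and a floor of one mora per detected vowel section is added here although the text does not state it.

```python
def mora_counts(durations_ms, peak_durations_ms):
    """Estimate the number of morae in each vowel section.

    durations_ms: time length of each vowel section in ms.
    peak_durations_ms: durations at which the histogram peaks.
    """
    reference = min(peak_durations_ms)  # time length of one mora
    counts = []
    for d in durations_ms:
        nearest = int(d / reference + 0.5)  # nearest integer, half rounds up
        counts.append(max(1, nearest))      # assumed floor of one mora
    return counts

counts = mora_counts([100, 110, 200, 105, 310], [100, 200])
```

With a reference of 100 ms, the 200 ms section is counted as two morae and the 310 ms section as three, which is how a long vowel contributes more than one mora to the speech speed.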

After the mora number determination unit 140 completes steps S3 to S5, the speech speed estimation apparatus 1 calculates the speech speed of the utterance section on the basis of the number of morae in each vowel section calculated in step S5 and the time length of the utterance section (step S6). Step S6 is performed by the speech speed calculation unit 150. The speech speed calculation unit 150 calculates, as the speech speed, the total number of morae in the vowel sections included in the utterance section divided by the time length of the utterance section (morae per second).
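Step S6 is a single division. As a small worked sketch, with the hypothetical function name `speech_speed` and made-up input values:

```python
def speech_speed(mora_per_section, utterance_seconds):
    """Return the speech speed in morae per second.

    mora_per_section: number of morae of each vowel section in the utterance section.
    utterance_seconds: time length of the utterance section in seconds.
    """
    if utterance_seconds <= 0:
        raise ValueError("utterance length must be positive")
    return sum(mora_per_section) / utterance_seconds

# Five vowel sections totalling 8 morae in a 2-second utterance section.
speed = speech_speed([1, 1, 2, 1, 3], 2.0)
```

If the long vowels in this example were each counted as one mora, the total would be 5 and the result would drop to 2.5 morae per second, which is the underestimate the embodiment aims to avoid.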

Next, the speech speed estimation apparatus 1 outputs the calculated speech speed (step S7). Step S7 is performed by the output unit 160. For example, the output unit 160 generates display data that visualizes the speech speed calculated in step S6 and outputs the display data to the display device 3.

The speech speed estimation apparatus 1 repeats steps S1 to S7 described above. The speech speed estimation apparatus 1 may finish steps S2 to S7 for one utterance section detected in step S1 before detecting the next utterance section (step S1), or may pipeline all or part of steps S1 to S7.

FIG. 3 is a flowchart illustrating the process of detecting vowel sections.
As step S2 above (the process of detecting vowel sections), the speech speed estimation apparatus 1 of the present embodiment performs, for example, the processing illustrated in FIG. 3. The processing illustrated in FIG. 3 is performed by the vowel section detection unit 130 of the speech speed estimation apparatus 1.

First, the vowel section detection unit 130 calculates the signal-to-noise ratio SNR(t_m) at each time t_m (m = 1, 2, ..., M) in the utterance section of the voice data (step S201). The interval between successive times, t_m - t_{m-1}, is, for example, the time length of the processing unit (frame) used when calculating the power and noise power of the voice data. In the following description, the frame of the voice data associated with time t_m is referred to as the frame at time t_m.

In step S201, the vowel section detection unit 130 calculates the signal-to-noise ratio SNR(t_m) in the frame at time t_m by the following equation (1).

SNR(t_m) = P(t_m) / N(t_m) ... (1)

In equation (1), P(t_m) is the power of the voice data in the frame at time t_m, and N(t_m) is the noise power of the voice data in the frame at time t_m. The vowel section detection unit 130 calculates the power P(t_m) and the noise power N(t_m) in the frame at time t_m according to known calculation methods. For example, the noise power N(t_m) at time t_m is calculated by equation (2).
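For illustration, the per-frame calculation of step S201 might look like the sketch below. It assumes that the signal-to-noise ratio is the ratio of frame power to noise power and that frame power is the mean squared sample value; the text only refers to known calculation methods. The noise power is taken as an input because its estimation is not reproduced in this sketch. The names `frame_power` and `frame_snr` are hypothetical.

```python
def frame_power(frame):
    """Mean squared amplitude of one frame (assumed definition of P)."""
    return sum(x * x for x in frame) / len(frame)

def frame_snr(frames, noise_powers):
    """Return SNR(t_m) for each frame as P(t_m) / N(t_m).

    frames: list of frames, each a non-empty list of samples.
    noise_powers: noise power N(t_m) per frame, all positive.
    """
    return [frame_power(f) / n for f, n in zip(frames, noise_powers)]

snr = frame_snr([[1.0, -1.0], [2.0, 2.0]], [0.5, 0.5])
```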

In equation (2), N1(t_{m-1}) is a noise power that is updated on the basis of the difference between the power P(t_{m-1}) and the noise power N(t_{m-2}) of the voice data. TH_P is a determination threshold and COF is a forgetting factor. The values of the determination threshold TH_P and the forgetting factor COF may each be set as appropriate.

After step S201, the vowel section detection unit 130 sets the variable m, the variable i, and the signal-to-noise ratio SNR(t_0) to m = 1, i = 1, and SNR(t_0) = 0, respectively (step S202). The variable i is a value that identifies a vowel section.

Next, the vowel section detection unit 130 determines whether the signal-to-noise ratio SNR(t_{m-1}) at time t_{m-1} is smaller than a threshold TH_SNR and the signal-to-noise ratio SNR(t_m) at time t_m is equal to or greater than the threshold TH_SNR (step S203). That is, step S203 determines whether SNR(t_{m-1}) < TH_SNR and SNR(t_m) ≥ TH_SNR.

The threshold TH_SNR is a value for discriminating whether the voice in the utterance section is a vowel or a non-vowel (a consonant or the like). The signal-to-noise ratio of a vowel section in an utterance section of the voice data is larger than that of a non-vowel section (a consonant section or the like) in the same utterance section. For this reason, in the present embodiment, a section of the utterance section in which the signal-to-noise ratio is equal to or greater than the threshold TH_SNR is treated as a vowel section. The threshold TH_SNR is set on the basis of, for example, a statistic of the signal-to-noise ratio in vowel sections and a statistic of the signal-to-noise ratio in non-vowel sections.

When SNR(t_{m-1}) < TH_SNR and SNR(t_m) ≥ TH_SNR, the signal-to-noise ratio changes from a value smaller than the threshold TH_SNR to a value equal to or greater than the threshold TH_SNR between times t_{m-1} and t_m. That is, time t_{m-1} belongs to a non-vowel section and time t_m belongs to a vowel section. Therefore, when SNR(t_{m-1}) < TH_SNR and SNR(t_m) ≥ TH_SNR (step S203; YES), the vowel section detection unit 130 sets the start time Ts(i) of the i-th vowel section to time t_m (step S204). After step S204, the vowel section detection unit 130 determines whether the variable m is equal to or greater than its last value M in the utterance section (step S207).

On the other hand, when SNR(t_{m-1}) ≥ TH_SNR or SNR(t_m) < TH_SNR (step S203; NO), the vowel section detection unit 130 next determines whether SNR(t_{m-1}) ≥ TH_SNR and SNR(t_m) < TH_SNR (step S205).

When SNR(t_{m-1}) ≥ TH_SNR and SNR(t_m) < TH_SNR (step S205; YES), time t_{m-1} belongs to a vowel section and time t_m belongs to a non-vowel section. In this case, the vowel section detection unit 130 sets the end time Te(i) of the i-th vowel section to time t_{m-1} and then updates the variable i to i + 1 (step S206). After step S206, the vowel section detection unit 130 performs the determination of step S207.

In contrast, when the condition SNR(t_{m-1}) ≥ TH_SNR and SNR(t_m) < TH_SNR is not satisfied, the signal-to-noise ratios at times t_{m-1} and t_m are either both equal to or greater than the threshold TH_SNR or both smaller than the threshold TH_SNR. That is, times t_{m-1} and t_m either both belong to a non-vowel section or both belong to a vowel section. Therefore, when SNR(t_{m-1}) < TH_SNR or SNR(t_m) ≥ TH_SNR (step S205; NO), the vowel section detection unit 130 skips step S206 and performs the determination of step S207.

In step S207, as described above, the vowel section detection unit 130 determines whether the current value of the variable m is equal to or greater than its maximum value M in the utterance section. When m < M (step S207; NO), the vowel section detection unit 130 updates the variable m to m + 1 (step S208) and repeats the processing from step S203.

On the other hand, when m ≥ M (step S207; YES), the vowel section detection unit 130 outputs the information [Ts(i), Te(i)] indicating the vowel sections (step S209) and ends the process of detecting the vowel sections included in one utterance section.
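The loop of steps S202 to S209 can be sketched in Python as follows. This is an illustrative reading of the flowchart, not the embodiment's code: `detect_vowel_sections` and its parameters are hypothetical names, times are given in milliseconds, and a vowel section that is still open at the last frame is discarded, which is an assumption because the text does not describe that case.

```python
def detect_vowel_sections(snr, times, threshold):
    """Return [(Ts, Te), ...] for one utterance section.

    snr[k] and times[k] hold SNR(t_m) and t_m for frame m = k + 1.
    threshold: TH_SNR, assumed to be positive.
    """
    sections = []
    start = None
    prev = 0.0  # SNR(t_0) = 0, as set in step S202
    for k, cur in enumerate(snr):
        if prev < threshold <= cur:
            # S203 YES: non-vowel to vowel, so Ts(i) = t_m (S204)
            start = times[k]
        elif prev >= threshold > cur and start is not None:
            # S205 YES: vowel to non-vowel, so Te(i) = t_{m-1} (S206)
            sections.append((start, times[k - 1]))
            start = None
        prev = cur
    return sections  # S209: output the detected vowel sections

snr = [1, 5, 6, 2, 1, 7, 8, 9, 3]
times = [0, 10, 20, 30, 40, 50, 60, 70, 80]
sections = detect_vowel_sections(snr, times, threshold=4)
```

Each returned pair gives the start and end time of one vowel section, from which the time length used by the mora number determination follows as the difference of the two.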

The vowel section detection unit 130 according to the present embodiment performs steps S201 to S209 of FIG. 3 for each utterance section detected by the utterance section detection unit 120.

After the vowel section detection unit 130 detects the vowel sections included in an utterance section, the speech speed estimation apparatus 1 determines the number of morae (number of vowels) in each detected vowel section through steps S3 to S5. This processing is performed by the mora number determination unit 140 of the speech speed estimation apparatus 1. As described above, the mora number determination unit 140 according to the present embodiment determines the number of morae in each vowel section on the basis of the time length L(i) = Te(i) - Ts(i) of the vowel section.

FIG. 4 is a diagram explaining the relationship between single vowels and long vowels in voice data.
FIG. 4(a) shows, as an example of single vowels, the time lengths in the utterance section 401 of the word "haru" (はる) in voice data. For example, when "haru" is the season "spring" (春), the utterance time of "ha" and the utterance time of "ru" are substantially the same. That is, the time length ΔT1 (= t_m2 − t_1) of "ha" and the time length ΔT2 (= t_m3 − t_m2) of "ru" shown in FIG. 4(a) are substantially the same (ΔT1 ≈ ΔT2).

FIG. 4(b), on the other hand, shows, as an example of a long vowel, the time lengths in the utterance section 403 of the word "kyou" (きょう) in voice data. For example, when "kyou" is the date word "today" (今日), the utterance time of "kyo" and the utterance time of "u" are substantially the same. That is, the time length ΔT3 (= t_m5 − t_1) of "kyo" and the time length ΔT4 (= t_m6 − t_m5) of "u" shown in FIG. 4(b) are substantially the same (ΔT3 ≈ ΔT4). Furthermore, when the speaker is the same person speaking at a constant speed, the time length ΔT3 of "kyo" and the time length ΔT4 of "u" are substantially the same as the time length ΔT1 of "ha" and the time length ΔT2 of "ru" shown in FIG. 4(a).

When a person utters "haru", the sound changes from a vowel to a consonant (non-vowel) at the second syllable "ru", as in the phonetic notation 402 shown in FIG. 4(a). Therefore, in the utterance section 401 of "haru" in the voice data, the time t_m1, which lies between time t_1 and time t_m2 while "ha" is being uttered, is the start time Ts(1) of a vowel section, and the time t_m2 at which the utterance of "ha" ends is the end time Te(1) of that vowel section. Likewise, the start time Ts(2) of a vowel section lies between time t_m2 and time t_m3 while "ru" is being uttered, and the time t_m3 at which the utterance of "ru" ends is the end time Te(2) of that vowel section. Two vowel sections are therefore detected from the utterance section 401 of "haru".

When a person utters "kyou", on the other hand, the word is often pronounced "KYOO", as in the phonetic notation 404 shown in FIG. 4(b). That is, instead of pronouncing the second syllable as "u (U)", the speaker often prolongs the vowel "O" of "kyo (KYO)" as a long vowel. Therefore, in the utterance section 403 of "kyou" in the voice data, the time t_m4, which lies between time t_1 and time t_m5 while "kyo" is being uttered, is the start time Ts of the vowel section, and the time t_m6 at which the utterance of "u" ends is the end time Te of the vowel section. Only one vowel section is therefore detected from the utterance section 403 of "kyou".

Thus, although the utterance section 401 of "haru" and the utterance section 403 of "kyou" both consist of two syllables and have substantially the same time length, the number of vowel sections detected differs: two for "haru" and one for "kyou". Consequently, when the speech speed is calculated by dividing the number of vowels by the time length of the utterance section, the speech speed obtained for the utterance section 403 of "kyou" differs from that obtained for the utterance section 401 of "haru" by a factor of about two (one detected vowel section against two over the same duration), even though both words are spoken at the same pace. To prevent such speech speed errors caused by vowel lengthening, the speech speed estimation apparatus 1 according to the present embodiment determines the number of vowels (mora count) of each vowel section based on the frequency distribution of the time lengths of the vowel sections.

As in the phonetic notation 402 of FIG. 4(a), in a syllable that includes a consonant, such as "ha (HA)" or "ru (RU)", the vowel section accounts for 70% or more of the time length of the whole syllable. For example, in the portion of the voice data where "ha" is uttered, the time length of the vowel section, L(1) = Te(1) − Ts(1), accounts for 70% or more of the time length ΔT1 of that portion. Similarly, in the portion where "ru" is uttered, the time length of the vowel section, L(2) = Te(2) − Ts(2), accounts for 70% or more of the time length ΔT2.

On the other hand, when a vowel is lengthened across a syllable that includes a consonant and a following vowel-only syllable, as in the phonetic notation 404 of FIG. 4(b), the time length of the vowel section is as follows. First, the vowel portion of the syllable "kyo", which includes a consonant, accounts for 70% or more of the time length ΔT3 of that syllable. Second, the vowel portion of the vowel-only syllable "u" equals the time length ΔT4 of that syllable. The time length L of the vowel section included in the utterance section 403 of "kyou" therefore satisfies {(0.7 × ΔT3) + ΔT4} ≤ L < (ΔT3 + ΔT4). Hence the time length L of the vowel section in the utterance section 403 of "kyou" is about twice the single-vowel time lengths L(1) and L(2).

In this way, the time length of a lengthened vowel section is close to an integer multiple of the time length of a single vowel. As a result, the frequency distribution of the time lengths of the vowel sections included in an utterance section exhibits features that correspond to the time length of a single vowel.

FIG. 5 is a diagram showing an example of the frequency distribution of vowel sections.
A frequency distribution obtained by counting, for each time length, the vowel sections included in one utterance section shows, for example, four peaks P1, P2, P3, and P4, as in FIG. 5. In the frequency distribution of FIG. 5, the time length L is divided into bins of 0.02 seconds, and the frequency of each bin is the number of vowel sections whose time length falls within that bin. For example, the frequency at time length L = 0.2 seconds is the number of vowel sections in the utterance section whose time length (Te − Ts) satisfies 0.19 ≤ (Te − Ts) < 0.21.
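The binning described for FIG. 5 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `duration_histogram` and the small floating-point guard are assumptions; only the 0.02-second bin width and the centred bin boundaries (0.19 ≤ L < 0.21 for the 0.2-second bin) come from the text.

```python
import math
from collections import Counter

BIN_WIDTH = 0.02  # seconds, as stated for FIG. 5


def duration_histogram(durations, bin_width=BIN_WIDTH):
    """Count vowel sections per time-length bin.

    Bins are centred on multiples of bin_width, so the bin labelled
    0.20 s covers 0.19 <= duration < 0.21.
    Returns a dict {bin_centre_in_seconds: count}.
    """
    counts = Counter()
    for d in durations:
        # Shifting by half a bin makes floor() return the centred bin index.
        # The tiny epsilon (an assumption of this sketch) guards against
        # floating-point error for values lying exactly on a bin edge.
        index = math.floor(d / bin_width + 0.5 + 1e-9)
        counts[round(index * bin_width, 2)] += 1
    return dict(counts)
```

Here `durations` would be the list of L(i) = Te(i) − Ts(i) values for one utterance section.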

The time length of a lengthened vowel section, or of a vowel section in which several different single vowels follow one another, is longer than the time length of a single-vowel section. In addition, Japanese generally has the property that each mora (single vowel) in an utterance is spoken with almost the same length (mora isochrony), so that, as described above, the time length of single-vowel sections within one utterance section is substantially constant. The peak P1, which has the shortest time length among the four peaks P1 to P4 of the frequency distribution, can therefore be regarded as indicating the time length of a single-vowel section. Moreover, since the time length of a lengthened vowel section is close to an integer multiple of the single-vowel time length, the time lengths at which the peaks P2 to P4 appear are close to integer multiples of the time length TP1 of the peak P1. In the present embodiment, the time length TP1 of the shortest-time-length peak P1 is therefore used as the reference time length, and the number of vowels (mora count) VN(i) of each vowel section is calculated by equation (3).

In equation (3), L(i) is the time length of the i-th vowel section.
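Equation (3) itself is not reproduced in this text, so the following is only a sketch of the idea it serves: pick the shortest-time-length peak of the histogram as the reference TP1, then convert each section's length to a mora count from its ratio to TP1. Rounding the ratio to the nearest integer (with a minimum of 1) is an assumption of this sketch, as are the function names and the simple local-maximum peak test.

```python
def shortest_peak_index(counts):
    """Return the index of the first local maximum in a list of bin counts.

    counts[k] is the frequency of the k-th bin, ordered from short to long
    time lengths. A real system would likely smooth the histogram or ignore
    very small peaks first; this sketch does not.
    """
    for k, c in enumerate(counts):
        left = counts[k - 1] if k > 0 else 0
        right = counts[k + 1] if k + 1 < len(counts) else 0
        if c > 0 and c >= left and c > right:
            return k
    raise ValueError("histogram contains no peak")


def mora_counts(durations, tp1):
    """Assumed stand-in for equation (3): nearest-integer ratio L(i) / TP1."""
    return [max(1, round(d / tp1)) for d in durations]


# Hypothetical histogram with 0.02 s bins and hypothetical section lengths.
counts = [0, 0, 0, 1, 4, 6, 3, 0, 1, 3, 2, 0, 0, 1, 2, 1]
tp1 = shortest_peak_index(counts) * 0.02          # reference time length
vn = mora_counts([0.21, 0.31, 0.19, 0.10, 0.11, 0.09, 0.10], tp1)
```

With these made-up values the reference length is 0.1 seconds and the sections map to 2, 3, 2, 1, 1, 1, and 1 morae.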

In this way, the speech speed estimation apparatus 1 of the present embodiment estimates the reference time length TP1 of a vowel section (the time length of a single vowel) from the frequency distribution of the vowel section time lengths, and determines the number of vowels of each vowel section from the ratio of that section's time length to the reference time length TP1. The speech speed estimation apparatus 1 can therefore avoid miscounting the morae (vowels) of a lengthened vowel section.

FIG. 6 is a diagram explaining the utterance content of an utterance section and the mora count calculated for each vowel section.

The character string 411 in FIG. 6 shows the utterance content of an utterance section of the voice data as text. The phonetic notation 412 in FIG. 6 shows each syllable of the character string 411 as vowels and consonants. When the phrase shown in the character string 411, "kyou wa ii youki desu ne" (きょうはいいようきですね), is spoken naturally, the vowels in "kyou", "ii", and "you" are often lengthened, as shown in the phonetic notation 412. When the process of detecting the vowel sections included in the utterance section is performed, seven vowel sections are therefore detected, as shown below the phonetic notation 412. Here, the vowel section detected from the "wa ii" (はいい) portion of the character string 411 is a section in which a section with the vowel "A" and a section with the vowel "I" are contiguous. Consequently, if a lengthened vowel section is counted as one vowel, the vowel section detected from the "wa ii" portion counts as two vowels and every other vowel section counts as one. The mora count (number of vowels) of the utterance section (time t_1 to t_M) then comes to 8, which differs from the actual mora count of 11. Because the mora count used to calculate the speech speed differs from the actual mora count of the voice data, it is difficult to estimate the speech speed accurately.

In contrast, the mora count calculation method according to the present embodiment first calculates the reference time length TP1 from the frequency distribution of the time lengths of the vowel sections. In the example of FIG. 6, the reference time length TP1 is substantially the same as each of the time lengths L(4), L(5), L(6), and L(7) of the vowel sections detected from the "ki desu ne" portion of the character string 411. Calculating the mora count VN(i) of each vowel section by equation (3) with this reference time length TP1 gives the results shown in FIG. 6. The time length L(1) of the first vowel section and the time length L(3) of the third vowel section are each about twice the reference time length TP1, so the mora counts VN(1) and VN(3) are each 2. The time length L(2) of the second vowel section is about three times the reference time length TP1, so the mora count VN(2) is 3. The mora counts VN(4) to VN(7) of the remaining vowel sections are each 1. With the mora count calculation method according to the present embodiment, the mora count of the utterance section of the character string 411 is therefore 11. That is, the present embodiment makes it possible to calculate the mora count of each vowel section correctly, and hence to estimate the speech speed of the utterance section correctly.

The speech speed SR of an utterance section is calculated, for example, by equation (4).

In equation (4), STs and STe are the start time and the end time of the utterance section, respectively.
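Equation (4) is not reproduced in this text. Judging from the definitions of VN(i), STs, and STe, and from the unit of mora per second used for SR, it presumably has the following form, where N (a symbol introduced here for illustration) is the number of vowel sections detected in the utterance section:

```latex
SR = \frac{\sum_{i=1}^{N} VN(i)}{STe - STs}
```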

If the time length (STe − STs) of the utterance section is 10 seconds, the speech speed calculated from the number of vowel sections included in the utterance section is about 0.7 (mora/second). In contrast, the speech speed SR obtained when the mora count VN(i) of each vowel section is determined from the frequency distribution of the vowel section time lengths is 1.1 (mora/second). Thus, the present embodiment makes it possible to calculate the mora count (number of vowels) of an utterance section correctly and to calculate the speech speed of the utterance section accurately.
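The two figures above can be checked with a short sketch. The function name `speech_rate` is hypothetical, and the rate is assumed to be the total mora count divided by the duration of the utterance section, which is consistent with the mora/second values quoted in the text.

```python
def speech_rate(mora_counts, start_s, end_s):
    """Return morae per second for one utterance section.

    mora_counts: one mora count VN(i) per detected vowel section.
    start_s, end_s: start time STs and end time STe in seconds.
    """
    duration = end_s - start_s
    if duration <= 0:
        raise ValueError("end time must be later than start time")
    return sum(mora_counts) / duration


# Counting each of the seven detected vowel sections as one mora.
naive = speech_rate([1, 1, 1, 1, 1, 1, 1], 0.0, 10.0)

# Using the per-section mora counts of FIG. 6 (2, 3, 2, 1, 1, 1, 1).
corrected = speech_rate([2, 3, 2, 1, 1, 1, 1], 0.0, 10.0)
```

The first call corresponds to the 0.7 mora/second figure and the second to the 1.1 mora/second figure.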

After calculating the speech speed, the speech speed estimation apparatus 1 according to the present embodiment generates display data that visualizes the calculated speech speed and outputs the display data to the display device 3.

FIG. 7 is a diagram showing an output example of the speech speed calculation result.
FIG. 7 shows, as an output example of the speech speed calculation result, a graph 421 that presents the change over time of the speech speed calculated for each utterance section. The left end of the graph 421 shows the speech speed values together with information indicating whether the speech speed is appropriate. In the example of FIG. 7, a speech speed of 6.0 to 8.0 mora/second is treated as appropriate. The horizontal axis of the graph 421 is the conversation time. The graph 421 displays a curve 501 indicating the speech speed SR of each utterance section detected from the voice data. The curve 501 in the graph 421 displayed on the display device 3 is updated each time the speech speed estimation apparatus 1 detects an utterance section and its vowel sections and calculates the speech speed. The curve 501 may show, for example, the speech speed of all utterance sections in the conversation, or only the speech speed within the most recent predetermined period of the conversation.

Note that the graph 421 of FIG. 7 is merely one example of how the speech speed calculation result can be visualized. The result is not limited to a presentation of the change in speech speed over time such as the graph 421 and may be displayed in other ways. For example, the result may be displayed in the manner of a level meter that shows only the latest speech speed calculated by the speech speed estimation apparatus 1. Alternatively, when the result is displayed on the display device 3, a display that alerts the speaker may be shown only when the calculated speech speed is at or above a predetermined threshold.
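A display layer might classify each calculated speech speed before drawing it. The sketch below uses the 6.0 to 8.0 mora/second range given for FIG. 7 as defaults; the function name and the labels "slow", "appropriate", and "fast" are hypothetical, as is the treatment of the range endpoints as appropriate.

```python
def classify_rate(rate, low=6.0, high=8.0):
    """Label a speech speed (mora/second) relative to an appropriate range.

    The default range follows the FIG. 7 example; the labels are
    illustrative only.
    """
    if rate < low:
        return "slow"
    if rate > high:
        return "fast"
    return "appropriate"


def needs_alert(rate, threshold=8.0):
    """Alert only when the speech speed is at or above a threshold."""
    return rate >= threshold
```

A level-meter style display would call `classify_rate` on the latest value only, while a graph such as the one in FIG. 7 would apply it to every utterance section.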

As described above, the speech speed estimation apparatus 1 according to the present embodiment calculates the mora count (number of vowels) of each vowel section from the ratio between the time length of that vowel section and the single-vowel time length detected from the frequency distribution of the time lengths of the vowel sections included in the utterance section. The speech speed estimation apparatus 1 can therefore avoid miscounting the morae of lengthened vowel sections and calculate the correct speech speed of the utterance section. This allows the speech speed estimation apparatus 1, for example, to present the correct change in speech speed over time to the user whose speech speed is being estimated and to guide that user toward speaking at an appropriate speed.

Note that the flowcharts of FIGS. 2 and 3 are merely examples of the processing performed by the speech speed estimation apparatus 1 according to the present embodiment. The processing is not limited to what is shown in FIGS. 2 and 3 and may be modified as appropriate. For example, in the process of detecting the vowel sections included in an utterance section, the vowel sections may be detected based on the waveform autocorrelation of the voice data or on formant frequencies instead of the signal-to-noise ratio.

When the frequency distribution of the vowel section time lengths is calculated, it may be computed, for example, from only those detected vowel sections whose time length falls within a predetermined range. The predetermined range is set, for example, based on single-vowel time lengths obtained in advance by statistical processing.

The reference time length used to calculate the mora count of each vowel section is not limited to the time length TP1 of the peak with the shortest time length among the peaks of the frequency distribution. It may instead be, for example, the average of the time intervals between adjacent peaks in the frequency distribution.

FIG. 8 is a diagram showing another example of the mora count calculation method.
FIG. 8 shows an example of the frequency distribution of the time lengths of the vowel sections included in an utterance section. The frequency distribution of FIG. 8 is calculated in the same way as that of FIG. 5. That is, the horizontal axis (time length L) is divided into bins of 0.02 seconds, and the frequency of each bin is the number of vowel sections whose time length falls within that bin. For example, the frequency at time length L = 0.2 seconds is the number of vowel sections in the utterance section whose time length (Te − Ts) satisfies 0.19 ≤ (Te − Ts) < 0.21.

From the frequency distribution of FIG. 8, a first peak P1, a second peak P2, a third peak P3, and a fourth peak P4 are detected in order of increasing time length. The time length TP1 of the first peak P1 substantially matches the time length of a vowel section whose mora count is 1 (that is, the time length of a single vowel). The time length TP2 of the second peak P2, the time length TP3 of the third peak P3, and the time length TP4 of the fourth peak P4 substantially match the time lengths of vowel sections whose mora counts are 2, 3, and 4, respectively.

When four peaks are detected in the frequency distribution of the vowel section time lengths in this way, the reference time length used to calculate the mora count of each vowel section may be, for example, the average time length d_ave calculated by equation (5).

In equation (5), d_j (j = 1, 2, 3) is the difference between the time length TP_{j+1} of the (j+1)-th peak P_{j+1} and the time length TP_j of the j-th peak P_j.
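Equation (5) is not reproduced in this text. Given that d_ave is described as the average of the intervals between adjacent peaks and that three intervals d_1, d_2, and d_3 are defined, it presumably has the following form; the final equality is simply the telescoping sum of the three differences:

```latex
d_j = TP_{j+1} - TP_j \quad (j = 1, 2, 3), \qquad
d_{\mathrm{ave}} = \frac{1}{3}\sum_{j=1}^{3} d_j = \frac{TP_4 - TP_1}{3}
```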

When the average time length d_ave calculated by equation (5) is used as the reference time length, the mora count VN(i) of each vowel section is calculated by equation (6).

The time length TP1 of the first peak P1 in the frequency distribution is determined by the time length of the vowel section in a vowel-only syllable and by the time length of the vowel section in a syllable that includes a consonant, as shown in FIG. 4. As described above, the vowel section of a syllable that includes a consonant accounts for 70% or more of the time length of that syllable but is shorter than the full syllable. Accordingly, as shown in FIG. 4(b), when the vowel of a syllable that includes a consonant is lengthened by one syllable, the time length of the resulting vowel section is shorter than the time length of two syllables.

In contrast, each of the time differences d1, d2, and d3 in equation (5) substantially matches the time length of one syllable. Compared with calculating the mora count VN(i) of each vowel section from the time length TP1 of the first peak P1 alone, this makes it possible, for example, to calculate the mora count VN(i) correctly for time lengths near the boundaries of the determination conditions.
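The peak-spacing variant can be sketched as follows. Equation (6) is not reproduced in this text, so the mapping used here is an assumption: a section of n morae is taken to be roughly TP1 plus (n − 1) syllable lengths d_ave, which follows the reasoning above that TP1 is somewhat shorter than one syllable while each peak interval is about one syllable. The function names and the sample values are hypothetical.

```python
def average_peak_spacing(peak_durations):
    """Average interval between adjacent histogram peaks, in seconds."""
    peaks = sorted(peak_durations)
    if len(peaks) < 2:
        raise ValueError("at least two peaks are required")
    gaps = [later - earlier for earlier, later in zip(peaks, peaks[1:])]
    return sum(gaps) / len(gaps)


def mora_counts_from_spacing(durations, tp1, d_ave):
    """Assumed stand-in for equation (6).

    Treats a section of length L as 1 mora of length tp1 plus
    round((L - tp1) / d_ave) additional morae, with a minimum of 1.
    """
    return [max(1, 1 + round((d - tp1) / d_ave)) for d in durations]


# Hypothetical peak positions TP1..TP4 and section time lengths (seconds).
peaks = [0.10, 0.24, 0.38, 0.52]
d_ave = average_peak_spacing(peaks)
vn = mora_counts_from_spacing([0.10, 0.25, 0.37, 0.50], peaks[0], d_ave)
```

With these made-up values d_ave is 0.14 seconds and the four sections are assigned 1, 2, 3, and 4 morae, whereas dividing the longer sections by TP1 alone would overestimate them.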

The speech speed estimation apparatus 1 according to the present embodiment can be applied, for example, to estimating a speaker's speech speed in a call system that uses a network such as a telephone network.

FIG. 9 is a diagram showing a first application example of the speech speed estimation apparatus.
As shown in FIG. 9, the call system 10 is used for a call between a first speaker 9A and a second speaker 9B. The first speaker 9A uses, as a call device (telephone), an information processing device 12 that includes the speech speed estimation apparatus 1 and a call processing device 11 and to which the sound collection device 2, the display device 3, and a receiver 13 are connected.

When the first speaker 9A makes a call using the call device (information processing device 12), the call processing device 11 is connected, via a first exchange 14A and a network 15, to the call device (telephone) used by the other party. When the other party is the second speaker 9B, the call processing device 11 is connected, via the first exchange 14A, the network 15, and a second exchange 14B, to the telephone 16 used by the second speaker 9B.

The call processing device 11 performs a process of transmitting voice data that includes the voice of the first speaker 9A, acquired from the sound collection device 2, to the telephone 16, and a process of outputting voice data received from the telephone 16 to the receiver 13. Meanwhile, the call processing unit 1610 of the telephone 16 performs a process of transmitting voice data that includes the voice of the second speaker 9B, acquired from the sound collection device 1620, to the call processing device 11 (information processing device 12), and a process of outputting voice data received from the call processing device 11 to the receiver 1630.

While the first speaker 9A and the second speaker 9B are talking, the speech speed estimation apparatus 1 included in the information processing device 12 acquires voice data that includes the voice of the first speaker 9A from the sound collection device 2 and calculates the speech speed of the first speaker 9A according to the flowcharts of FIGS. 2 and 3. The speech speed estimation apparatus 1 also generates display data that visualizes the calculated speech speed and outputs it to the display device 3. The first speaker 9A can therefore check, during the call with the second speaker 9B, whether his or her own speech speed is appropriate and adjust it. This makes it possible, for example, to guide the first speaker 9A toward a speech speed at which the second speaker 9B can easily follow what is being said.

Note that the information processing device 12 is not limited to a device containing two separate devices, namely the speech speed estimation apparatus 1 and the call processing device 11. It may be a single device that includes a first processing unit responsible for the processing performed by the speech speed estimation apparatus 1 and a second processing unit responsible for the processing performed by the call processing device 11.

The call system 10 may also be, for example, a telephone conference system or a video conference system that uses the network 15.

FIG. 10 is a diagram showing a second application example of the speech speed estimation apparatus.
The call system 10 shown in FIG. 10 is used for a call between the first speaker 9A and the second speaker 9B. The first speaker 9A uses, as a call device (telephone), an information processing device 12 that includes a call processing unit 21 corresponding to the call processing device 11 and to which the sound collection device 2 and the receiver 13 are connected. In this configuration, the speech speed estimation apparatus 1 is connected to the information processing device 12, and the display device 3 is connected to the speech speed estimation apparatus 1.

When the first speaker 9A makes a call using the call device, the call processing unit 21 of the information processing device 12 is connected, via the first exchange 14A and the network 15, to the call device (telephone) used by the other party. When the other party is the second speaker 9B, the call processing unit 21 of the information processing device 12 is connected to the telephone 16 used by the second speaker 9B via the first exchange 14A, the network 15, and the second exchange 14B.

The call processing unit 21 of the information processing device 12 transmits the voice data containing the voice of the first speaker 9A, acquired from the sound collection device 2, to the telephone 16, and outputs the voice data received from the telephone 16 to the receiver 13. The call processing unit 21 also outputs the voice data acquired from the sound collection device 2 to the speech speed estimation device 1. Meanwhile, the call processing unit 1610 of the telephone 16 transmits the voice data containing the voice of the second speaker 9B, acquired from the sound collection device 1620, to the information processing device 12, and outputs the voice data received from the information processing device 12 to the receiver 1630.

While the first speaker 9A and the second speaker 9B are on a call, the speech speed estimation device 1 acquires, via the information processing device 12, voice data containing the voice of the first speaker 9A and calculates the speech speed of the first speaker 9A according to the flowcharts of FIGS. 2 and 3. The speech speed estimation device 1 also generates display data that visualizes the calculated speech speed and outputs it to the display device 3. The first speaker 9A can thus check, during the call with the second speaker 9B, whether his or her own speech speed is appropriate, and adjust it.

The display device 3 in the call systems 10 of FIGS. 9 and 10 need not be installed near the first speaker 9A. It may, for example, be installed near a third party who is at a different location from the first speaker 9A. The call system 10 may also include a plurality of display devices 3.

In addition, the speech speed of the first speaker 9A calculated (estimated) by the speech speed estimation device 1 may be stored, for example, in a storage unit (not shown) of the speech speed estimation device 1 or in another device.

Note that the speech speed estimation device 1 described in this embodiment is merely one example of a voice processing device that controls an output signal corresponding to input voice data on the basis of the mora number of each of the plurality of vowel sections determined by the mora number determination unit 140. That is, the mora numbers determined by the mora number determination unit 140 according to this embodiment can be used not only to control display data (an output signal) that presents the speech speed of the input voice data, but also to control other output signals corresponding to the input voice data.

Furthermore, the functional configuration of the speech speed estimation device 1 according to this embodiment is not limited to the configuration shown in FIG. 1 and can be changed as appropriate according to the processing performed by the speech speed estimation device 1. For example, steps S3, S4, and S5 in FIG. 2 can be replaced with steps S3', S4', and S5', respectively.
(Step S3') A process of calculating the time length of each of the plurality of vowel sections.
(Step S4') A process of calculating the frequency distribution of the time lengths of the plurality of vowel sections and identifying the minimum of those time lengths.
(Step S5') A process of calculating the mora number corresponding to the time length of each of the plurality of vowel sections on the basis of the identified minimum time length and the frequency distribution.

When the speech speed estimation device 1 performs steps S3' to S5', the mora number determination unit 140 of the speech speed estimation device 1 may include a time length calculation unit, a minimum value specifying unit, and a mora number calculation unit. In this case, the time length calculation unit performs step S3', the minimum value specifying unit performs step S4', and the mora number calculation unit performs step S5'.
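
Steps S3' to S5' can be illustrated with a minimal Python sketch. The text here does not say how the minimum time length and the frequency distribution are combined, so the rule below is an assumption made only for illustration: the one-mora duration is taken as the centre of the most frequent histogram bin lying below twice the minimum duration, and each mora number is the rounded ratio of the section length to that duration. The function name `estimate_mora_counts` and the parameter `bin_width_ms` are hypothetical.

```python
from collections import Counter

def estimate_mora_counts(intervals_ms, bin_width_ms=20):
    """intervals_ms: list of (Ts, Te) vowel sections in integer milliseconds."""
    # Step S3': time length of each vowel section.
    durations = [te - ts for ts, te in intervals_ms]
    # Step S4': frequency distribution (histogram) of the time lengths ...
    hist = Counter(d // bin_width_ms for d in durations)
    # ... and the minimum time length.
    shortest = min(durations)
    # Step S5' (assumed rule): treat the most frequent bin below twice the
    # minimum duration as the one-mora bin; ties go to the shorter bin.
    candidates = {b: n for b, n in hist.items() if b * bin_width_ms < 2 * shortest}
    one_mora_bin = max(candidates, key=lambda b: (candidates[b], -b))
    one_mora_ms = (one_mora_bin + 0.5) * bin_width_ms
    # Mora number of each section: rounded ratio, never less than 1.
    return [max(1, round(d / one_mora_ms)) for d in durations]

counts = estimate_mora_counts([(0, 100), (150, 260), (300, 500), (600, 690)])
```

In this example the 200 ms section is counted as two moras, while the sections of about 100 ms are counted as one mora each.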

[Second Embodiment]
FIG. 11 is a diagram illustrating the functional configuration of the speech speed estimation device according to the second embodiment.

As shown in FIG. 11, the speech speed estimation device 1 according to this embodiment includes a voice acquisition unit 110, an utterance section detection unit 120, a vowel section detection unit 130, a mora number determination unit 140, a speech speed calculation unit 150, and an output unit 160. Although omitted from FIG. 11, the speech speed estimation device 1 also includes a storage unit that stores various kinds of information, including voice data.

The voice acquisition unit 110 acquires voice data from the sound collection device 2.
The utterance section detection unit 120 detects utterance sections in the voice data. An utterance section is a section that contains voice uttered by the person whose speech speed is to be estimated.

The vowel section detection unit 130 detects the vowel sections contained in the utterance sections of the voice data. Note that the vowel section detection unit 130 in the speech speed estimation device 1 of this embodiment detects vowel sections from the entire voice data, not from the utterance sections detected by the utterance section detection unit 120.

The mora number determination unit 140 calculates the mora number of each vowel section on the basis of the time length of each of the plurality of vowel sections detected from the voice data. The mora number determination unit 140 includes a frequency distribution calculation unit 141, a peak detection unit 142, and a mora number calculation unit 143.

The speech speed calculation unit 150 calculates the speech speed in the voice data on the basis of the time length of an utterance section and the mora numbers of the vowel sections contained in that utterance section.
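
As a rough illustration of this calculation, the following Python sketch divides the total mora number of the vowel sections that lie inside an utterance section by the time length of that utterance section. The exact formula is not given in this passage, so moras per second is an assumption, and the function name `speech_speed` is hypothetical.

```python
def speech_speed(utterance, vowel_intervals, mora_counts):
    """Return moras per second for one utterance section.

    utterance: (start, end) in seconds.
    vowel_intervals: list of (Ts, Te) in seconds.
    mora_counts: mora number of each vowel section, in the same order.
    """
    start, end = utterance
    # Sum the mora numbers of the vowel sections contained in the utterance.
    total = sum(
        n
        for (ts, te), n in zip(vowel_intervals, mora_counts)
        if start <= ts and te <= end
    )
    return total / (end - start)

rate = speech_speed((0.0, 2.0), [(0.1, 0.2), (0.5, 0.7), (2.5, 2.6)], [1, 2, 1])
```

Here the third vowel section lies outside the utterance section and is ignored, so the result is 3 moras over 2 seconds.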

Meanwhile, the process of estimating the speech speed in the voice data and outputting the estimation result to the display device 3 is performed by the utterance section detection unit 120, the vowel section detection unit 130, the mora number determination unit 140, the speech speed calculation unit 150, and the output unit 160 of the speech speed estimation device 1. As this process, the speech speed estimation device 1 of this embodiment performs steps S1 to S7 of FIG. 2. Note that the vowel section detection unit 130 of this embodiment detects vowel sections from the entire voice data without referring to the utterance sections detected by the utterance section detection unit 120. That is, as the process of detecting vowel sections in step S2, the vowel section detection unit 130 performs, for example, the process shown in FIG. 12.

FIG. 12 is a flowchart illustrating the process of detecting vowel sections according to the second embodiment.

The vowel section detection unit 130 in the speech speed estimation device 1 of this embodiment divides the voice data into a plurality of sections and performs steps S211 to S219 shown in FIG. 12 for each section. The vowel section detection unit 130 first calculates the waveform autocorrelation AC(t_m) of the voice data at each time t_m (m = 1, 2, ..., M) in the section being processed (step S211). The interval between successive times (t_m − t_{m-1}) is, for example, the time length of the processing unit (frame) used when processing the voice data. In the following description, the frame of the voice data associated with time t_m is referred to as the frame at time t_m.

In step S211, the vowel section detection unit 130 calculates the waveform autocorrelation AC(t) at time t (each time t_m) using the following equation (7).

In equation (7), N is the calculation width (number of samples) of the waveform autocorrelation, for example N = 500. SL is the lower limit (number of samples) of the search range of the waveform autocorrelation, for example SL = 20. SH is the upper limit (number of samples) of the search range of the waveform autocorrelation, for example SH = 120.
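
Equation (7) itself is not reproduced in this text. As a hedged illustration only, the Python sketch below assumes a common form of waveform autocorrelation: the maximum normalized correlation between an N-sample window starting at sample t and the same window shifted by a lag between SL and SH. The function name `waveform_autocorrelation` and this specific normalization are assumptions, not the patent's definition.

```python
import math

def waveform_autocorrelation(x, t, n=500, sl=20, sh=120):
    """Assumed form of AC(t): best normalized correlation over lags sl..sh.

    x: sequence of samples; t: index of the first sample of the window.
    The caller should ensure len(x) >= t + sh + n.
    """
    window = x[t:t + n]
    energy0 = sum(v * v for v in window)
    best = 0.0
    for lag in range(sl, sh + 1):
        shifted = x[t + lag:t + lag + n]
        energy1 = sum(v * v for v in shifted)
        if energy0 == 0.0 or energy1 == 0.0:
            continue  # silent window: correlation is undefined, skip this lag
        r = sum(a * b for a, b in zip(window, shifted)) / math.sqrt(energy0 * energy1)
        best = max(best, r)
    return best

# A periodic signal (period 50 samples) is strongly self-similar at lag 50.
periodic = [math.sin(2 * math.pi * i / 50) for i in range(700)]
ac_periodic = waveform_autocorrelation(periodic, 0)
```

With the example values N = 500, SL = 20, and SH = 120, a voiced (periodic) waveform yields a value close to 1, which is consistent with the later statement that vowel sections have a larger waveform autocorrelation than non-vowel sections.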

After step S211, the vowel section detection unit 130 sets the variable m, the variable i, and the waveform autocorrelation AC(t_0) to m = 1, i = 1, and AC(t_0) = 0, respectively (step S212). The variable i is a value that identifies a vowel section.

Next, the vowel section detection unit 130 determines whether the waveform autocorrelation AC(t_{m-1}) at time t_{m-1} is smaller than a threshold TH_AC and the waveform autocorrelation AC(t_m) at time t_m is equal to or greater than the threshold TH_AC (step S213). That is, step S213 determines whether AC(t_{m-1}) < TH_AC and AC(t_m) ≥ TH_AC.

The threshold TH_AC is a value used to determine whether the voice contained in the voice data is a vowel or a non-vowel (a consonant or the like). When the waveform autocorrelation is calculated by equation (7), the waveform autocorrelation of a vowel section in the voice data is larger than that of a non-vowel section (a consonant section or the like). In this embodiment, therefore, a section of the voice data in which the waveform autocorrelation is equal to or greater than the threshold TH_AC is treated as a vowel section. The threshold TH_AC is set, for example, on the basis of a statistical value of the waveform autocorrelation in vowel sections and a statistical value of the waveform autocorrelation in non-vowel sections.

When AC(t_{m-1}) < TH_AC and AC(t_m) ≥ TH_AC, the waveform autocorrelation changes from a value smaller than the threshold TH_AC to a value equal to or greater than the threshold TH_AC between times t_{m-1} and t_m. That is, the frame at time t_{m-1} belongs to a non-vowel section and the frame at time t_m belongs to a vowel section. Accordingly, when AC(t_{m-1}) < TH_AC and AC(t_m) ≥ TH_AC (step S213; YES), the vowel section detection unit 130 sets the start time Ts(i) of the i-th vowel section to time t_m (step S214). After step S214, the vowel section detection unit 130 determines whether the variable m is equal to or greater than M, the last value in the section of the voice data for which the waveform autocorrelation was calculated (step S217).

On the other hand, when AC(t_{m-1}) ≥ TH_AC or AC(t_m) < TH_AC (step S213; NO), the vowel section detection unit 130 next determines whether AC(t_{m-1}) ≥ TH_AC and AC(t_m) < TH_AC (step S215).

When AC(t_{m-1}) ≥ TH_AC and AC(t_m) < TH_AC (step S215; YES), the frame at time t_{m-1} belongs to a vowel section and the frame at time t_m belongs to a non-vowel section. In this case, the vowel section detection unit 130 first sets the end time Te(i) of the i-th vowel section to time t_{m-1} and then updates the variable i to i + 1 (step S216). After step S216, the vowel section detection unit 130 performs the determination of step S217.

In contrast, when the condition AC(t_{m-1}) ≥ TH_AC and AC(t_m) < TH_AC is not satisfied (step S215; NO), the waveform autocorrelations at times t_{m-1} and t_m are either both equal to or greater than the threshold TH_AC or both smaller than the threshold TH_AC. That is, the frames at times t_{m-1} and t_m either both belong to a non-vowel section or both belong to a vowel section. In this case, the vowel section detection unit 130 skips step S216 and proceeds to the determination of step S217.

In step S217, as described above, the vowel section detection unit 130 determines whether the variable m is equal to or greater than M, the last value in the section of the voice data currently being processed. When m < M (step S217; NO), the vowel section detection unit 130 updates the variable m to m + 1 (step S218) and repeats the processing from step S213.

On the other hand, when m ≥ M (step S217; YES), the vowel section detection unit 130 outputs the information [Ts(i), Te(i)] indicating the vowel sections (step S219) and ends the process of detecting the vowel sections contained in one processing target section of the voice data.
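
The loop of steps S212 to S219 can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the per-frame autocorrelation values are passed in as a list, and the names `detect_vowel_sections`, `ac`, and `times` are hypothetical. As in the flowchart as described, a vowel section that is still open when the last frame is reached receives no end time and is therefore not output.

```python
def detect_vowel_sections(ac, times, th_ac):
    """ac[k] is AC(t_m) and times[k] is t_m for m = k + 1 (k = 0 .. M-1)."""
    sections = []
    start = None
    prev = 0.0  # step S212: AC(t_0) = 0
    for k, cur in enumerate(ac):
        if prev < th_ac <= cur:
            # Step S213 YES: rising through the threshold, vowel starts at t_m.
            start = times[k]                          # step S214
        elif prev >= th_ac > cur:
            # Step S215 YES: falling below the threshold, vowel ended at t_{m-1}.
            sections.append((start, times[k - 1]))    # step S216
            start = None
        # Otherwise both frames are on the same side of the threshold.
        prev = cur
    return sections  # step S219: list of (Ts(i), Te(i))

sections = detect_vowel_sections(
    [0.1, 0.8, 0.9, 0.2, 0.7, 0.75, 0.1], [0, 1, 2, 3, 4, 5, 6], th_ac=0.5
)
```

In this example the frames at times 1 to 2 and 4 to 5 are at or above the threshold, so two vowel sections are returned.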

The vowel section detection unit 130 according to this embodiment performs steps S211 to S219 for each processing target section of the voice data.

The speech speed estimation device 1 according to this embodiment performs steps S3 to S6 of FIG. 2 on the basis of the utterance sections detected by the utterance section detection unit 120 and the vowel sections detected by the vowel section detection unit 130 through steps S211 to S219, and calculates the speech speed. That is, the speech speed estimation device 1 detects the time length of a one-mora (single vowel) vowel section on the basis of the frequency distribution of the time lengths of the vowel sections contained in the voice data. The speech speed estimation device 1 then determines the mora number of each vowel section on the basis of the one-mora time length and the time length of that vowel section. The speech speed estimation device 1 according to this embodiment can therefore avoid miscounting the mora number of a vowel section that has been lengthened into a long vowel, and can calculate the correct speech speed of an utterance section. As a result, the speech speed estimation device 1 can, for example, present the correct change of speech speed over time to the user whose speech speed is being estimated and guide that user to speak at an appropriate speed.

The speech speed estimation device 1 of this embodiment also, for example, generates display data that visualizes the calculated speech speed and outputs it to the display device 3. In doing so, the speech speed estimation device 1 generates display data such as the graph 421 shown in FIG. 10 and causes the display device 3 to display it. This makes it possible to present the speech speed to the person being measured during a conversation and to guide that person toward an appropriate speech speed.

Note that the flowchart of FIG. 12 is merely one example of the process of detecting vowel sections. In the process of detecting vowel sections according to this embodiment, vowel sections may be detected on the basis of the change over time of characteristics of the voice data other than the waveform autocorrelation. For example, vowel sections may be detected on the basis of the change over time of the signal-to-noise ratio described in the first embodiment. As another example, vowel sections may be detected on the basis of the change over time of the formant frequencies, as shown in FIG. 13.

FIG. 13 is a flowchart illustrating a modification of the process of detecting vowel sections according to the second embodiment.

When detecting vowel sections on the basis of the formant frequencies, the vowel section detection unit 130 first calculates the formant frequencies at each time t_m (m = 1, 2, ..., M) in the section of the voice data being processed (step S221). The interval between successive times (t_m − t_{m-1}) is, for example, the time length of the processing unit (frame) used when processing the voice data. In the following description, the frame of the voice data associated with time t_m is referred to as the frame at time t_m.

In step S221, the vowel section detection unit 130 calculates, according to a known calculation method, the formant frequencies FM(t_m, k) in the frame at time t_m of the voice data. For example, the vowel section detection unit 130 calculates the first formant frequency FM(t_m, 1), the second formant frequency FM(t_m, 2), and the third formant frequency FM(t_m, 3).

After step S221, the vowel section detection unit 130 sets the variable m, the variable i, the formant frequencies, and the average change over time of the formant frequencies to their initial values (step S222). In step S222, the vowel section detection unit 130 sets the variables m and i to m = 1 and i = 1, respectively. The variable i is a value that identifies a vowel section. The vowel section detection unit 130 also sets the initial values of the formant frequencies to, for example, FM(t_0, 1) = FM(t_0, 2) = FM(t_0, 3) = 0. In addition, the vowel section detection unit 130 sets the initial value ΔFM(t_0) of the average change over time of the formant frequencies to a value that satisfies ΔFM(t_0) ≥ TH_FM. The threshold TH_FM is a value used to determine whether the voice contained in the voice data is a vowel or a non-vowel (consonant).

Next, the vowel section detection unit 130 calculates the average change over time ΔFM(t_m) of the formant frequencies from time t_{m-1} to time t_m (step S223). The vowel section detection unit 130 calculates ΔFM(t_m) using, for example, the following equation (8).
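
Equation (8) is not reproduced in this text. Purely as an illustration, the Python sketch below assumes that ΔFM(t_m) is the mean absolute change of the first, second, and third formant frequencies between consecutive frames. Both this form and the function name `formant_change_average` are assumptions rather than the patent's definition.

```python
def formant_change_average(fm_prev, fm_cur):
    """Assumed form of ΔFM(t_m): mean absolute formant change between frames.

    fm_prev: [FM(t_{m-1}, 1), FM(t_{m-1}, 2), FM(t_{m-1}, 3)] in Hz.
    fm_cur:  [FM(t_m, 1), FM(t_m, 2), FM(t_m, 3)] in Hz.
    """
    diffs = [abs(cur - prev) for prev, cur in zip(fm_prev, fm_cur)]
    return sum(diffs) / len(diffs)

# Nearly stationary formants, as expected inside a vowel section.
delta = formant_change_average([700, 1200, 2500], [710, 1180, 2500])
```

A small value such as this one indicates that the formants barely move between frames, which is the property the threshold TH_FM is meant to capture.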

Next, the vowel section detection unit 130 determines whether the average change over time ΔFM(t_{m-1}) at time t_{m-1} is equal to or greater than the threshold TH_FM and the average change over time ΔFM(t_m) at time t_m is smaller than the threshold TH_FM (step S224). The threshold TH_FM is a value used to determine whether the voice contained in the voice data is a vowel or a non-vowel (consonant). The characteristics of a vowel section in voice data are determined mainly by the distribution of the first and second formant frequencies. In other words, the average change over time ΔFM of the formant frequencies within one vowel section is approximately constant and is smaller than the average change over time ΔFM in a non-vowel section. A section of the voice data consisting of consecutive frames in which ΔFM is smaller than the threshold TH_FM can therefore be regarded as a vowel section. The threshold TH_FM is set, for example, on the basis of a statistical value of ΔFM in vowel sections and a statistical value of ΔFM in non-vowel sections, both obtained by statistical processing.

When ΔFM(t_{m-1}) ≥ TH_FM and ΔFM(t_m) < TH_FM (step S224; YES), the vowel section detection unit 130 sets the start time Ts(i) of the i-th vowel section to time t_{m-1} (step S225). After step S225, the vowel section detection unit 130 determines whether the variable m is equal to or greater than M, the last value in the section of the voice data for which the formant frequencies were calculated (step S228).

On the other hand, when ΔFM(t_{m-1}) < TH_FM or ΔFM(t_m) ≥ TH_FM (step S224; NO), the vowel section detection unit 130 next determines whether ΔFM(t_{m-1}) < TH_FM and ΔFM(t_m) ≥ TH_FM (step S226).

When ΔFM(t_{m-1}) < TH_FM and ΔFM(t_m) ≥ TH_FM (step S226; YES), the vowel section detection unit 130 sets the end time Te(i) of the i-th vowel section to time t_{m-1} and updates the variable i to i + 1 (step S227). After step S227, the vowel section detection unit 130 performs the determination of step S228.

In contrast, when the condition ΔFM(t_{m-1}) < TH_FM and ΔFM(t_m) ≥ TH_FM is not satisfied, the average changes over time of the formant frequencies at times t_{m-1} and t_m are either both equal to or greater than the threshold TH_FM or both smaller than the threshold TH_FM. That is, the frames at times t_{m-1} and t_m of the voice data either both belong to a non-vowel section or both belong to a vowel section. In this case (step S226; NO), the vowel section detection unit 130 skips step S227 and proceeds to the determination of step S228.

As described above, in step S228 the vowel section detection unit 130 determines whether the variable m is equal to or greater than M, the last value in the section of the voice data for which the formant frequencies were calculated. When m < M (step S228; NO), the vowel section detection unit 130 updates the variable m to m + 1 (step S229) and repeats the processing from step S223.

On the other hand, when m ≥ M (step S228; YES), the vowel section detection unit 130 outputs the information [Ts(i), Te(i)] indicating the vowel sections (step S230) and ends the process of detecting vowel sections for one processing target section of the voice data.
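
The formant-based loop of steps S222 to S230 mirrors the autocorrelation-based one, with the comparison reversed: a vowel section begins when ΔFM drops below TH_FM and ends when it rises back to TH_FM or above. The Python sketch below is a minimal illustration that takes precomputed ΔFM values as input. The names `detect_vowel_sections_by_formant`, `dfm`, and `times` are hypothetical, and both the start and end times are taken at t_{m-1}, following steps S225 and S227 as described.

```python
def detect_vowel_sections_by_formant(dfm, times, th_fm):
    """dfm[k] is ΔFM(t_{k+1}); times[k] is t_k, so len(times) == len(dfm) + 1."""
    sections = []
    start = None
    prev = th_fm  # step S222: initial ΔFM(t_0) chosen so that ΔFM(t_0) >= TH_FM
    for k, cur in enumerate(dfm):
        if prev >= th_fm > cur:
            # Step S224 YES: formants become stable, vowel starts at t_{m-1}.
            start = times[k]                     # step S225
        elif prev < th_fm <= cur:
            # Step S226 YES: formants start moving, vowel ends at t_{m-1}.
            sections.append((start, times[k]))   # step S227
            start = None
        prev = cur
    return sections  # step S230: list of (Ts(i), Te(i))

sections_fm = detect_vowel_sections_by_formant(
    [50, 5, 4, 60, 3, 2, 70], [0, 1, 2, 3, 4, 5, 6, 7], th_fm=20
)
```

In this example ΔFM stays below the threshold for two runs of frames, giving two vowel sections.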

In this way, the vowel sections in the voice data can also be detected on the basis of the average change over time of the formant frequencies, including the first and second formant frequencies. Even when vowel sections are detected this way, lengthening of a vowel into a long vowel can cause, for example, a vowel section that is actually two moras to be detected as a single vowel section. However, by determining the number of vowels in each vowel section on the basis of the frequency distribution of the time lengths of the vowel sections contained in the voice data, the mora number (number of vowels) of an utterance section in the voice data can be calculated correctly and the speech speed can be calculated accurately.

The mora number determination unit 140 of the speech speed estimation device 1 according to this embodiment may also include a time length calculation unit that performs step S3' described above, a minimum value specifying unit that performs step S4', and a mora number calculation unit that performs step S5'.

［第３の実施形態］ [Third Embodiment]
図１４は、第３の実施形態に係る通話システムのシステム構成を示す図である。 FIG. 14 is a diagram illustrating a system configuration of a call system according to the third embodiment.

図１４に示すように、本実施形態に係る通話システム１０は、第１の話者９Ａと、第２の話者９Ｂとの通話に利用される。第１の話者９Ａは、携帯電話端末２５を通話装置（電話機）として用いる。携帯電話端末２５は、通話処理部２１と、収音装置２と、レシーバ１３と、話速調整部２６とを含む。 As shown in FIG. 14, the call system 10 according to this embodiment is used for a call between a first speaker 9A and a second speaker 9B. The first speaker 9A uses the mobile phone terminal 25 as a call device (telephone). The mobile phone terminal 25 includes a call processing unit 21, a sound collection device 2, a receiver 13, and a speech speed adjustment unit 26.

第１の話者９Ａが携帯電話端末２５を用いて通話を行う際、通話処理部２１は、基地局３０、及びネットワーク１５を介して、通話相手が利用する通話装置（電話機）に接続される。携帯電話端末２５（通話処理部２１）と基地局３０とは、所定の無線通信規格に従った無線通信により接続される。第１の話者９Ａの通話相手が第２の話者９Ｂである場合、通話処理部２１は、基地局３０、ネットワーク１５、及び交換機１４Ｂを介して、第２の話者９Ｂが利用する電話機１６と接続される。 When the first speaker 9A makes a call using the mobile phone terminal 25, the call processing unit 21 is connected, via the base station 30 and the network 15, to the call device (telephone) used by the other party. The mobile phone terminal 25 (call processing unit 21) and the base station 30 are connected by wireless communication in accordance with a predetermined wireless communication standard. When the other party of the first speaker 9A is the second speaker 9B, the call processing unit 21 is connected to the telephone 16 used by the second speaker 9B via the base station 30, the network 15, and the exchange 14B.

携帯電話端末２５は、収音装置２から取得した第１の話者９Ａの音声を含む音声データの話速を調整して電話機１６に向けて送信する処理と、電話機１６から受信した音声データをレシーバ１３に出力する処理とを行う。携帯電話端末２５において音声データの話速を調整する処理は、話速調整部２６が行う。また、携帯電話端末２５において、話速を調整した音声データを電話機１６に向けて送信する処理と、電話機１６から受信した音声データをレシーバ１３に出力する処理とは、通話処理部２１が行う。 The cellular phone terminal 25 adjusts the speech speed of the voice data including the voice of the first speaker 9 A acquired from the sound collection device 2 and transmits the voice data to the telephone set 16, and the voice data received from the telephone set 16. Processing to output to the receiver 13 is performed. The speech speed adjustment unit 26 performs processing for adjusting the speech speed of the voice data in the mobile phone terminal 25. In the cellular phone terminal 25, the call processing unit 21 performs processing for transmitting voice data with adjusted speech speed to the telephone set 16 and processing for outputting voice data received from the telephone set 16 to the receiver 13.

一方、第２の話者９Ｂが利用する電話機１６は、通話処理部１６１０と、収音装置１６２０と、レシーバ１６３０とを含む。電話機１６の通話処理部１６１０は、収音装置１６２０から取得した第２の話者９Ｂの音声を含む音声データを携帯電話端末２５に向けて送信する処理と、携帯電話端末２５から受信した音声データをレシーバ１６３０に出力する処理とを行う。 On the other hand, the telephone 16 used by the second speaker 9B includes a call processing unit 1610, a sound collection device 1620, and a receiver 1630. The call processing unit 1610 of the telephone 16 performs a process of transmitting voice data including the voice of the second speaker 9B, acquired from the sound collection device 1620, to the mobile phone terminal 25, and a process of outputting voice data received from the mobile phone terminal 25 to the receiver 1630.

本実施形態に係る通話システムにおける携帯電話端末２５は、上記のように、話速調整部２６において、収音装置２から取得した第１の話者９Ａの音声を含む音声データの話速を調整する処理を行う。話速調整部２６は、音声データにおける第１の話者９Ａの話速を算出した後、該話速の算出結果に基づいて、音声データにおける話速が適正な話速になるよう音声データを調整する。本実施形態における話速調整部２６は、音声データにおける話速が閾値以上である場合に、音声データを伸長させて話速を減速させる。 As described above, the mobile phone terminal 25 in the call system according to the present embodiment uses the speech speed adjustment unit 26 to adjust the speech speed of the voice data, acquired from the sound collection device 2, that includes the voice of the first speaker 9A. The speech speed adjustment unit 26 calculates the speech speed of the first speaker 9A in the voice data and then, based on the calculated speech speed, adjusts the voice data so that its speech speed becomes an appropriate speech speed. The speech speed adjustment unit 26 of the present embodiment expands the voice data to slow down the speech when the speech speed in the voice data is equal to or higher than a threshold.

図１５は、第３の実施形態に係る話速調整部の機能的構成を示す図である。 FIG. 15 is a diagram illustrating a functional configuration of the speech speed adjustment unit according to the third embodiment.
図１５に示すように、本実施形態の携帯電話端末２５における話速調整部２６は、音声取得部１１０と、発話区間検出部１２０と、母音区間検出部１３０と、モーラ数決定部１４０と、話速算出部１５０と、話速制御部１７０と、出力部１６２と、を備える。また、話速調整部２６は、音声データ１９１及び目標伸長率１９２を含む各種情報を記憶させる記憶部１９０を備える。目標伸長率１９２は、音声データにおける話速を減速させる際の音声データの伸長率の目標値である。 As shown in FIG. 15, the speech speed adjustment unit 26 in the mobile phone terminal 25 of the present embodiment includes a voice acquisition unit 110, an utterance section detection unit 120, a vowel section detection unit 130, a mora number determination unit 140, a speech speed calculation unit 150, a speech speed control unit 170, and an output unit 162. The speech speed adjustment unit 26 also includes a storage unit 190 that stores various types of information including the voice data 191 and the target expansion rate 192. The target expansion rate 192 is the target value for the expansion rate of the voice data when the speech speed in the voice data is reduced.

母音区間検出部１３０は、音声データの発話区間に含まれる母音区間を検出する。なお、本実施形態の話速調整部２６における母音区間検出部１３０は、発話区間検出部１２０で検出した発話区間からではなく、音声データ全体から母音区間を検出する。 The vowel section detector 130 detects a vowel section included in the speech section of the voice data. Note that the vowel segment detection unit 130 in the speech speed adjustment unit 26 of the present embodiment detects a vowel segment from the entire speech data, not from the utterance segment detected by the utterance segment detection unit 120.

モーラ数決定部１４０は、音声データから検出した複数の母音区間のそれぞれの時間長に基づいて、各母音区間のモーラ数を算出する。モーラ数決定部１４０は、第１の実施形態で説明した方法により、各母音区間のモーラ数を算出する。モーラ数決定部１４０は、頻度分布算出部１４１と、ピーク検出部１４２と、モーラ数算出部１４３と、を含む。 The mora number determination unit 140 calculates the mora number of each vowel section based on the time lengths of the plurality of vowel sections detected from the speech data. The mora number determination unit 140 calculates the mora number of each vowel section by the method described in the first embodiment. The mora number determination unit 140 includes a frequency distribution calculation unit 141, a peak detection unit 142, and a mora number calculation unit 143.

話速算出部１５０は、発話区間の時間長と、該発話区間に含まれる母音区間のモーラ数とに基づいて、音声データにおける話速（発話速度）を算出する。 The speech speed calculation unit 150 calculates the speech speed (speech speed) in the speech data based on the time length of the speech section and the number of mora of the vowel section included in the speech section.

話速制御部１７０は、話速算出部１５０で算出した話速に基づいて、音声データの話速を制御する。本実施形態に係る話速制御部１７０は、算出した話速が適正な話速よりも速い場合に、目標伸長率１９２を参照して、音声データの話速が適正な話速となるよう音声データを伸長させる。 The speech speed control unit 170 controls the speech speed of the voice data based on the speech speed calculated by the speech speed calculation unit 150. When the calculated speech speed is faster than the appropriate speech speed, the speech speed control unit 170 according to the present embodiment refers to the target expansion rate 192 and expands the voice data so that its speech speed becomes the appropriate speech speed.

出力部１６２は、話速制御部１７０により話速を制御した音声データを通話処理部２１に出力する。 The output unit 162 outputs the voice data whose speech speed is controlled by the speech speed control unit 170 to the call processing unit 21.

図１６は、話速制御部の機能的構成を示す図である。 FIG. 16 is a diagram illustrating a functional configuration of the speech speed control unit.
図１６に示すように、本実施形態に係る話速制御部１７０は、基本周期検出部１７１と、波形処理部１７２と、伸長制御部１７３と、を含む。 As shown in FIG. 16, the speech speed control unit 170 according to the present embodiment includes a basic period detection unit 171, a waveform processing unit 172, and an expansion control unit 173.

基本周期検出部１７１は、音声データのうちの処理対象である母音区間の基本周期を検出する。 The basic period detection unit 171 detects the basic period of the vowel section that is the processing target in the voice data.

波形処理部１７２は、音声データに基本周期の波形を重ねて音声データを伸長させる。 The waveform processing unit 172 expands the voice data by superimposing a waveform of the basic period on the voice data.
伸長制御部１７３は、音声データを伸長させるか否か、言い換えると基本周期の波形を重ねるか否かの制御を行う。伸長制御部１７３は、まず、話速算出部１５０で算出した話速に基づいて、音声データを伸長させるか否かを判定する。また、音声データを伸長させる場合、伸長制御部１７３は、波形処理部１７２で基本周期の波形を重ねた音声データの実績伸長率と、目標伸長率とに基づいて、基本周期の波形を更に重ねるか否かの制御を行う。 The expansion control unit 173 controls whether or not the voice data is expanded, in other words, whether or not a waveform of the basic period is superimposed. The expansion control unit 173 first determines whether or not to expand the voice data based on the speech speed calculated by the speech speed calculation unit 150. When the voice data is to be expanded, the expansion control unit 173 controls whether or not to superimpose a further waveform of the basic period, based on the target expansion rate and on the actual expansion rate of the voice data on which the waveform processing unit 172 has already superimposed waveforms of the basic period.

本実施形態の携帯電話端末２５における話速調整部２６は、収音装置２から音声データを取得する処理と、取得した音声データにおける話速を制御して通話処理部２１に出力する処理とを行う。音声データを取得する処理は、話速調整部２６の音声取得部１１０が行う。 The speech speed adjustment unit 26 in the mobile phone terminal 25 of the present embodiment performs a process of acquiring voice data from the sound collection device 2 and a process of controlling the speech speed in the acquired voice data and outputting the result to the call processing unit 21. The process of acquiring the voice data is performed by the voice acquisition unit 110 of the speech speed adjustment unit 26.

一方、音声データにおける話速を制御して出力する処理は、話速調整部２６の発話区間検出部１２０、母音区間検出部１３０、モーラ数決定部１４０、話速算出部１５０、話速制御部１７０、及び出力部１６２が行う。話速調整部２６は、音声データにおける話速を制御して出力する処理として、図１７に示した処理を行う。 On the other hand, the process of controlling and outputting the speech speed in the speech data includes the speech interval detection unit 120, the vowel interval detection unit 130, the mora number determination unit 140, the speech speed calculation unit 150, and the speech speed control unit of the speech speed adjustment unit 26. 170 and the output unit 162. The speech speed adjustment unit 26 performs the process shown in FIG. 17 as a process of controlling and outputting the speech speed in the voice data.

図１７は、第３の実施形態に係る話速調整部が行う処理を説明するフローチャートである。 FIG. 17 is a flowchart illustrating processing performed by the speech speed adjustment unit according to the third embodiment.

音声データにおける話速を制御して出力する処理において、話速調整部２６は、まず、音声データに含まれる発話区間を検出する（ステップＳ１）。ステップＳ１の処理は、発話区間検出部１２０が行う。発話区間検出部１２０は、既知の検出方法に従い、音声データに含まれる発話区間（言い換えると話速推定の対象である人物が発した音声を含む区間）を検出する。例えば、発話区間検出部１２０は、ＶＡＤにより発話区間を検出する。 In the process of controlling and outputting the speech speed in the speech data, the speech speed adjusting unit 26 first detects a speech section included in the speech data (step S1). The processing of step S1 is performed by the utterance section detection unit 120. The utterance section detection unit 120 detects an utterance section (in other words, a section including speech uttered by a person who is a target of speech speed estimation) included in the sound data according to a known detection method. For example, the utterance section detection unit 120 detects the utterance section by VAD.

次に、話速調整部２６は、発話区間に含まれる母音区間を検出する（ステップＳ２）。ステップＳ２の処理は、母音区間検出部１３０が行う。母音区間検出部１３０は、既知の検出方法に従い、音声データにおける母音区間を検出する。例えば、母音区間検出部１３０は、音声データにおける信号対雑音比の時間変化に基づいて、信号対雑音比が所定の閾値以上で連続する１個の区間を１個の母音区間として検出する。 Next, the speech speed adjustment unit 26 detects a vowel section included in the utterance section (step S2). The process of step S2 is performed by the vowel section detection unit 130. The vowel section detection unit 130 detects a vowel section in the voice data according to a known detection method. For example, the vowel section detection unit 130 detects, as one vowel section, one section in which the signal-to-noise ratio is continuous at a predetermined threshold or more based on the time change of the signal-to-noise ratio in the speech data.
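
The SNR-based detection in step S2 can be sketched as a scan for contiguous runs of frames whose signal-to-noise ratio stays at or above a threshold. This is a minimal illustration only: the function name, the per-frame SNR list, and the `frame_sec` parameter are assumptions made for the sketch, and the text does not specify how the SNR itself is estimated.

```python
def detect_vowel_sections(snr_db, threshold_db, frame_sec):
    """Return (start_sec, end_sec) pairs for runs of frames with SNR >= threshold.

    snr_db:       per-frame SNR values (hypothetical input, one value per frame)
    threshold_db: predetermined SNR threshold
    frame_sec:    frame length in seconds (assumed constant)
    """
    sections = []
    start = None  # index of the first frame of the run currently open, if any
    for i, snr in enumerate(snr_db):
        if snr >= threshold_db and start is None:
            start = i  # a new candidate vowel section begins
        elif snr < threshold_db and start is not None:
            sections.append((start * frame_sec, i * frame_sec))
            start = None  # the run has ended
    if start is not None:  # a run that reaches the end of the data
        sections.append((start * frame_sec, len(snr_db) * frame_sec))
    return sections
```

Each returned pair plays the role of the start time and end time of one detected vowel section, from which the time length of the section follows directly.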

次に、話速調整部２６は、検出した母音区間のモーラ数（母音数）を決定する処理（ステップＳ３〜Ｓ５）を行う。母音区間のモーラ数を決定するステップＳ３〜Ｓ５の処理は、話速調整部２６のモーラ数決定部１４０が行う。 Next, the speech speed adjustment unit 26 performs processing (steps S3 to S5) for determining the number of mora (number of vowels) in the detected vowel section. The process of steps S3 to S5 for determining the number of mora in the vowel section is performed by the mora number determining unit 140 of the speech speed adjusting unit 26.

モーラ数決定部１４０は、まず、検出した複数の母音区間のそれぞれの時間長に基づいて、母音区間の時間長についての頻度分布を算出する（ステップＳ３）。ステップＳ３の処理は、モーラ数決定部１４０の頻度分布算出部１４１が行う。頻度分布算出部１４１は、検出した複数の母音区間のそれぞれにおける区間の開始時刻と終了時刻とに基づいて、母音区間の時間長を算出する。また、頻度分布算出部１４１は、各母音区間の時間長に基づいて時間長毎の母音区間の出現頻度を計数し、頻度分布を算出する。この際、頻度分布算出部１４１は、例えば、１個の発話区間における末尾の母音区間を除外して、頻度分布を算出する。 The mora number determination unit 140 first calculates a frequency distribution for the time length of the vowel section based on the detected time lengths of the plurality of vowel sections (step S3). The processing in step S3 is performed by the frequency distribution calculation unit 141 of the mora number determination unit 140. The frequency distribution calculation unit 141 calculates the time length of the vowel section based on the start time and end time of each of the detected plurality of vowel sections. Further, the frequency distribution calculation unit 141 calculates the frequency distribution by counting the appearance frequency of the vowel section for each time length based on the time length of each vowel section. At this time, the frequency distribution calculation unit 141 calculates the frequency distribution by excluding, for example, the last vowel section in one utterance section.

次に、モーラ数決定部１４０は、ステップＳ３で算出した頻度分布において出現頻度が極大値（ピーク）となる時間長を検出する（ステップＳ４）。ステップＳ４の処理は、モーラ数決定部１４０のピーク検出部１４２が行う。例えば、ピーク検出部１４２は、頻度分布における最短時間長の出現頻度から順に、判定対象である時間長の出現頻度と、その前後の時間長の出現頻度と比較し、出現頻度が極大値となる時間長を検出する。 Next, the mora number determination unit 140 detects the time lengths at which the appearance frequency takes a local maximum (peak) in the frequency distribution calculated in step S3 (step S4). The process of step S4 is performed by the peak detection unit 142 of the mora number determination unit 140. For example, starting from the shortest time length in the frequency distribution, the peak detection unit 142 compares the appearance frequency of the time length under examination with the appearance frequencies of the time lengths immediately before and after it, and detects the time lengths at which the appearance frequency is a local maximum.

次に、モーラ数決定部１４０は、頻度分布から検出した時間長のうちの最小値と、母音区間の時間長とに基づいて、各母音区間のモーラ数を算出する（ステップＳ５）。ステップＳ５の処理は、モーラ数決定部１４０のモーラ数算出部１４３が行う。モーラ数算出部１４３は、頻度分布においてピークとなる複数の時間長のうちの最小値を基準時間長とし、まず、母音区間の時間長を基準時間長で除した値を算出する。その後、モーラ数算出部１４３は、母音区間の時間長を基準時間長で除した値に近い整数値を、該母音区間のモーラ数とする。 Next, the mora number determination unit 140 calculates the number of mora in each vowel section based on the minimum value of the time lengths detected from the frequency distribution and the time length of the vowel sections (step S5). The process of step S5 is performed by the mora number calculation unit 143 of the mora number determination unit 140. The mora number calculation unit 143 sets a minimum value among a plurality of time lengths peaking in the frequency distribution as a reference time length, and first calculates a value obtained by dividing the time length of the vowel section by the reference time length. Thereafter, the mora number calculation unit 143 sets an integer value close to a value obtained by dividing the time length of the vowel section by the reference time length as the mora number of the vowel section.
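
Steps S3 to S5 can be sketched as follows. The bin width, the use of millisecond integers, and the function name are assumptions for illustration. The sketch also skips the optional exclusion of the last vowel section of each utterance section, and it uses Python's built-in `round` as one possible reading of "an integer value close to" the quotient.

```python
from collections import Counter


def estimate_mora_counts(durations_ms, bin_ms=20):
    """Estimate a mora count per vowel section from section durations.

    durations_ms: vowel-section time lengths in milliseconds (hypothetical input)
    bin_ms:       histogram bin width in milliseconds (assumed value)
    Returns (mora_counts, reference_ms).
    """
    if not durations_ms:
        return [], None
    # Step S3: frequency distribution of time lengths (nearest-bin rounding).
    histogram = Counter((d + bin_ms // 2) // bin_ms for d in durations_ms)
    # Step S4: scan from the shortest time length and keep local maxima.
    peaks = []
    for b in range(min(histogram), max(histogram) + 1):
        count = histogram.get(b, 0)
        if count > 0 and count > histogram.get(b - 1, 0) and count >= histogram.get(b + 1, 0):
            peaks.append(b)
    # Step S5: the shortest peak is taken as the time length of one mora.
    reference_ms = max(peaks[0], 1) * bin_ms
    counts = [max(1, round(d / reference_ms)) for d in durations_ms]
    return counts, reference_ms
```

With durations clustered around 100 ms and 200 ms, the shortest peak at 100 ms becomes the reference time length, so a 200 ms section is counted as 2 mora rather than 1.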

モーラ数決定部１４０によるステップＳ３〜Ｓ５の処理を終えると、話速調整部２６は、次に、ステップＳ５で算出した各母音区間のモーラ数と、発話区間の時間長とに基づいて、発話区間の話速を算出する（ステップＳ６）。ステップＳ６の処理は、話速算出部１５０が行う。話速算出部１５０は、話速として、発話区間に含まれる母音区間についてのモーラ数の合計を、該発話区間の時間長で除した値（モーラ／秒）を算出する。 When the processing of steps S3 to S5 by the mora number determination unit 140 is completed, the speech speed adjustment unit 26 next calculates the speech speed of the utterance section based on the mora number of each vowel section calculated in step S5 and the time length of the utterance section (step S6). The process of step S6 is performed by the speech speed calculation unit 150. As the speech speed, the speech speed calculation unit 150 calculates the value (mora/second) obtained by dividing the total number of mora of the vowel sections included in the utterance section by the time length of that utterance section.
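
Because the speech speed is expressed in mora per second, step S6 reduces to a single division. The sketch below assumes the per-section mora counts and the utterance length are already available; the function and parameter names are hypothetical.

```python
def speech_rate(mora_counts, utterance_sec):
    """Return the speech speed in mora per second for one utterance section.

    mora_counts:   mora count of each vowel section in the utterance section
    utterance_sec: time length of the utterance section in seconds
    """
    if utterance_sec <= 0:
        raise ValueError("utterance_sec must be positive")
    return sum(mora_counts) / utterance_sec
```

For example, 8 mora spoken within a 1-second utterance section gives 8.0 mora/second, which is the example threshold the text uses for deciding whether to slow the speech down.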

次に、話速調整部２６は、算出した話速に基づいて、音声データの話速を制御する処理（ステップＳ８）を行う。ステップＳ８の処理は、話速制御部１７０が行う。話速制御部１７０は、音声データのうちの算出した話速が適正な話速よりも速い区間に、該当区間における基本周期の波形を重ねて音声データを伸長させる処理を行う。 Next, the speech speed adjustment unit 26 performs processing (step S8) for controlling the speech speed of the voice data based on the calculated speech speed. The speech speed control unit 170 performs the process in step S8. The speech speed control unit 170 performs a process of extending the speech data by superimposing the waveform of the basic period in the corresponding section on a section in which the calculated speech speed is faster than the appropriate speech speed.

次に、話速調整部２６は、話速制御部１７０において話速を調整した音声データを通話処理部２１に出力する（ステップＳ９）。ステップＳ９の処理は、出力部１６２が行う。 Next, the speech speed adjustment unit 26 outputs the voice data whose speech speed has been adjusted by the speech speed control unit 170 to the call processing unit 21 (step S9). The process of step S9 is performed by the output unit 162.

話速調整部２６は、通話中、上記のステップＳ１〜Ｓ６，Ｓ８，及びＳ９の処理を繰り返し行う。この際、話速調整部２６は、音声データにおける１個の処理対象区間に対するステップＳ１〜Ｓ６，Ｓ８，及びＳ９の処理を終えてから次の処理対象区間に対する処理を行ってもよいし、各ステップの処理をパイプライン化して行ってもよい。 During a call, the speech speed adjustment unit 26 repeatedly performs the processes of steps S1 to S6, S8, and S9 described above. In doing so, the speech speed adjustment unit 26 may finish steps S1 to S6, S8, and S9 for one processing target section in the voice data before processing the next processing target section, or it may pipeline the processing of the individual steps.

本実施形態に係る携帯電話端末２５における話速調整部２６は、上記のように、音声データにおける母音区間の時間長についての頻度分布に基づいて、１モーラ（単独母音）に相当する母音区間の時間長を算出し、各母音区間の母音数を決定する。このため、話速調整部２６は、長母音化による話速の誤りを防ぎ、話速を精度良く算出する（推定する）ことが可能となり、音声データにおける話速を適正な話速に調整することが可能となる。 As described above, the speech speed adjustment unit 26 in the mobile phone terminal 25 according to the present embodiment calculates the time length of a vowel section corresponding to 1 mora (a single vowel) based on the frequency distribution of the time lengths of the vowel sections in the voice data, and determines the number of vowels in each vowel section. The speech speed adjustment unit 26 can therefore avoid speech speed errors caused by long vowels, calculate (estimate) the speech speed with high accuracy, and adjust the speech speed in the voice data to an appropriate speech speed.

本実施形態の話速調整部２６における話速制御部１７０は、音声データの話速を制御する処理（ステップＳ８）として、例えば、図１８に示した処理を行う。 The speech speed control unit 170 in the speech speed adjustment unit 26 of the present embodiment performs, for example, the process shown in FIG. 18 as a process (step S8) for controlling the speech speed of the voice data.

図１８は、音声データの話速を制御する処理の内容を説明するフローチャートである。 FIG. 18 is a flowchart for explaining the contents of the process for controlling the speech speed of voice data.
話速制御部１７０は、音声データにおける話速を算出した区間（例えば、発話区間）毎に、図１８のステップＳ８０１〜Ｓ８０７の処理を行う。話速制御部１７０は、まず、話速算出部１５０で算出した話速に基づいて、処理対象区間の話速が閾値以上であるか否かを判定する（ステップＳ８０１）。ステップＳ８０１の判定は、例えば、話速制御部１７０の伸長制御部１７３が行う。話速が閾値よりも小さい（遅い）場合（ステップＳ８０１；ＮＯ）、話速制御部１７０は、処理対象区間に対する話速を制御する処理を終了する。話速の閾値は、例えば、８モーラ／秒とする。 The speech speed control unit 170 performs the processing of steps S801 to S807 in FIG. 18 for each section (for example, each utterance section) for which the speech speed in the voice data has been calculated. The speech speed control unit 170 first determines, based on the speech speed calculated by the speech speed calculation unit 150, whether or not the speech speed of the processing target section is equal to or higher than a threshold (step S801). The determination in step S801 is performed by, for example, the expansion control unit 173 of the speech speed control unit 170. When the speech speed is lower (slower) than the threshold (step S801; NO), the speech speed control unit 170 ends the process of controlling the speech speed for the processing target section. The speech speed threshold is, for example, 8 mora/second.

処理対象区間の話速が閾値以上である場合（ステップＳ８０１；ＹＥＳ）、話速制御部１７０は、次に、処理対象区間に対する実績伸長率を初期化する（ステップＳ８０２）。ステップＳ８０２の処理は、話速制御部１７０の伸長制御部１７３が行う。 When the speech speed of the processing target section is equal to or higher than the threshold (step S801; YES), the speech speed control unit 170 next initializes the actual expansion rate for the processing target section (step S802). The process of step S802 is performed by the expansion control unit 173 of the speech speed control unit 170.

その後、話速制御部１７０は、音声データにおける処理対象区間を伸長させるステップＳ８０３〜Ｓ８０７の処理を行う。例えば、ステップＳ８０３〜Ｓ８０７の処理は、１フレーム期間を２０ミリ秒とするフレーム処理とする。 Thereafter, the speech speed control unit 170 performs the processes of steps S803 to S807 for expanding the processing target section in the voice data. For example, the processing in steps S803 to S807 is frame processing in which one frame period is 20 milliseconds.

音声データにおける処理対象区間を伸長させる処理において、話速制御部１７０は、まず、処理対象区間における母音区間の基本周期を検出する（ステップＳ８０３）。ステップＳ８０３の処理は、話速制御部１７０の基本周期検出部１７１が行う。基本周期検出部１７１は、既知の検出方法に従って、処理対象区間の音声波形についての自己相関を算出し、シフト量が０よりも大きい区間において自己相関が初めて極大となるシフト量と対応する周期を、基本周期として算出する。 In the process of extending the processing target section in the voice data, the speech speed control unit 170 first detects the basic period of the vowel section in the processing target section (step S803). The basic cycle detection unit 171 of the speech speed control unit 170 performs the process of step S803. The basic period detection unit 171 calculates the autocorrelation for the speech waveform in the processing target section according to a known detection method, and calculates a period corresponding to the shift amount at which the autocorrelation is maximized for the first time in the section where the shift amount is greater than zero. And calculated as a basic period.
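
The autocorrelation search in step S803 can be sketched as below. This is an illustrative, unnormalized autocorrelation over a plain list of samples; the function name and the limit of half the buffer length for the search range are assumptions of the sketch, not details given in the text.

```python
def fundamental_period(samples):
    """Return the first shift (> 0) at which the autocorrelation has a local maximum.

    samples: waveform samples of the section being processed (list of numbers)
    Returns the period in samples, or None if no local maximum is found.
    """
    n = len(samples)

    def autocorr(shift):
        # Sum of products of the waveform with a copy of itself delayed by `shift`.
        return sum(samples[i] * samples[i + shift] for i in range(n - shift))

    prev = autocorr(0)
    cur = autocorr(1) if n > 1 else 0.0
    for shift in range(1, n // 2):
        nxt = autocorr(shift + 1)
        if cur > prev and cur >= nxt:  # first local maximum after shift 0
            return shift
        prev, cur = cur, nxt
    return None
```

The shift value returned here corresponds to the period, in samples, that the later waveform-superimposing step repeats.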

次に、話速制御部１７０は、算出した基本周期に基づいてピッチの時間変化率を算出する（ステップＳ８０４）。ステップＳ８０４の処理は、例えば、話速制御部１７０の伸長制御部１７３が行う。伸長制御部１７３は、既知の算出方法に従って、ピッチの時間変化率を算出する。 Next, the speech speed control unit 170 calculates the time change rate of the pitch based on the calculated basic period (step S804). The process of step S804 is performed by, for example, the expansion control unit 173 of the speech speed control unit 170. The expansion control unit 173 calculates the time change rate of the pitch according to a known calculation method.

次に、話速制御部１７０は、実績伸長率が目標伸長率よりも小さいか否かを判定する（ステップＳ８０５）。ステップＳ８０５の判定処理は、例えば、話速制御部１７０の伸長制御部１７３が行う。伸長制御部１７３は、記憶部１９０の目標伸長率１９２を参照して現在の実績伸長率が目標伸長率１９２よりも小さいか否かを判定する。 Next, the speech speed control unit 170 determines whether or not the actual expansion rate is smaller than the target expansion rate (step S805). The determination process in step S805 is performed by, for example, the expansion control unit 173 of the speech speed control unit 170. The expansion control unit 173 determines whether or not the current actual expansion rate is smaller than the target expansion rate 192 with reference to the target expansion rate 192 of the storage unit 190.

実績伸長率が目標伸長率よりも小さい場合（ステップＳ８０５；ＹＥＳ）、話速制御部１７０は、音声データの処理対象区間に基本周期の音声波形を重ね合わせる（ステップＳ８０６）。ステップＳ８０６の処理は、話速制御部１７０の波形処理部１７２が行う。波形処理部１７２は、既知の方法に従って、音声データの処理対象区間に基本周期の音声波形を重ね合わせる。 When the actual expansion rate is smaller than the target expansion rate (step S805; YES), the speech speed control unit 170 superimposes the speech waveform of the basic period on the processing target section of the audio data (step S806). The processing in step S806 is performed by the waveform processing unit 172 of the speech speed control unit 170. The waveform processing unit 172 superimposes the speech waveform of the basic period on the processing target section of the speech data according to a known method.

ステップＳ８０６の処理の後、話速制御部１７０は、実績伸長率を更新し（ステップＳ８０７）、ステップＳ８０３以降の処理を行う。ステップＳ８０７の処理は、例えば、伸長制御部１７３が行う。伸長制御部１７３は、例えば、下記式（９）により、実績伸長率rate_result（ｎ）を算出する。 After the process of step S806, the speech speed control unit 170 updates the actual expansion rate (step S807) and performs the processes from step S803 onward. The process of step S807 is performed by, for example, the expansion control unit 173. The expansion control unit 173 calculates the actual expansion rate rate_result(n) by, for example, the following equation (9).

式（９）のｓ及びｎは、それぞれ、処理対象区間の開始フレーム及び現フレームである。式（９）のＭは１フレームのサンプル数であり、例えば、Ｍ＝１６０とする。式（９）のａｄｄ（ｉ）は、ｉ番目のフレーム処理で追加したサンプル数である。 In Equation (9), s and n are the start frame and the current frame of the processing target section, respectively. M in Equation (9) is the number of samples in one frame, and for example, M = 160. In the equation (9), add (i) is the number of samples added in the i-th frame processing.
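
The body of equation (9) does not appear in this text, so the following is only a plausible reading built from the definitions of s, n, M, and add(i): the ratio of the expanded length, (n − s + 1)·M plus the sum of add(i) for i = s to n, to the original length (n − s + 1)·M. The function name and this exact form are assumptions of the sketch.

```python
def actual_expansion_rate(added_samples, frame_samples=160):
    """Return an assumed actual expansion rate for frames s..n.

    added_samples: add(i) for each processed frame i = s..n (samples added per frame)
    frame_samples: M, the number of samples in one frame (160 in the text's example)
    """
    frames = len(added_samples)  # corresponds to n - s + 1
    if frames == 0:
        return 1.0  # nothing processed yet, so no expansion
    original = frames * frame_samples
    return (original + sum(added_samples)) / original
```

Under this reading, adding 160 samples in total across four 160-sample frames gives a rate of 1.25, that is, the section has become 25% longer.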

一方、ステップＳ８０５の判定処理において、実績伸長率が目標伸長率以上であった場合（ステップＳ８０５；ＮＯ）、話速制御部１７０は、音声データにおける処理対象区間を出力して、処理対象区間に対する話速を制御する処理を終了する。話速制御部１７０は、１個の処理対象区間に対する処理を終えると、次の処理対象区間に対するステップＳ８０１以降の処理を行う。 On the other hand, in the determination process of step S805, when the actual expansion rate is equal to or higher than the target expansion rate (step S805; NO), the speech speed control unit 170 outputs the processing target section in the voice data and performs the processing for the processing target section. The process for controlling the speech speed is terminated. When the speech speed control unit 170 finishes the process for one process target section, the speech speed control unit 170 performs the processes after step S801 for the next process target section.

なお、図１８のフローチャートでは実績伸長率が目標伸長率以上になるまで処理を繰り返しているが、話速を制御する処理は、これに限らず、例えば、ステップＳ８０３以降のループ処理を行う回数に上限値を設けてもよい。すなわち、話速を制御する処理は、ステップＳ８０３以降のループ処理を所定回数行った場合には、実績伸長率が目標伸長率よりも小さくても処理を終了するようにしてもよい。これにより、例えば、処理が長くなり音声データの遅延等による通話品質の劣化を防止することが可能となる。 In the flowchart of FIG. 18, the process is repeated until the actual expansion rate becomes equal to or higher than the target expansion rate. However, the process for controlling the speech speed is not limited to this; for example, an upper limit may be set on the number of times the loop from step S803 onward is performed. That is, once the loop from step S803 onward has been performed a predetermined number of times, the process may end even if the actual expansion rate is still smaller than the target expansion rate. This makes it possible to prevent, for example, degradation of call quality caused by voice data delay when the processing takes too long.
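
The control flow of steps S801 to S807, including the optional cap on the number of loop passes, can be sketched as follows. The period detection and waveform insertion are delegated to a caller-supplied function, and every name and default value here (`insert_one_period`, `target_rate`, `max_iterations`) is hypothetical; only the 8 mora/second threshold comes from the text's example.

```python
def control_speech_rate(rate, section_samples, insert_one_period,
                        target_rate=1.25, rate_threshold=8.0, max_iterations=50):
    """Expand one section until the target expansion rate or the pass limit is reached.

    rate:              speech speed of the section in mora/second
    section_samples:   original number of samples in the section
    insert_one_period: callable that inserts one period and returns the samples added
    Returns (actual_rate, iterations).
    """
    if rate < rate_threshold:  # S801 NO: slow enough, leave the section as is
        return 1.0, 0
    actual_rate = 1.0  # S802: initialize the actual expansion rate
    added_total = 0
    iterations = 0
    # S805 plus the optional upper limit on loop passes.
    while actual_rate < target_rate and iterations < max_iterations:
        added_total += insert_one_period()  # stands in for S803, S804, S806
        actual_rate = (section_samples + added_total) / section_samples  # S807
        iterations += 1
    return actual_rate, iterations
```

The cap bounds the worst-case processing time per section, which is the point of the variation described above: the loop may stop short of the target rather than delay the call audio.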

図１９は、基本周期の検出方法を説明する図である。 FIG. 19 is a diagram for explaining the basic period detection method.
ステップＳ８０３の基本周期を検出する処理では、上記のように、処理対象区間の音声波形についての自己相関に基づいて基本周期を検出する。音声波形についての自己相関は、例えば、図１９に示した曲線５１１のように、シフト量が０から大きくなるにつれて徐々に減少してあるシフト量で極小値となった後、シフト量Ｓｍで極大となる。本実施形態に係る話速制御部１７０の基本周期検出部１７１は、シフト量が０よりも大きい区間において自己相関が初めて極大Ｐとなるシフト量Ｓｍと対応する周期を、基本周期として検出する（算出する）。 In the process of detecting the basic period in step S803, the basic period is detected based on the autocorrelation of the speech waveform in the processing target section, as described above. As shown by the curve 511 in FIG. 19, for example, the autocorrelation of the speech waveform gradually decreases as the shift amount increases from 0, reaches a local minimum at a certain shift amount, and then reaches a local maximum at the shift amount Sm. The basic period detection unit 171 of the speech speed control unit 170 according to the present embodiment detects (calculates), as the basic period, the period corresponding to the shift amount Sm at which the autocorrelation first reaches the local maximum P in the range where the shift amount is greater than 0.

図２０は、音声波形を重ね合わせる方法を説明する図である。図２１は、音声波形を重ねる際の重み付けの方法を説明する図である。 FIG. 20 is a diagram for explaining a method of superimposing speech waveforms. FIG. 21 is a diagram for explaining a weighting method used when superimposing speech waveforms.

ステップＳ８０６の音声波形を重ね合わせる処理では、上記のように、音声データの処理対象区間における基本周期Ｋの音声波形を重ね合わせる。例えば、図２０に示すように処理対象区間の波形ｘ（ｔ）が基本周期Ｋの音声波形５２１，５２２，５２３を含む波形であるとする。この波形ｘ（ｔ）に基本周期Ｋの音声波形を重ね合わせる場合、話速制御部１７０は、音声波形５２２を抽出し、波形ｘ（ｔ）における音声波形５２２と、音声波形５２３との間に音声波形５２２を挿入した波形ｙ（ｔ）を生成する。これにより、処理対象区間の時間長が基本周期Ｋだけ長くなり、処理対象区間を含む発話区間の時間長が基本周期Ｋだけ長くなる。したがって、発話区間に含まれる母音数（モーラ数）を発話区間の時間長で除した値（話速）が小さくなる。 In the process of superimposing the speech waveforms in step S806, the speech waveform of the basic period K in the process target section of the speech data is superimposed as described above. For example, it is assumed that the waveform x (t) in the processing target section is a waveform including speech waveforms 521, 522, and 523 having a basic period K as shown in FIG. When superimposing the speech waveform of the basic period K on this waveform x (t), the speech rate control unit 170 extracts the speech waveform 522 and between the speech waveform 522 and the speech waveform 523 in the waveform x (t). A waveform y (t) into which the speech waveform 522 is inserted is generated. As a result, the time length of the processing target section is increased by the basic period K, and the time length of the utterance section including the processing target section is increased by the basic period K. Therefore, a value (speech speed) obtained by dividing the number of vowels (number of mora) included in the utterance section by the time length of the utterance section becomes small.

また、音声波形を重ねる際には、サンプルの不連続による音質劣化を防ぐために、重み付け加算を行うことが好ましい。 In addition, when superimposing speech waveforms, it is preferable to perform weighted addition in order to prevent deterioration in sound quality due to discontinuity of samples.

例えば、図２１の（ａ）の波形ｘ（ｔ）における基本周期Ｋの音声波形５２２を重ね合わせる際には、音声波形５２２の後方に音声波形５２３の一部を含む第１の範囲５３１と、音声波形５２２の前方に音声波形５２１の一部を含む第２の範囲５３２とを設定する。そして、第２の範囲５３２を基本周期Ｋだけ正の時間方向にシフトさせて第１の範囲５３１と第２の範囲５３２とを結合することで、音声データの波形を伸長させる。 For example, when superimposing the speech waveform 522 of the basic period K in the waveform x (t) of FIG. 21A, a first range 531 including a part of the speech waveform 523 behind the speech waveform 522, A second range 532 including a part of the voice waveform 521 is set in front of the voice waveform 522. Then, the waveform of the audio data is expanded by shifting the second range 532 in the positive time direction by the basic period K and combining the first range 531 and the second range 532.

第１の範囲５３１と、基本周期Ｋだけシフトさせた第２の範囲５３２とを結合する際には、第１の範囲５３１における末尾の部分と、第２の範囲５３２における先頭の部分とが重複する。第１の範囲５３１と第２の範囲５３２とが重複する部分は、第１の範囲５３１における波形（振幅）と、第２の範囲５３２における波形とのそれぞれに重み付けをして加算する。 When combining the first range 531 and the second range 532 shifted by the basic period K, the end portion of the first range 531 and the start portion of the second range 532 overlap. To do. A portion where the first range 531 and the second range 532 overlap is weighted and added to the waveform (amplitude) in the first range 531 and the waveform in the second range 532.

例えば、図２１の（ｂ）に示すように、第１の範囲５３１に対する重み係数ｗ１（ｔ）は、第２の範囲５３２と重ならない時刻ｔからｔ１の区間をｗ１（ｔ）＝１とし、第２の範囲５３２と重なる時刻ｔ１からｔ２の区間をｗ１（ｔ）＝ｆ（ｔ）とする。ここで、関数ｆ（ｔ）は、時刻ｔ１からｔ２の区間においてｆ（ｔ１）＝１からｆ（ｔ２）＝０に単調減少する関数とする。 For example, as shown in FIG. 21B, the weighting factor w1 (t) for the first range 531 is set to w1 (t) = 1 in a section from time t to t1 that does not overlap with the second range 532. A section from time t1 to t2 that overlaps the second range 532 is set to w1 (t) = f (t). Here, the function f (t) is a function that monotonously decreases from f (t1) = 1 to f (t2) = 0 in the interval from time t1 to t2.

また、第２の範囲５３２に対する重み係数ｗ２（ｔ）は、第１の範囲５３１と重なる時刻ｔ１〜ｔ２の区間をｗ２（ｔ）＝ｇ（ｔ）とし、第１の範囲５３１と重ならない時刻ｔ２以降の区間をｗ２（ｔ）＝１とする。ここで、関数ｇ（ｔ）は、時刻ｔ１〜ｔ２の区間においてｇ（ｔ１）＝０からｇ（ｔ２）＝１に単調増加し、かつｆ（ｔ）＋ｇ（ｔ）＝１を満たす関数とする。 The weighting factor w2(t) for the second range 532 is set to w2(t) = g(t) in the interval from time t1 to t2, where it overlaps the first range 531, and to w2(t) = 1 in the interval after time t2, where it does not overlap the first range 531. Here, the function g(t) increases monotonically from g(t1) = 0 to g(t2) = 1 in the interval from time t1 to t2 and satisfies f(t) + g(t) = 1.

図２１の（ａ）に示した音声データの波形ｘ（ｔ）に対し、重み係数ｗ１（ｔ），ｗ２（ｔ）を用いて基本周期Ｋの音声波形５２２を重ね合わせた波形ｙ（ｔ）は、下記式（１０）により算出する。 Waveform y (t) obtained by superimposing speech waveform 522 of basic period K using weighting factors w1 (t) and w2 (t) to waveform x (t) of speech data shown in FIG. Is calculated by the following equation (10).

ｙ（ｔ）＝ｗ１（ｔ）・ｘ（ｔ）＋ｗ２（ｔ）・ｘ（ｔ＋Ｋ）・・・（１０） y(t) = w1(t)·x(t) + w2(t)·x(t + K) ... (10)

このように、基本周期Ｋの音声波形を重ね合わせる際の境界となる部分において波形を重み付け加算することで、母音区間を伸長した音声データの該境界部分において波形（サンプル）が不連続になり音質が劣化することを防ぐことが可能となる。 In this way, by weighting and adding the waveforms at the portion that forms the boundary when the speech waveforms of the basic period K are superimposed, it is possible to prevent the waveform (samples) from becoming discontinuous at that boundary in the voice data whose vowel section has been extended, and thus to prevent the sound quality from deteriorating.
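The weighted addition at the boundary can be illustrated with a short sketch. This is not the implementation of the embodiment: it assumes that f(t) and g(t) are linear ramps (the text only requires them to be monotonic with f(t) + g(t) = 1), and the function name `crossfade_join` and its arguments are hypothetical.

```python
def crossfade_join(first, second, overlap):
    """Join two sample sequences, blending `overlap` samples at the boundary.

    `first` plays the role of the first range 531 and `second` the role of
    the second range 532. The last `overlap` samples of `first` coincide in
    time with the first `overlap` samples of `second`.
    """
    if not 2 <= overlap <= min(len(first), len(second)):
        raise ValueError("overlap must be at least 2 and fit inside both segments")
    keep = len(first) - overlap          # samples of `first` before the overlap
    blended = []
    for i in range(overlap):
        g = i / (overlap - 1)            # assumed linear ramp from 0 to 1
        f = 1.0 - g                      # falls from 1 to 0, so f + g = 1
        blended.append(f * first[keep + i] + g * second[i])
    # Outside the overlap the weights are 1, so the samples pass through unchanged.
    return list(first[:keep]) + blended + list(second[overlap:])
```

Because the two weights always sum to 1, joining two segments that carry the same constant level reproduces that level across the boundary, which is the continuity property the paragraph above relies on.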

以上のように、本実施形態の携帯電話端末２５は、通話中に自装置の収音装置２から入力される入力音声データの話速を推定し、該入力音声データの話速を適正な話速に制御して通話相手の電話機に送信することが可能である。また、本実施形態の携帯電話端末２５における話速調整部２６では、第１の実施形態で説明した話速推定装置１と同様、話速を推定する際に、入力音声データに含まれる母音区間の時間長についての頻度分布に基づいて、各母音区間の母音数を決定する。このため、本実施形態によれば、音声データにおける話速を精度良く推定することが可能であり、容易に音声データの話速を適正な話速に制御する（減速させる）ことが可能となる。また、基本周期Ｋの音声波形を重ね合わせる際に、境界となる部分において波形を重み付け加算することで、母音区間を伸長した音声データにおける音質の劣化を防ぐことが可能となる。 As described above, the mobile phone terminal 25 of the present embodiment can estimate the speech speed of the input voice data input from its own sound collection device 2 during a call, control the speech speed of that input voice data to an appropriate speech speed, and transmit the result to the other party's telephone. In addition, like the speech speed estimation device 1 described in the first embodiment, the speech speed adjustment unit 26 in the mobile phone terminal 25 of the present embodiment determines the number of vowels in each vowel section based on the frequency distribution of the time lengths of the vowel sections included in the input voice data when estimating the speech speed. Therefore, according to the present embodiment, the speech speed in voice data can be estimated accurately, and the speech speed of the voice data can easily be controlled (slowed down) to an appropriate speech speed. Furthermore, by weighting and adding the waveforms at the boundary portion when the speech waveforms of the basic period K are superimposed, deterioration of sound quality in the voice data whose vowel section has been extended can be prevented.

なお、基本周期Ｋの音声波形を重ね合わせる処理は、図２０及び図２１を参照して説明した上記の方法に限らず、他の方法に従って行ってもよい。 Note that the process of superimposing the speech waveforms of the basic period K is not limited to the method described with reference to FIGS. 20 and 21 and may be performed according to another method.

また、話速調整部２６において発話区間及び母音区間を検出する際の検出方法は、既知の検出方法のいずれかであればよい。例えば、発話区間及び母音区間を検出する際には、音声データにおける自己相関係数ＡＣ（ｔ_ｍ）、或いはフォルマント周波数の時間変化平均ΔＦＭ（ｔ_ｍ）に基づいて検出してもよい。 Moreover, the detection method used when the speech speed adjustment unit 26 detects the utterance section and the vowel sections may be any known detection method. For example, the utterance section and the vowel sections may be detected based on the autocorrelation coefficient AC(t_m) of the voice data or on the time-change average ΔFM(t_m) of the formant frequency.
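As a rough illustration of autocorrelation-based detection, the sketch below computes a normalized autocorrelation peak for one frame and then collects runs of consecutive frames whose value stays at or above a threshold. The frame length, lag range, threshold, and both function names are assumptions made for this example and are not taken from the embodiments.

```python
def normalized_autocorr_peak(frame, min_lag, max_lag):
    """Largest normalized autocorrelation of `frame` for lags min_lag..max_lag.

    `max_lag` must be smaller than len(frame). The value is close to 1 for
    strongly periodic (vowel-like) frames.
    """
    energy = sum(s * s for s in frame)
    if energy == 0:
        return 0.0
    best = 0.0
    for lag in range(min_lag, max_lag + 1):
        acc = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        best = max(best, acc / energy)
    return best


def sections_at_or_above(values, threshold):
    """Return (start, end) frame index pairs, end exclusive, for each run of
    consecutive frames whose value is at or above `threshold`."""
    sections = []
    start = None
    for m, v in enumerate(values):
        if v >= threshold and start is None:
            start = m                      # a new candidate section begins
        elif v < threshold and start is not None:
            sections.append((start, m))    # the section ended at the previous frame
            start = None
    if start is not None:
        sections.append((start, len(values)))
    return sections
```

The same run-collecting step could be applied to any per-frame measure, for example a signal-to-noise ratio, by passing that measure's values instead.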

更に、本実施形態に係る携帯電話端末２５の話速調整部２６で行う処理は、図１７のフローチャートに限らず、適宜変更可能である。例えば、話速調整部２６では、ステップＳ３〜Ｓ５の処理の代わりに、上述したステップＳ３’〜Ｓ５’の処理を行ってもよい。話速調整部２６においてステップＳ３’〜Ｓ５’の処理を行う場合、モーラ数決定部１４０は、ステップＳ３’の処理を行う時間長算出部と、ステップＳ４’の処理を行う最小値特定部と、ステップＳ５’の処理を行うモーラ数決定部とを含む。 Furthermore, the processing performed by the speech speed adjustment unit 26 of the mobile phone terminal 25 according to the present embodiment is not limited to the flowchart of FIG. 17 and can be changed as appropriate. For example, the speech speed adjustment unit 26 may perform the above-described steps S3 'to S5' instead of the steps S3 to S5. When the speech speed adjustment unit 26 performs the processes of steps S3 ′ to S5 ′, the mora number determination unit 140 includes a time length calculation unit that performs the process of step S3 ′, and a minimum value identification unit that performs the process of step S4 ′. And a mora number determination unit that performs the process of step S5 ′.

第１の実施形態及び第２の実施形態で挙げた話速推定装置１、並びに第３の実施形態で挙げた携帯電話端末２５は、それぞれ、コンピュータと、該コンピュータに実行させるプログラムとにより実現可能である。以下、図２２を参照して、コンピュータとプログラムとにより実現される話速推定装置１について説明する。 The speech speed estimation device 1 described in the first and second embodiments and the mobile phone terminal 25 described in the third embodiment can each be realized by a computer and a program executed by the computer. Hereinafter, the speech speed estimation device 1 realized by a computer and a program will be described with reference to FIG. 22.

図２２は、コンピュータのハードウェア構成を示す図である。 FIG. 22 is a diagram illustrating a hardware configuration of a computer.
図２２に示すように、コンピュータ８は、プロセッサ８０１と、主記憶装置８０２と、補助記憶装置８０３と、入力装置８０４と、出力装置８０５と、入出力インタフェース８０６と、通信制御装置８０７と、媒体駆動装置８０８と、を備える。コンピュータ８におけるこれらの要素８０１〜８０８は、バス８１０により相互に接続されており、要素間でのデータの受け渡しが可能になっている。 As shown in FIG. 22, the computer 8 includes a processor 801, a main storage device 802, an auxiliary storage device 803, an input device 804, an output device 805, an input/output interface 806, a communication control device 807, and a medium driving device 808. These elements 801 to 808 in the computer 8 are connected to each other via a bus 810 so that data can be exchanged between the elements.

プロセッサ８０１は、Central Processing Unit（ＣＰＵ）やMicro Processing Unit（ＭＰＵ）等である。プロセッサ８０１は、オペレーティングシステムを含む各種のプログラムを実行することにより、コンピュータ８の全体の動作を制御する。プロセッサ８０１は、例えば、図２のステップＳ１〜Ｓ６の処理を含む音声処理プログラムを実行する。 The processor 801 is a central processing unit (CPU), a micro processing unit (MPU), or the like. The processor 801 controls the overall operation of the computer 8 by executing various programs including an operating system. For example, the processor 801 executes a sound processing program including the processes of steps S1 to S6 in FIG.

主記憶装置８０２は、図示しないRead Only Memory（ＲＯＭ）及びRandom Access Memory（ＲＡＭ）を含む。主記憶装置８０２のＲＯＭには、例えば、コンピュータ８の起動時にプロセッサ８０１が読み出す所定の基本制御プログラム等が予め記録されている。一方、主記憶装置８０２のＲＡＭは、プロセッサ８０１が、各種のプログラムを実行する際に必要に応じて作業用記憶領域として使用する。主記憶装置８０２のＲＡＭは、例えば、音声データの一部、発話区間及び母音区間を示す情報、母音区間の時間長の頻度分布、算出した話速等の保持（記憶）に利用可能である。 The main storage device 802 includes a read only memory (ROM) and a random access memory (RAM) not shown. In the ROM of the main storage device 802, for example, a predetermined basic control program read by the processor 801 when the computer 8 is started is recorded in advance. On the other hand, the RAM of the main storage device 802 is used as a working storage area as needed when the processor 801 executes various programs. The RAM of the main storage device 802 can be used, for example, for holding (storing) a part of voice data, information indicating a speech segment and a vowel segment, a frequency distribution of time length of the vowel segment, a calculated speech speed, and the like.

補助記憶装置８０３は、主記憶装置８０２のＲＡＭと比べて容量の大きい記憶装置であり、例えば、Hard Disk Drive（ＨＤＤ）や、フラッシュメモリのような不揮発性メモリ（Solid State Drive（ＳＳＤ）を含む）等である。補助記憶装置８０３は、プロセッサ８０１によって実行される各種のプログラムや各種のデータ等の記憶に利用可能である。補助記憶装置８０３は、例えば、図２のステップＳ１〜Ｓ６の処理を含む音声処理プログラムの記憶に利用可能である。また、補助記憶装置８０３は、例えば、音声データ、該音声データに含まれる発話区間及び母音区間を示す情報、母音区間の時間長の頻度分布、算出した話速等の保持（記憶）に利用可能である。 The auxiliary storage device 803 is a storage device having a larger capacity than the RAM of the main storage device 802, for example, a hard disk drive (HDD) or a non-volatile memory such as a flash memory (including a solid state drive (SSD)). The auxiliary storage device 803 can be used for storing various programs executed by the processor 801, various data, and the like. The auxiliary storage device 803 can be used, for example, for storing a voice processing program including the processes of steps S1 to S6 in FIG. 2. In addition, the auxiliary storage device 803 can be used for holding (storing), for example, voice data, information indicating the utterance sections and vowel sections included in the voice data, the frequency distribution of the time lengths of the vowel sections, and the calculated speech speed.

入力装置８０４は、例えば、キーボード装置やタッチパネル装置等である。コンピュータ８のオペレータ（利用者）が入力装置８０４に対して所定の操作を行うと、入力装置８０４は、その操作内容に対応付けられている入力情報をプロセッサ８０１に送信する。入力装置８０４は、例えば、話速を推定する処理を開始させる命令の入力や、各種設定値の入力等に利用可能である。 The input device 804 is, for example, a keyboard device or a touch panel device. When an operator (user) of the computer 8 performs a predetermined operation on the input device 804, the input device 804 transmits input information associated with the operation content to the processor 801. The input device 804 can be used, for example, for inputting a command for starting a process for estimating speech speed, inputting various setting values, and the like.

出力装置８０５は、例えば、液晶表示装置等の表示装置やプリンタ等の印刷装置である。出力装置８０５は、図２のステップＳ６で算出した発話区間の話速の出力に利用可能である。 The output device 805 is, for example, a display device such as a liquid crystal display device or a printing device such as a printer. The output device 805 can be used to output the speech speed of the utterance section calculated in step S6 of FIG.

入出力インタフェース８０６は、コンピュータ８と、他の電子機器とを接続する。入出力インタフェース８０６は、例えば、フォーンジャックや、Universal Serial Bus（ＵＳＢ）規格のコネクタ等を備える。入出力インタフェース８０６は、例えば、コンピュータ８と、収音装置２との接続に利用可能である。 The input / output interface 806 connects the computer 8 and other electronic devices. The input / output interface 806 includes, for example, a phone jack, a universal serial bus (USB) standard connector, and the like. The input / output interface 806 can be used, for example, for connection between the computer 8 and the sound collection device 2.

通信制御装置８０７は、コンピュータ８をインターネット等のネットワークに接続し、ネットワークを介したコンピュータ８と他の通信機器との各種通信を制御する装置である。通信制御装置８０７は、例えば、コンピュータ８と、電話機１６等との間での音声データの送受信に利用可能である。 The communication control device 807 is a device that connects the computer 8 to a network such as the Internet and controls various communications between the computer 8 and other communication devices via the network. The communication control device 807 can be used for transmission / reception of audio data between the computer 8 and the telephone set 16, for example.

媒体駆動装置８０８は、可搬型記憶媒体８９０に記録されているプログラムやデータの読み出し、補助記憶装置８０３に記憶されたデータ等の可搬型記憶媒体８９０への書き込みを行う。媒体駆動装置８０８には、例えば、１種類又は複数種類の規格に対応したメモリカード用リーダ／ライタが利用可能である。媒体駆動装置８０８としてメモリカード用リーダ／ライタを用いる場合、可搬型記憶媒体８９０としては、メモリカード用リーダ／ライタが対応している規格、例えば、Secure Digital（ＳＤ）規格のメモリカード（フラッシュメモリ）等を利用可能である。また、可搬型記録媒体８９０としては、例えば、ＵＳＢ規格のコネクタを備えたフラッシュメモリが利用可能である。更に、コンピュータ８が媒体駆動装置８０８として利用可能な光ディスクドライブを搭載している場合、当該光ディスクドライブで認識可能な各種の光ディスクを可搬型記録媒体８９０として利用可能である。可搬型記録媒体８９０として利用可能な光ディスクには、例えば、Compact Disc（ＣＤ）、Digital Versatile Disc（ＤＶＤ）、Blu-ray Disc（Blu-rayは登録商標）等がある。可搬型記録媒体８９０は、例えば、図２のステップＳ１〜Ｓ６の処理を含む音声処理プログラムの記憶に利用可能である。また、可搬型記録媒体８９０は、例えば、音声データ、該音声データに含まれる発話区間及び母音区間を示す情報、母音区間の時間長の頻度分布、算出した話速等の保持（記憶）に利用可能である。 The medium driving device 808 reads a program or data recorded in the portable storage medium 890 and writes data stored in the auxiliary storage device 803 to the portable storage medium 890. For the medium driving device 808, for example, a memory card reader / writer corresponding to one type or a plurality of types of standards can be used. When a memory card reader / writer is used as the medium driving device 808, the portable storage medium 890 is a memory card (flash memory) conforming to a standard supported by the memory card reader / writer, for example, the Secure Digital (SD) standard. ) Etc. can be used. Further, as the portable recording medium 890, for example, a flash memory having a USB standard connector can be used. Further, when the computer 8 is equipped with an optical disk drive that can be used as the medium driving device 808, various optical disks that can be recognized by the optical disk drive can be used as the portable recording medium 890. Examples of the optical disc that can be used as the portable recording medium 890 include a Compact Disc (CD), a Digital Versatile Disc (DVD), and a Blu-ray Disc (Blu-ray is a registered trademark). The portable recording medium 890 can be used, for example, for storing a voice processing program including the processes in steps S1 to S6 in FIG. 
The portable recording medium 890 can also be used to hold (store), for example, voice data, information indicating the utterance sections and vowel sections included in the voice data, the frequency distribution of the time lengths of the vowel sections, and the calculated speech speed.

例えば、オペレータが入力装置８０４等を利用して話速を推定する処理を開始する命令をコンピュータ８に入力すると、プロセッサ８０１が、補助記憶装置８０３等の非一時的な記録媒体に記憶させた音声処理プログラムを読み出して実行する。音声処理プログラムが図２のステップＳ１〜Ｓ７の処理を含むプログラムである場合、コンピュータ８は、入力音声データから話速を算出し、算出した話速を表示装置等の出力装置８０５に出力する処理を繰り返す。音声処理プログラムを実行している間、プロセッサ８０１は、話速推定装置１における音声取得部１１０、発話区間検出部１２０、母音区間検出部１３０、モーラ数決定部１４０、話速算出部１５０、及び出力部１６０として機能する（動作する）。また、プロセッサ８０１が音声処理プログラムを実行している間、主記憶装置８０２のＲＡＭや補助記憶装置８０３等は、話速推定装置１の図示してない記憶部として機能する。すなわち、主記憶装置８０２のＲＡＭや補助記憶装置８０３等は、音声データ、該音声データに含まれる発話区間及び母音区間を示す情報、母音区間の時間長の頻度分布、算出した話速等を記憶する記憶部として機能する。 For example, when the operator uses the input device 804 or the like to input to the computer 8 a command to start the process of estimating the speech speed, the processor 801 reads and executes the voice processing program stored in a non-transitory recording medium such as the auxiliary storage device 803. When the voice processing program includes the processes of steps S1 to S7 in FIG. 2, the computer 8 repeats the process of calculating the speech speed from the input voice data and outputting the calculated speech speed to the output device 805 such as a display device. While executing the voice processing program, the processor 801 functions (operates) as the voice acquisition unit 110, the utterance section detection unit 120, the vowel section detection unit 130, the mora number determination unit 140, the speech speed calculation unit 150, and the output unit 160 of the speech speed estimation device 1. While the processor 801 is executing the voice processing program, the RAM of the main storage device 802, the auxiliary storage device 803, and the like function as a storage unit (not shown) of the speech speed estimation device 1. That is, the RAM of the main storage device 802, the auxiliary storage device 803, and the like function as a storage unit that stores voice data, information indicating the utterance sections and vowel sections included in the voice data, the frequency distribution of the time lengths of the vowel sections, the calculated speech speed, and the like.

なお、話速推定装置１として動作させるコンピュータ８は、図２２に示した全ての要素８０１〜８０８を含む必要はなく、用途や条件に応じて一部の要素を省略することも可能である。例えば、コンピュータ８は、通信制御装置８０７や媒体駆動装置８０８が省略されたものであってもよい。 Note that the computer 8 operated as the speech speed estimation apparatus 1 does not need to include all the elements 801 to 808 shown in FIG. 22, and some elements can be omitted depending on the application and conditions. For example, the computer 8 may be one in which the communication control device 807 and the medium driving device 808 are omitted.

また、コンピュータ８に実行させる音声処理プログラムは、図１７のフローチャートのように、算出した話速に基づいて音声データの話速を制御し、話速を制御した音声データを出力する処理を含むプログラムであってもよい。 The voice processing program executed by the computer 8 may also be a program that, as in the flowchart of FIG. 17, includes a process of controlling the speech speed of the voice data based on the calculated speech speed and outputting the voice data whose speech speed has been controlled.

更に、コンピュータ８は、携帯電話端末２５等の電話機として動作させることも可能である。コンピュータ８を電話機として動作させる、例えば、コンピュータ８に、各実施形態で説明した処理を行う音声処理プログラムと並行して、コンピュータ８と、他の電話機との間で音声データを送受信する通話処理プログラムを実行させる。この場合、コンピュータ８に実行させるプログラムは、ステップＳ１〜Ｓ６の処理により算出した（推定した）話速に基づいて、音声データの話速を制御する処理を行うプログラムであってもよい。 Furthermore, the computer 8 can also be operated as a telephone such as the mobile phone terminal 25. To operate the computer 8 as a telephone, for example, the computer 8 is made to execute, in parallel with the voice processing program that performs the processing described in each embodiment, a call processing program that transmits and receives voice data between the computer 8 and another telephone. In this case, the program executed by the computer 8 may be a program that controls the speech speed of the voice data based on the speech speed calculated (estimated) by the processes of steps S1 to S6.

加えて、コンピュータ８に実行させる音声処理プログラムは、図２及び図１７のフローチャートにおけるステップＳ３〜Ｓ５の処理が、上述したステップＳ３’〜Ｓ５’の処理に置換されたプログラムであってもよい。 In addition, the voice processing program to be executed by the computer 8 may be a program in which the processes in steps S3 to S5 in the flowcharts of FIGS. 2 and 17 are replaced with the processes in steps S3 'to S5' described above.

以上記載した各実施形態に関し、更に以下の付記を開示する。
（付記１）
入力音声データにおける発話区間に含まれる複数の母音区間を検出し、
検出した前記複数の母音区間それぞれの時間長を算出し、
前記複数の母音区間の時間長についての頻度分布を算出するとともに、前記複数の母音区間それぞれの時間長のうちの最小値を特定し、
特定した前記時間長の最小値と、前記頻度分布とに基づいて、前記複数の母音区間それぞれの時間長と対応するモーラ数を算出し、
算出した前記モーラ数に応じて前記入力音声データにおける前記発話区間と対応する出力信号を制御する、
処理をコンピュータに実行させることを特徴とする音声処理プログラム。
（付記２）
前記時間長の最小値を特定する処理では、
前記頻度分布において頻度がピークとなる複数の時間長のうちの最小値を特定する、
ことを特徴とする付記１に記載の音声処理プログラム。
（付記３）
前記時間長の最小値を特定する処理では、
前記頻度分布において頻度がピークとなる複数の時間長を特定し、
前記複数の時間長に基づいて、前記頻度分布において隣接する前記ピーク間の時間長を算出し、
算出した前記ピーク間の時間長の平均値を前記時間長の最小値に特定する、
ことを特徴とする付記１に記載の音声処理プログラム。
（付記４）
前記頻度分布を算出する処理では、
前記発話区間に含まれる全ての母音区間のうちの、該発話区間における末尾の母音区間を除外して前記頻度分布を算出する、
ことを特徴とする付記１に記載の音声処理プログラム。
（付記５）
前記頻度分布を算出する処理では、
前記発話区間に含まれる全ての母音区間のうちの、時間長が所定の範囲内である母音区間を抽出して前記頻度分布を算出する、
ことを特徴とする付記１に記載の音声処理プログラム。
（付記６）
前記モーラ数を算出する処理では、
前記母音区間の時間長を前記時間長の最小値で除した値に最も近い整数値を、該母音区間のモーラ数とする、
ことを特徴とする付記１に記載の音声処理プログラム。
（付記７）
前記母音区間を検出する処理では、
前記入力音声データにおける信号対雑音比を算出し、
算出した前記信号対雑音比が閾値以上で連続する区間を前記母音区間として検出する、
ことを特徴とする付記1に記載の音声処理プログラム。
（付記８）
前記母音区間を検出する処理では、
前記入力音声データにおける波形自己相関を算出し、
算出した前記波形自己相関が閾値以上で連続する区間を前記母音区間として検出する、
ことを特徴とする付記1に記載の音声処理プログラム。
（付記９）
前記母音区間を検出する処理では、
前記入力音声データにおけるフォルマント周波数を算出し、
算出した前記フォルマント周波数の時間変化量が閾値以上となる時刻を前記母音区間と非母音区間との境界として検出する、処理を含む、
ことを特徴とする付記1に記載の音声処理プログラム。
（付記１０）
前記出力信号を制御する処理は、
算出した前記複数の母音区間それぞれのモーラ数と、前記入力音声データにおける前記発話区間の時間長とに基づいて、前記入力音声データにおける前記発話区間の話速を算出する、処理を含む、
ことを特徴とする付記１に記載の音声処理プログラム。
（付記１１）
前記出力信号を制御する処理は、
算出した前記話速が閾値以上である場合に、前記入力音声データにおける前記母音区間を伸長して話速を低下させる、処理を更に含む、
ことを特徴とする付記９に記載の音声処理プログラム。
（付記１２）
コンピュータが、
入力音声データにおける発話区間に含まれる複数の母音区間を検出し、
検出した前記複数の母音区間それぞれの時間長を算出し、
前記複数の母音区間の時間長についての頻度分布を算出するとともに、前記複数の母音区間それぞれの時間長のうちの最小値を特定し、
特定した前記時間長の最小値と、前記頻度分布とに基づいて、前記複数の母音区間それぞれの時間長と対応するモーラ数を算出し、
算出した前記モーラ数に応じて前記入力音声データにおける前記発話区間と対応する出力信号を制御する、
処理を実行することを特徴とする音声処理方法。
（付記１３）
入力音声データにおける発話区間に含まれる複数の母音区間を検出する母音区間検出部と、
検出した前記複数の母音区間それぞれの時間長を算出して前記複数の母音区間の時間長についての頻度分布を算出するとともに、前記複数の母音区間それぞれの時間長のうちの最小値を特定し、特定した前記時間長の最小値と、前記頻度分布とに基づいて、前記複数の母音区間それぞれの時間長と対応するモーラ数を決定するモーラ数決定部と、
決定した前記モーラ数と、前記発話区間の時間長とに基づいて前記入力音声データにおける前記発話区間の話速を算出する話速算出部と、
を備えることを特徴とする音声処理装置。
The following additional notes are further disclosed regarding the embodiments described above.
(Appendix 1)
A voice processing program that causes a computer to execute a process comprising:
detecting a plurality of vowel sections included in an utterance section in input voice data;
calculating a time length of each of the detected vowel sections;
calculating a frequency distribution of the time lengths of the plurality of vowel sections, and specifying a minimum value among the time lengths of the plurality of vowel sections;
calculating, based on the specified minimum value of the time length and the frequency distribution, the number of morae corresponding to the time length of each of the plurality of vowel sections; and
controlling an output signal corresponding to the utterance section in the input voice data according to the calculated number of morae.
(Appendix 2)
The voice processing program according to appendix 1, wherein the process of specifying the minimum value of the time length specifies the minimum value among a plurality of time lengths at which the frequency peaks in the frequency distribution.
(Appendix 3)
The voice processing program according to appendix 1, wherein the process of specifying the minimum value of the time length comprises:
specifying a plurality of time lengths at which the frequency peaks in the frequency distribution;
calculating, based on the plurality of time lengths, the time length between adjacent peaks in the frequency distribution; and
specifying the average of the calculated time lengths between the peaks as the minimum value of the time length.
(Appendix 4)
The voice processing program according to appendix 1, wherein the process of calculating the frequency distribution calculates the frequency distribution while excluding, from all the vowel sections included in the utterance section, the last vowel section in the utterance section.
(Appendix 5)
The voice processing program according to appendix 1, wherein the process of calculating the frequency distribution extracts, from all the vowel sections included in the utterance section, the vowel sections whose time length is within a predetermined range, and calculates the frequency distribution from them.
(Appendix 6)
The voice processing program according to appendix 1, wherein the process of calculating the number of morae sets, as the number of morae of a vowel section, the integer closest to the value obtained by dividing the time length of the vowel section by the minimum value of the time length.
(Appendix 7)
The voice processing program according to appendix 1, wherein the process of detecting the vowel sections comprises:
calculating a signal-to-noise ratio in the input voice data; and
detecting, as a vowel section, a section in which the calculated signal-to-noise ratio remains at or above a threshold.
(Appendix 8)
The voice processing program according to appendix 1, wherein the process of detecting the vowel sections comprises:
calculating a waveform autocorrelation in the input voice data; and
detecting, as a vowel section, a section in which the calculated waveform autocorrelation remains at or above a threshold.
(Appendix 9)
The voice processing program according to appendix 1, wherein the process of detecting the vowel sections comprises:
calculating a formant frequency in the input voice data; and
detecting a time at which the amount of temporal change in the calculated formant frequency is equal to or greater than a threshold as a boundary between a vowel section and a non-vowel section.
(Appendix 10)
The voice processing program according to appendix 1, wherein the process of controlling the output signal includes calculating the speech speed of the utterance section in the input voice data based on the calculated number of morae of each of the plurality of vowel sections and the time length of the utterance section in the input voice data.
(Appendix 11)
The voice processing program according to appendix 9, wherein the process of controlling the output signal further includes, when the calculated speech speed is equal to or higher than a threshold, extending the vowel sections in the input voice data to reduce the speech speed.
(Appendix 12)
A voice processing method in which a computer executes a process comprising:
detecting a plurality of vowel sections included in an utterance section in input voice data;
calculating a time length of each of the detected vowel sections;
calculating a frequency distribution of the time lengths of the plurality of vowel sections, and specifying a minimum value among the time lengths of the plurality of vowel sections;
calculating, based on the specified minimum value of the time length and the frequency distribution, the number of morae corresponding to the time length of each of the plurality of vowel sections; and
controlling an output signal corresponding to the utterance section in the input voice data according to the calculated number of morae.
(Appendix 13)
A voice processing apparatus comprising:
a vowel section detection unit that detects a plurality of vowel sections included in an utterance section in input voice data;
a mora number determination unit that calculates a time length of each of the detected vowel sections, calculates a frequency distribution of the time lengths of the plurality of vowel sections, specifies a minimum value among the time lengths of the plurality of vowel sections, and determines, based on the specified minimum value of the time length and the frequency distribution, the number of morae corresponding to the time length of each of the plurality of vowel sections; and
a speech speed calculation unit that calculates the speech speed of the utterance section in the input voice data based on the determined number of morae and the time length of the utterance section.
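The mora-counting steps in appendices 1, 2, 6, and 10 can be sketched as follows. This is only an illustration under stated assumptions: the histogram bin width, the definition of a peak as a local maximum, the floor of one mora per vowel section, and all function names are hypothetical choices that are not specified in the text above.

```python
from collections import Counter


def duration_histogram(durations_ms, bin_ms):
    """Frequency distribution of vowel-section time lengths (keys are bin indices)."""
    return Counter(int(d // bin_ms) for d in durations_ms)


def shortest_peak_duration(hist, bin_ms):
    """Centre of the lowest bin that is a local maximum of the histogram.

    Corresponds to taking the minimum among the time lengths at which the
    frequency peaks (appendix 2).
    """
    for b in sorted(hist):
        if hist[b] > hist.get(b - 1, 0) and hist[b] >= hist.get(b + 1, 0):
            return (b + 0.5) * bin_ms
    raise ValueError("empty histogram")


def mora_counts(durations_ms, unit_ms):
    """Nearest integer to duration / unit for each vowel section (appendix 6).

    Assumption: every detected vowel section counts as at least one mora.
    """
    return [max(1, int(d / unit_ms + 0.5)) for d in durations_ms]


def speech_rate(moras, utterance_s):
    """Morae per second over the utterance section (appendix 10)."""
    return sum(moras) / utterance_s
```

For instance, with durations of roughly 100 ms, 200 ms, and 300 ms in one utterance, the shortest histogram peak lands near 100 ms, so the longer sections are counted as two and three morae respectively.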

１話速推定装置
２，１６２０収音装置
３表示装置
８コンピュータ
９Ａ，９Ｂ話者
１０通話システム
１１通話処理装置
１２情報処理装置
１３，１６３０レシーバ
１４Ａ，１４Ｂ交換機
１５ネットワーク
１６電話機
２１，１６１０通話処理部
２５携帯電話端末
２６話速調整部
３０基地局
１１０音声取得部
１２０発話区間検出部
１３０母音区間検出部
１４０モーラ数決定部
１４１頻度分布算出部
１４２ピーク検出部
１４３モーラ数算出部
１５０話速算出部
１６０，１６２出力部
１７０話速制御部
１７１基本周期検出部
１７２波形処理部
１７３伸長制御部
１９０記憶部
１９１音声データ
１９２目標伸長率
DESCRIPTION OF SYMBOLS
1 Speech speed estimation apparatus
2, 1620 Sound collection apparatus
3 Display apparatus
8 Computer
9A, 9B Speaker
10 Call system
11 Call processing apparatus
12 Information processing apparatus
13, 1630 Receiver
14A, 14B Switch
15 Network
16 Telephone
21, 1610 Call processing part
25 Mobile phone terminal
26 Speech speed adjustment unit
30 Base station
110 Speech acquisition unit
120 Utterance interval detection unit
130 Vowel interval detection unit
140 Mora number determination unit
141 Frequency distribution calculation unit
142 Peak detection unit
143 Mora number calculation unit
150 Speech rate calculation unit
160, 162 Output unit
170 Speech rate control unit
171 Basic period detection unit
172 Waveform processing unit
173 Extension control unit
190 Storage unit
191 Audio data
192 Target extension rate

Claims

A voice processing program that causes a computer to execute a process comprising:
detecting a plurality of vowel sections included in an utterance section in input voice data;
calculating a time length of each of the detected vowel sections;
calculating a frequency distribution of the time lengths of the plurality of vowel sections, and specifying a minimum value among the time lengths of the plurality of vowel sections;
calculating, based on the specified minimum value of the time length and the frequency distribution, the number of morae corresponding to the time length of each of the plurality of vowel sections; and
controlling an output signal corresponding to the utterance section in the input voice data according to the calculated number of morae.

The voice processing program according to claim 1, wherein the process of specifying the minimum value of the time length specifies the minimum value among a plurality of time lengths at which the frequency peaks in the frequency distribution.

The voice processing program according to claim 1, wherein the process of specifying the minimum value of the time length comprises:
specifying a plurality of time lengths at which the frequency peaks in the frequency distribution;
calculating, based on the plurality of time lengths, the time length between adjacent peaks in the frequency distribution; and
specifying the average of the calculated time lengths between the peaks as the minimum value of the time length.

The voice processing program according to claim 1, wherein the process of calculating the frequency distribution calculates the frequency distribution while excluding, from all the vowel sections included in the utterance section, the last vowel section in the utterance section.

The voice processing program according to claim 1, wherein the process of calculating the number of morae sets, as the number of morae of a vowel section, the integer closest to the value obtained by dividing the time length of the vowel section by the minimum value of the time length.

The voice processing program according to claim 1, wherein the process of controlling the output signal includes calculating the speech speed of the utterance section in the input voice data based on the determined number of morae of each of the plurality of vowel sections and the time length of the utterance section in the input voice data.

The voice processing program according to claim 6, wherein the process of controlling the output signal further includes, when the calculated speech speed is equal to or higher than a threshold, extending the vowel sections in the input voice data to reduce the speech speed.

A voice processing method in which a computer executes a process comprising:
detecting a plurality of vowel sections included in an utterance section in input voice data;
calculating a time length of each of the detected vowel sections;
calculating a frequency distribution of the time lengths of the plurality of vowel sections, and specifying a minimum value among the time lengths of the plurality of vowel sections;
calculating, based on the specified minimum value of the time length and the frequency distribution, the number of morae corresponding to the time length of each of the plurality of vowel sections; and
controlling an output signal corresponding to the utterance section in the input voice data according to the calculated number of morae.

A voice processing apparatus comprising:
a vowel section detection unit that detects a plurality of vowel sections included in an utterance section in input voice data;
a mora number determination unit that calculates a time length of each of the detected vowel sections, calculates a frequency distribution of the time lengths of the plurality of vowel sections, specifies a minimum value among the time lengths of the plurality of vowel sections, and determines, based on the specified minimum value of the time length and the frequency distribution, the number of morae corresponding to the time length of each of the plurality of vowel sections; and
a speech speed calculation unit that calculates the speech speed of the utterance section in the input voice data based on the determined number of morae and the time length of the utterance section.