JP5863472B2

JP5863472B2 - Speaking speed conversion device and program thereof

Info

Publication number: JP5863472B2
Application number: JP2012008073A
Authority: JP
Inventors: 今井　篤; 篤今井; 信正清山; 都木　徹; 徹都木
Original assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2012-01-18
Filing date: 2012-01-18
Publication date: 2016-02-16
Anticipated expiration: 2032-01-18
Also published as: JP2013148654A

Description

The present invention relates to a speech rate conversion device that converts the speech rate when audio content is played back, and to a program for the device.

In recent years, pre-recorded audio content such as audiobooks, and audio content distributed over communication media such as the Internet, have become widespread. Accordingly, users increasingly want to listen to such content at high speed. A common way to meet this demand is to raise the playback speed of the audio content uniformly.
This technique linearly expands or contracts the audio waveform according to the playback magnification. A technique that shortens the duration while preserving the pitch of the original speech, without the pitch shift that occurs when an analog audio signal is sped up, has also been disclosed (see, for example, Patent Document 1).

However, with techniques that expand or contract the audio waveform in this way, roughly 3x playback is considered the limit of what a listener can follow, although this depends on the speech rate of the original audio.
As a way to keep speech easy to follow even at increased playback speed, a technique has been proposed that deletes part of the silent sections in the audio signal and allocates the time saved to speech playback (see, for example, Patent Document 2).

When an audio signal is converted at a specified conversion magnification into a signal of a target duration, this technique plays the speech as slowly as possible within that duration. That is, when the playback speed is raised by speech rate conversion, it deletes part of the silent sections and thereby raises the proportion of the target duration that is occupied by speech. Compared with converting the audio signal to the target duration without deleting silent sections, speech takes up a larger share of the target duration and is therefore played back more slowly.

A technique has also been proposed that keeps the audio playback speed unchanged while partially deleting the audio signal, so that the video corresponding to the audio can be played faster (see, for example, Patent Document 3).
This technique writes the audio signal into a ring memory, reads it out at 1x playback timing, and plays the corresponding video signal at n-times speed. Audio that fits within the capacity of the ring memory is played at 1x speed, while audio that exceeds the capacity is deleted.

Patent Document 1: JP-A-9-16193
Patent Document 2: JP-A-10-301598
Patent Document 3: JP-A-8-147874

With the technique of Patent Document 1, roughly 3x playback is considered the limit of what a listener can follow, and the content cannot be grasped at higher speeds.
By contrast, a person reading text by eye can skim it and grasp its content even faster than by listening to the text read aloud at 3x speed. Visually impaired users in particular want to listen to audio content at least as fast as skimming.

The technique of Patent Document 2 can make high-speed playback easier to follow. However, even listeners for whom 3x playback is fast enough become tired when listening for a long time, so further improvement in ease of listening has been desired.
The technique of Patent Document 3 unconditionally deletes any audio signal that overflows the ring buffer. As a result, speech that carries important meaning may be deleted from the audio signal, making the content difficult to grasp.

The present invention has been made in view of these problems and demands. Its object is to provide a speech rate conversion device that can play back audio content at high speed while letting the listener grasp the content, as if skimming text, and that keeps the speech easy to follow even during high-speed playback.

The present invention was devised to solve the above problem. The speech rate conversion device of the present invention partially deletes audio content and plays it back at a specified playback magnification, and comprises audio content storage means, acoustic feature storage means, section information storage means, deletion section search means, and output time length adjustment means.

In this configuration, the speech rate conversion device stores in advance, in the audio content storage means, the audio content to be converted. It also stores in advance, in the acoustic feature storage means, the acoustic feature of the audio content at each time, associated with that time. The acoustic feature is a physical feature of the speech as sound: pitch (the physical height of the voice) and power (the physical loudness of the voice), or either pitch or power alone.

The speech rate conversion device also stores in advance, in the section information storage means, the speech sections and non-speech sections of the audio content, associated with times in the audio content. Speech and non-speech sections can be distinguished, for example, by whether the power of the audio is above or below a predetermined threshold. A speech section is a section in which the speaker is speaking, and a non-speech section is a section in which the speaker is not speaking. Non-speech sections also include noise, silence, and the like.

The speech rate conversion device then uses the deletion section search means to search, within the speech section immediately preceding a non-speech section and working backward from the end time of that speech section, for a section in which the change in the acoustic feature is smaller than a predetermined criterion, and treats it as a deletion section of the audio content. In other words, a speech section just before a non-speech section in which the acoustic feature changes little, for example one in which the loudness barely changes, is set as a section to be deleted from the audio content. A speech section with little change in the acoustic feature is judged to reflect a weak intention on the speaker's part to convey something to the listener, and is therefore deleted in the present invention.
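
The backward search described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not define the "predetermined criterion" at this point, so the sketch treats a frame-to-frame change below a hypothetical `max_delta` as "little change", and the function name `find_flat_tail` is likewise made up for the example.

```python
def find_flat_tail(features, max_delta):
    """Return the frame index where the low-variation tail of a speech section begins.

    features:  per-frame feature values (e.g. power in dB) for ONE speech section.
    max_delta: assumed criterion; a frame-to-frame change below this is "small".
    Frames from the returned index to the end are deletion candidates.
    If no flat tail exists, len(features) is returned (nothing to delete).
    """
    start = len(features) - 1
    # Walk backward from the end while consecutive frames differ only slightly.
    while start > 0 and abs(features[start] - features[start - 1]) < max_delta:
        start -= 1
    # A tail of a single frame carries no evidence of "little change".
    return start if start < len(features) - 1 else len(features)


power_db = [60.0, 66.0, 71.0, 64.0, 58.0, 57.5, 57.2, 57.0]
tail_start = find_flat_tail(power_db, max_delta=1.0)  # frames 4..7 are nearly flat
```

In this example the last four frames change by less than 1 dB from one to the next, so they form the candidate deletion section, while the earlier, more variable frames are kept.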

The speech rate conversion device then uses the output time length adjustment means to expand or contract the audio content that remains after removing the deletion sections found by the deletion section search means, so that its output duration equals the duration of the original audio content at the specified playback magnification, and outputs the result. Because deletion sections have been set, the speech to be output gains extra playback time equal to the amount deleted. As a result, the output speech is played more slowly than when the original audio content is rate-converted as is.
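
The arithmetic behind this adjustment can be illustrated with a short sketch. It assumes that playback at magnification N means an output duration of the original duration divided by N; the function name `stretch_plan` and the example figures are hypothetical.

```python
def stretch_plan(total_sec, deleted_sec, magnification):
    """Return (target duration, remaining duration, effective speed of remaining audio)."""
    target = total_sec / magnification      # output duration requested by the user
    remaining = total_sec - deleted_sec     # audio left after removing deletion sections
    effective_speed = remaining / target    # how fast the remaining audio must be played
    return target, remaining, effective_speed


# 10 minutes of content, 150 s of it marked for deletion, requested at 4x:
target, remaining, speed = stretch_plan(600.0, 150.0, 4.0)
```

Here the output still lasts 150 s, as it would for plain 4x playback, but the 450 s of retained audio only needs to be played at 3x, which is the sense in which the output speech is played more slowly.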

The speech rate conversion device of the present invention further comprises acoustic feature extraction means and section information detection means.

In this configuration, the speech rate conversion device uses the acoustic feature extraction means to extract acoustic features from the audio content and write them to the acoustic feature storage means in association with time. These acoustic features are physical features such as power and pitch.
The speech rate conversion device also uses the section information detection means to detect speech sections and non-speech sections in the audio content and write them to the section information storage means in association with time.

In this way, the acoustic feature extraction means and the section information detection means analyze the audio content in advance on the basis of its acoustic characteristics, extract the acoustic feature at each time as well as the speech and non-speech sections, and thereby prepare for the search for deletion sections. This allows the speech rate conversion device of the present invention to perform speech rate conversion on arbitrary audio content.

In the speech rate conversion device of the present invention, when the acoustic features are pitch, which indicates the height of the voice, and power, which indicates the loudness of the voice, the deletion section search means comprises pitch reference search means, power reference search means, and deletion section determination means.

In this configuration, the speech rate conversion device uses the pitch reference search means to search, working backward from the end time of the speech section, for a section in which the change in pitch is smaller than a predetermined criterion, as a deletion section. It likewise uses the power reference search means to search, working backward from the end time of the speech section, for a section in which the change in power is smaller than a predetermined criterion. That is, for the same speech section, the device searches for two deletion sections, one based on pitch and one based on power, whose lengths measured back from the end time may differ.

The speech rate conversion device then uses the deletion section determination means to determine the deletion section in the speech section from the deletion section found by the pitch reference search means and the one found by the power reference search means, according to a predetermined ratio of the pitch and power weights. If the pitch weight is larger, a section close to the one found by the pitch reference search means is set; if the power weight is larger, a section close to the one found by the power reference search means is set.
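
One way to realize such a weighted decision is sketched below. The text states only that the result lies closer to the candidate with the larger weight; linear interpolation between the two candidate start times is an assumption made for this example, and `combine_deletion_start` is a hypothetical name. Both candidates share the same end time (the end of the speech section), so only the start times need to be combined.

```python
def combine_deletion_start(pitch_start, power_start, pitch_weight, power_weight):
    """Blend two candidate deletion start times (seconds) by their weights.

    Assumption: a weighted average. A larger pitch weight pulls the result
    toward the pitch-based candidate, and likewise for power.
    """
    total = pitch_weight + power_weight
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (pitch_weight * pitch_start + power_weight * power_start) / total


# Pitch-based candidate starts at 8.0 s, power-based at 9.0 s, weights 3:1.
start = combine_deletion_start(8.0, 9.0, pitch_weight=3, power_weight=1)
```

With weights of 3:1 in favor of pitch, the combined start time lands a quarter of the way from the pitch-based candidate toward the power-based one.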

This allows the speech rate conversion device of the present invention to determine the deletion section according to the weights of pitch and power. By changing the weights in advance to suit the characteristics of each language, the device can perform speech rate conversion suited to that language. For example, for a language with little variation in power, increasing the pitch weight yields deletion sections that give priority to pitch.

When the acoustic feature is pitch, which indicates the height of the voice, the deletion section search means comprises pitch reference search means.

In this configuration, the speech rate conversion device uses the pitch reference search means to search, working backward from the end time of the speech section, for a section in which the change in pitch is smaller than a predetermined criterion, as a deletion section.

When the acoustic feature is power, which indicates the loudness of the voice, the deletion section search means comprises power reference search means.

In this configuration, the speech rate conversion device uses the power reference search means to search, working backward from the end time of the speech section, for a section in which the change in power is smaller than a predetermined criterion, as a deletion section.

In the speech rate conversion device of the present invention, the deletion section search means searches backward from the end time of the speech section only within a range in which the search does not go back past a predetermined time measured from the start time of that speech section, and in which the length of the deletion section does not exceed a predetermined maximum length.

In this configuration, by limiting the length of the deletion section when searching a speech section, the deletion section search means never turns the entire speech section into a deletion section; at least a predetermined length from the beginning of the section is left undeleted. The speech rate conversion device of the present invention thus retains at least a predetermined length from the beginning of each continuous utterance (breath group), regardless of how the acoustic feature changes. This makes it easier for the user to grasp the meaning even though the speech is partially deleted.
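
The two limits can be expressed as a clamp on the candidate start time of the deletion section. The sketch below is illustrative only; `clamp_deletion_start`, `keep_head_sec`, and `max_delete_sec` are hypothetical names for the predetermined values mentioned in the text.

```python
def clamp_deletion_start(candidate_start, section_start, section_end,
                         keep_head_sec, max_delete_sec):
    """Limit how far back a deletion section may reach inside one speech section.

    keep_head_sec:  length at the head of the section that must never be deleted.
    max_delete_sec: maximum allowed length of the deletion section.
    Returns the adjusted start time; the deletion section is [start, section_end].
    """
    earliest = max(section_start + keep_head_sec, section_end - max_delete_sec)
    # Never start before the earliest allowed time, never after the section end.
    return min(max(candidate_start, earliest), section_end)


# Speech section 10.0-16.0 s; the search wanted to delete from 10.5 s onward.
start = clamp_deletion_start(10.5, 10.0, 16.0, keep_head_sec=2.0, max_delete_sec=3.0)
```

In this example the maximum deletion length is the tighter limit, so the deletion section is cut back to the last 3 s of the speech section. For a speech section shorter than `keep_head_sec`, the function returns the section end, which means nothing is deleted.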

In the speech rate conversion device of the present invention, for a non-speech section that is at least a predetermined length long, the deletion section search means additionally sets as a deletion section of the audio content the part of that section other than a retained portion, the retained portion having a predetermined length shorter than the former length.

In this configuration, the speech rate conversion device uses the deletion section search means to set part of a non-speech section as a deletion section, and allocates the time of that deletion section to speech playback.
The speech rate conversion device of the present invention thus gains time, equal to the deleted parts of the speech and non-speech sections, that can be allocated to the remaining speech. When playing audio content at the same playback speed, the device therefore plays the speech more slowly than conventional speech rate conversion, making it easier for the user to follow.
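
The trimming of long non-speech sections can be sketched as follows. The text does not say which part of the section is retained, so keeping the head of the pause is an assumption made here, and the names `trim_non_speech`, `min_len_sec`, and `keep_sec` are hypothetical.

```python
def trim_non_speech(start, end, min_len_sec, keep_sec):
    """Return the (start, end) of the part of a non-speech section to delete, or None.

    min_len_sec: only sections at least this long are trimmed.
    keep_sec:    length of pause to retain; must be shorter than min_len_sec.
    Assumption: the retained portion is the head of the pause.
    """
    if keep_sec >= min_len_sec:
        raise ValueError("keep_sec must be shorter than min_len_sec")
    if end - start < min_len_sec:
        return None                      # short pause: leave it untouched
    return (start + keep_sec, end)       # delete everything after the retained head


long_pause = trim_non_speech(5.0, 6.0, min_len_sec=0.5, keep_sec=0.25)
short_pause = trim_non_speech(7.0, 7.25, min_len_sec=0.5, keep_sec=0.25)
```

Retaining a short pause rather than deleting the whole non-speech section keeps a boundary between utterances while still freeing most of the time for speech playback.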

The present invention is also a program for causing a computer to function as any of the speech rate conversion devices described above.

The present invention provides the following advantageous effects.
According to the present invention, audio content can be played back at high speed by deleting part of its speech sections. The user can thus listen to selected parts of the audio content at high speed, as if skimming text. In addition, the time freed by deleting parts of the speech sections can be allocated to the remaining speech. When audio content is played at the same playback speed, the speech is therefore played more slowly than with conventional speech rate conversion, making it easier for the user to follow.

Block diagram showing the configuration of a speech rate conversion device according to an embodiment of the present invention.
Explanatory diagram illustrating the deletion sections within non-speech sections that the deletion section search means searches for in the speech rate conversion device according to the embodiment.
Explanatory diagram illustrating the deletion sections within speech sections that the deletion section search means searches for in the speech rate conversion device according to the embodiment.
Explanatory diagram illustrating how the output time length adjustment means controls the output length of the audio content with the deletion sections removed in the speech rate conversion device according to the embodiment.
Flowchart showing the operation of the speech rate conversion device according to the embodiment.
Block diagram showing the configuration of a speech rate conversion device according to another embodiment of the present invention.
Block diagram showing the configuration of a speech rate conversion device according to another embodiment of the present invention.

Embodiments of the present invention are described below with reference to the drawings.
[Configuration of the speech rate conversion device]
First, the configuration of a speech rate conversion device 1 according to an embodiment of the present invention is described with reference to FIG. 1. When playing audio content (an audio signal) faster than 1x speed, the speech rate conversion device 1 performs speech rate conversion by partially deleting not only non-speech sections such as silence but also speech sections. In other words, the speech rate conversion device 1 plays back parts of the audio content, much as a person skims printed text by eye.
The speech rate conversion device 1 deletes the portion at the end of a breath group, a continuous stretch of utterance delimited by inhalations, where the change in the acoustic feature has become small, and thereby minimizes the loss of the meaning of the utterance.
Here, the speech rate conversion device 1 comprises acoustic analysis means 10, storage means 20, deletion section search means 30, and output time length adjustment means 40.

The acoustic analysis means 10 performs acoustic analysis on the input audio content and extracts the acoustic features at each time (acoustic feature information) and section information on speech sections and non-speech sections (including silent sections). The acoustic analysis means 10 writes the extracted acoustic feature information and section information to the storage means 20 and notifies the deletion section search means 30 that the analysis is complete.
Here, the acoustic analysis means 10 comprises power extraction means 11, pitch extraction means 12, and speech section detection means 13.

The power extraction means (acoustic feature extraction means) 11 extracts power (the intensity or loudness of the sound), one of the acoustic features, from the audio content (audio signal) input from outside. Any common technique may be used for power extraction. For example, the power extraction means 11 applies a frequency transform (FFT) to the audio content at predetermined time intervals with a predetermined frame width and squares the amplitude values to calculate the power (power spectrum).
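
As a rough illustration of frame-wise power extraction, the sketch below computes the mean-square power of each frame in the time domain and expresses it in dB. This is a deliberate substitution for the FFT-based example in the text: by Parseval's theorem the total of a frame's power spectrum is proportional to its time-domain energy, so the overall power per frame can be obtained without a transform. The function name, frame parameters, and the `floor` guard against log(0) are assumptions of the example.

```python
import math


def frame_power_db(samples, frame_len, hop_len, floor=1e-12):
    """Return the power of each frame in dB (mean-square, time domain).

    samples:   audio samples as floats.
    frame_len: frame width in samples.
    hop_len:   interval between frame starts in samples.
    floor:     lower bound on the mean square so silent frames do not hit log10(0).
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        powers.append(10.0 * math.log10(max(mean_square, floor)))
    return powers


# A loud frame followed by a frame at one tenth of the amplitude (20 dB lower).
powers = frame_power_db([1.0] * 8 + [0.1] * 8, frame_len=8, hop_len=8)
```

An implementation that also needs the spectrum itself, for example for the pitch extraction described later, would use an FFT per frame instead.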

The power extraction means 11 smooths the power signal level over time. For example, it smooths the successive changes in power with a cutoff frequency of about 6 to 10 Hz. This lets the power extraction means 11 extract from the audio content a power contour over time in which changes are smoothed and the influence of noise is suppressed.
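
The text gives a cutoff frequency but not a filter type. As a minimal sketch, a first-order (one-pole) low-pass filter applied to the per-frame power sequence is assumed below; `lowpass` and its parameters are hypothetical, and any other low-pass design with a similar cutoff would serve the same purpose.

```python
import math


def lowpass(values, frame_rate_hz, cutoff_hz):
    """Smooth a per-frame feature sequence with a first-order low-pass filter.

    frame_rate_hz: number of feature values per second.
    cutoff_hz:     cutoff frequency (the text suggests about 6 to 10 Hz for power).
    """
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)   # time constant for the cutoff
    dt = 1.0 / frame_rate_hz                 # time between feature values
    alpha = dt / (rc + dt)                   # smoothing coefficient in (0, 1)
    out = []
    prev = values[0] if values else 0.0
    for v in values:
        prev = prev + alpha * (v - prev)     # move part of the way toward the new value
        out.append(prev)
    return out


# A sudden 10 dB step at 100 frames per second, smoothed with an 8 Hz cutoff.
smoothed = lowpass([0.0, 0.0, 10.0, 10.0, 10.0], frame_rate_hz=100.0, cutoff_hz=8.0)
```

The step is spread over several frames instead of appearing instantly, which is the kind of smoothing that suppresses frame-to-frame noise before the change in power is evaluated.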

The power extraction means 11 writes the extracted power values over time (in dB) to the storage means 20 as one item of acoustic feature information, associated with the time from the start of the audio content. That is, it writes the instantaneous value of the smoothed power waveform at a given time to the storage means 20 in association with that time.

The pitch extraction means (acoustic feature extraction means) 12 extracts pitch (the height of the sound), one of the acoustic features, from the audio content (audio signal) input from outside. Any common technique may be used for pitch extraction. For example, the pitch extraction means 12 computes the autocorrelation function of the power spectrum extracted by the power extraction means 11 and extracts the pitch (fundamental frequency) as the interval between the local maxima of the autocorrelation coefficients.
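
Since any common pitch extraction technique may be used, the sketch below shows one widely used alternative rather than the spectrum-based example in the text: autocorrelation of the waveform in the time domain, with the lag of the strongest peak taken as the pitch period. The function name and the 50 to 400 Hz search range are assumptions of the example.

```python
import math


def estimate_pitch_hz(frame, sample_rate, f_min=50.0, f_max=400.0):
    """Estimate the fundamental frequency of one frame by time-domain autocorrelation.

    Returns None when no positive autocorrelation peak is found (e.g. silence).
    """
    min_lag = int(sample_rate / f_max)                        # shortest period considered
    max_lag = min(int(sample_rate / f_min), len(frame) - 1)   # longest period considered
    best_lag, best_value = 0, 0.0
    for lag in range(min_lag, max_lag + 1):
        # Correlate the frame with a copy of itself shifted by `lag` samples.
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag if best_lag else None


# 0.1 s of a 100 Hz sine sampled at 8 kHz (period = 80 samples).
sine = [math.sin(2.0 * math.pi * 100.0 * n / 8000.0) for n in range(800)]
f0 = estimate_pitch_hz(sine, sample_rate=8000)
```

Real speech would call for additional care, such as windowing and a voiced/unvoiced decision, which this sketch omits.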

Like the power extraction means 11, the pitch extraction means 12 smooths the pitch signal level over time. For example, it smooths the successive changes in pitch with a cutoff frequency of about 10 Hz. This lets the pitch extraction means 12 exclude acoustic components that are not perceived in ordinary conversational speech and extract the change in pitch over time.

The pitch extraction means 12 writes the extracted pitch values over time (in Hz) to the storage means 20 as one item of acoustic feature information, associated with the time from the start of the audio content. That is, it writes the instantaneous value of the smoothed pitch waveform at a given time to the storage means 20 in association with that time.

The speech section detection means (section information detection means) 13 detects, from the audio content (audio signal) input from outside, speech sections that contain speech and non-speech sections that do not (including silent sections).

Any common technique may be used for detecting speech and non-speech sections.
For example, the speech section detection means 13 classifies a time section as a speech section when the power extracted by the power extraction means 11 is greater than a predetermined threshold, and as a non-speech section otherwise. The threshold may be varied adaptively according to the level of the audio signal, and the technique described in JP-A-10-301593 may be used.

Specifically, for the input audio content, the speech section detection means 13 holds the maximum and minimum power values within a predetermined past period in a memory or the like (not shown), and sets the power threshold to a value that is a predetermined amount below the held maximum. When the difference between the maximum and minimum power falls below a predetermined reference value, the speech section detection means 13 raises the threshold according to that difference. This makes it possible to distinguish speech sections from non-speech sections while continuously adapting to changes in the speech level.
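
The adaptive threshold can be sketched as follows. The text says only that the threshold is raised "according to the difference" when the power range is narrow; raising it by the shortfall between the reference value and the observed range is an assumption made here, as are the names `adaptive_threshold`, `margin_db`, and `ref_range_db`.

```python
def adaptive_threshold(recent_powers_db, margin_db, ref_range_db):
    """Return a speech/non-speech power threshold from recent power values (dB).

    margin_db:    how far below the recent maximum the threshold normally sits.
    ref_range_db: if max - min is below this, the window is assumed to hold
                  no clear speech and the threshold is raised.
    """
    hi, lo = max(recent_powers_db), min(recent_powers_db)
    threshold = hi - margin_db
    spread = hi - lo
    if spread < ref_range_db:
        # Assumed rule: raise the threshold by the shortfall in dynamic range.
        threshold += ref_range_db - spread
    return threshold


speechy = adaptive_threshold([40.0, 70.0, 65.0], margin_db=20.0, ref_range_db=30.0)
noise_only = adaptive_threshold([30.0, 32.0, 31.0], margin_db=20.0, ref_range_db=30.0)
```

In the second window the range is only 2 dB, so the threshold ends up above every value in the window and steady background noise is not mistaken for speech. Note that under this particular rule the reference range has to exceed the margin for that to hold.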

The speech section detection means 13 writes the start time and end time of each detected speech section and non-speech section (including silent sections) to the storage means 20 as section information, expressed as times from the start of the audio content. In addition to the start and end times, the section information includes type information indicating whether the section is a speech section or a non-speech section, and identification information (for example, a serial number) indicating the position of the section counted from the beginning of the audio content.
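
A section record with these fields, together with a simple fixed-threshold segmentation that produces such records, might look like the sketch below. The class name `Section`, its field names, and the function `segment` are hypothetical, and the fixed threshold stands in for whichever detection technique is actually used.

```python
from dataclasses import dataclass


@dataclass
class Section:
    serial: int    # position of the section counted from the start of the content
    kind: str      # "speech" or "non-speech"
    start: float   # start time in seconds from the start of the content
    end: float     # end time in seconds from the start of the content


def segment(powers_db, threshold_db, frame_sec):
    """Group consecutive frames of the same type into Section records."""
    sections = []
    for i, power in enumerate(powers_db):
        kind = "speech" if power > threshold_db else "non-speech"
        t0, t1 = i * frame_sec, (i + 1) * frame_sec
        if sections and sections[-1].kind == kind:
            sections[-1].end = t1            # extend the current section
        else:
            sections.append(Section(len(sections), kind, t0, t1))
    return sections


sections = segment([30.0, 62.0, 65.0, 31.0, 29.0], threshold_db=50.0, frame_sec=0.5)
```

For the five frames above this yields three records: a non-speech section, a speech section covering the two loud frames, and a closing non-speech section, numbered 0 to 2.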

Here, the acoustic analysis means 10 analyzes the input audio content sequentially as it is input. Alternatively, the audio content may first be written to the storage means 20, and the acoustic analysis means 10 may then analyze the audio content stored there.

The storage means (audio content storage means, acoustic feature storage means, section information storage means) 20 stores the audio content input from outside, the acoustic feature information (power and pitch) obtained by the acoustic analysis means 10, and the section information (speech sections and non-speech sections). A common storage device such as a hard disk can be used as the storage means 20.
The acoustic feature information and section information stored in the storage means 20 are referenced by the deletion section search means 30 when it searches for deletion sections of the audio content.

The deletion sections of the audio content found by the deletion section search means 30 (deletion section information) are also written to the storage means 20.
The audio content, section information, and deletion section information stored in the storage means 20 are referenced by the output time length adjustment means 40 when it adjusts the output time length of the partially deleted audio content.

The deletion section search means 30 searches the speech sections and non-speech sections of the audio content for sections to delete, based on deletion conditions input from outside.
According to the deletion conditions, the deletion section search means 30 treats non-speech sections of at least a predetermined length as candidates for deletion. In speech sections, it goes back from the end time of the speech section and takes as the deletion section the portion in which the change in the acoustic features is smaller than a predetermined criterion. The deletion conditions are described in detail later.
The deletion section search means 30 writes deletion section information indicating the sections to delete to the storage means 20, and notifies the output time length adjustment means 40 that the search has finished.
Here, the deletion section search means 30 includes a non-speech section search means 31, a non-speech section partial deletion means 32, a speech deletion section search means 33, and a speech section partial deletion means 34.

The non-speech section search means 31 searches for non-speech sections of at least a predetermined length (target non-speech sections), based on the section information stored in the storage means 20.
Specifically, it searches for non-speech sections for which the difference between the start time and the end time stored as section information is longer than the time length predetermined as a deletion condition. This time length exists to exclude short non-speech sections within a breath group from deletion, and is set to 300 ms, for example.
The non-speech section search means 31 outputs the identification information (serial number) of each target non-speech section it finds to the non-speech section partial deletion means 32 and the speech deletion section search means 33.

The non-speech section partial deletion means 32 partially deletes each target non-speech section found by the non-speech section search means 31, leaving a portion whose length is the minimum time length predetermined as a deletion condition. This minimum length of the non-speech section to be left is set to 100 ms, for example.

That is, for a target non-speech section in the section information stored in the storage means 20, the non-speech section partial deletion means 32 sets the end time to the start time plus the minimum time length to be left, and writes the result to the storage means 20 as new section information (deletion section information) associated with the identification information (serial number). The portion deleted from the target non-speech section need not be at its trailing end; it may instead be at its leading end.

The speech deletion section search means 33 searches the speech section immediately preceding a target non-speech section found by the non-speech section search means 31 (the target speech section). Working back from the end of that speech section under predetermined conditions, it looks for a section to delete in which the acoustic features change little.
Restricting the search to the speech section immediately preceding a target non-speech section prevents speech from being deleted in the middle of a breath group.
Here, the speech deletion section search means 33 includes a power reference search means 331, a pitch reference search means 332, and a deletion section determination means 333.

The power reference search means 331 searches the speech section immediately preceding the target non-speech section (the target speech section), going back from the end of that speech section, for a time interval in which the change in power, one of the acoustic features, is smaller than a predetermined amount.

That is, the power reference search means 331 takes as the reference value the power value stored in the storage means 20 for the start time of the target non-speech section. It then goes back in time from the end time of the target speech section toward its start time, and finds the interval in which the difference between the stored power value and the reference value stays below a predetermined amount. The start time of that interval becomes the deletion section start time determined from power (the power reference deletion start time).

The maximum time that the power reference search means 331 may go back from the end time of the speech section toward its start time is set in advance. Even within this maximum time, the power reference search means 331 does not search back into a predetermined time interval measured from the start time of the speech section. As a result, every speech section retains an undeletable portion of at least the predetermined length at its beginning, so the opening of a sentence in a breath group is never deleted. Speech sections shorter than a predetermined length are not treated as target speech sections at all.
The power reference search means 331 outputs the start time of the deletion section in the speech section (the power reference deletion start time) to the deletion section determination means 333.
Examples of the deletion section found by the power reference search means 331 are described more specifically later.

The pitch reference search means 332 searches the speech section immediately preceding the target non-speech section (the target speech section), going back from the end of that speech section, for a time interval in which the change in pitch, one of the acoustic features, is smaller than a predetermined amount.

That is, the pitch reference search means 332 takes as the reference value the pitch value stored in the storage means 20 for the start time of the target non-speech section. It then goes back in time from the end time of the target speech section toward its start time, and finds the interval in which the difference between the stored pitch value and the reference value stays below a predetermined amount. The start time of that interval becomes the deletion section start time determined from pitch (the pitch reference deletion start time).

The maximum time that the pitch reference search means 332 may go back from the end time of the speech section toward its start time is set in advance. As with the power reference search means 331, even within this maximum time the pitch reference search means 332 does not search back into a predetermined time interval measured from the start time of the speech section. As a result, every speech section retains an undeletable portion of at least the predetermined length at its beginning, so the opening of a sentence in a breath group is never deleted. Speech sections shorter than a predetermined length are not treated as target speech sections at all.
The pitch reference search means 332 outputs the start time of the deletion section in the speech section (the pitch reference deletion start time) to the deletion section determination means 333.
Examples of the deletion section found by the pitch reference search means 332 are described more specifically later.

The deletion section determination means 333 determines the deletion section (its start time) in the corresponding speech section, based on the power reference deletion start time found by the power reference search means 331 and the pitch reference deletion start time found by the pitch reference search means 332. The end time of the deletion section is the same as the end time of the speech section.

Here, the deletion section determination means 333 is configured in advance with weights for power and pitch, and determines the deletion section according to those weights (their ratio). For example, when the power weight is m, the pitch weight is n, the power reference deletion start time is t_pw, and the pitch reference deletion start time is t_pi, the deletion section determination means 333 calculates the start time t_d of the deletion section by the following equation (1).

The deletion section determination means 333 outputs the determined deletion section (start time) to the speech section partial deletion means 34.
Although the deletion section is determined here according to the weights (ratio) of power and pitch, the deletion section in the speech section may instead start at whichever of the power reference deletion start time and the pitch reference deletion start time is earlier, or at whichever of the two is later.
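Equation (1) itself does not appear in this text. The sketch below therefore assumes one reading that is consistent with "weights (ratio)", namely a weighted mean of the two start times, and also covers the earlier/later alternatives just described. The function name, the `mode` argument, and the weighted-mean form are assumptions, not the patent's stated formula.

```python
def decide_deletion_start(t_pw, t_pi, m=1.0, n=1.0, mode="weighted"):
    """Combine the power-based and pitch-based deletion start times into t_d.

    t_pw, t_pi : power / pitch reference deletion start times (same unit)
    m, n       : weights for power and pitch
    """
    if mode == "weighted":
        # ASSUMED form of equation (1): weighted mean of the two start times.
        return (m * t_pw + n * t_pi) / (m + n)
    if mode == "earlier":
        return min(t_pw, t_pi)   # longer deletion section
    if mode == "later":
        return max(t_pw, t_pi)   # shorter, more conservative deletion section
    raise ValueError(f"unknown mode: {mode}")
```

Under this assumption, `decide_deletion_start(800, 900, m=3, n=1)` lands closer to the power-based time, which matches the idea of prioritising power for languages whose pitch varies little.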

How the deletion section determination means 333 determines the deletion section from power and pitch may be fixed in advance according to, for example, the language of the audio content. For a language in which power varies little, increasing the pitch weight gives a deletion section that prioritizes pitch. Conversely, for a language in which pitch varies little, increasing the power weight gives a deletion section that prioritizes power.

The speech section partial deletion means 34 partially deletes, from the speech section immediately preceding the target non-speech section, the deletion section found by the speech deletion section search means 33.
That is, the speech section partial deletion means 34 sets the end time of the corresponding speech section in the section information stored in the storage means 20 to the time immediately before the start time of the deletion section found by the speech deletion section search means 33, and writes the result to the storage means 20 as new section information (deletion section information).

A specific example of how the deletion section search means 30 searches for sections to delete within the audio content is now described with reference to FIGS. 2 and 3 (and FIG. 1 as appropriate).

[Deletion section in a non-speech section]
First, the section deleted from a non-speech section is described with reference to FIG. 2.
As shown in FIG. 2, the deletion section search means 30 treats a non-speech section (including a silent section) whose length is at least a predetermined time length as a target non-speech section Seg1, keeps only the predetermined minimum time length leaveS1, and makes the remainder the deletion section. For example, the target non-speech section Seg1 is 300 ms or longer, and the minimum time length leaveS1 is 100 ms.

That is, the non-speech section search means 31 searches the audio content for non-speech sections of 300 ms or longer and treats each as a target non-speech section Seg1. The non-speech section partial deletion means 32 then takes as the deletion section the interval from the time obtained by adding leaveS1 to the start time t_1s of Seg1 (t_1s + leaveS1) to the end time t_1e of Seg1.
As a result, a non-speech portion of length leaveS1, beginning at the start time t_1s of Seg1, is left undeleted.
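The rule for non-speech sections is simple enough to state as a short function. The sketch below uses the example values from the text (300 ms and 100 ms); the function name and the millisecond integer times are assumptions made for illustration.

```python
def non_speech_deletion(t_1s, t_1e, min_len_ms=300, leave_s1_ms=100):
    """Return the (start, end) of the deletion section in a non-speech section,
    or None when the section is too short to be a target (Seg1)."""
    if t_1e - t_1s < min_len_ms:
        return None                      # short pause inside a breath group: keep it whole
    return (t_1s + leave_s1_ms, t_1e)    # keep the first leaveS1 ms, delete the rest

# A 500 ms pause starting at 1000 ms keeps 1000-1100 ms and loses 1100-1500 ms.
deleted = non_speech_deletion(1000, 1500)
```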

[Deletion section in a speech section]
Next, the section deleted from a speech section is described with reference to FIG. 3. The speech section subject to partial deletion is the speech section immediately preceding the target non-speech section Seg1 described with FIG. 2 (the target speech section Seg2). The target speech section Seg2 must be longer than the predetermined minimum time length leaveS2.

As shown in FIG. 3, the deletion section search means 30 operates on the target speech section Seg2, which immediately precedes the target non-speech section Seg1 and is longer than the minimum time length leaveS2. Taking the acoustic feature at the start time t_1s of Seg1 as the reference, it goes back from the end time t_2e to the search final time t_stop, at which the change from that acoustic feature exceeds a predetermined criterion, and makes the interval in which the change stays below the criterion the deletion section. The time between t_stop and t_2e never exceeds the predetermined maximum deletion time length cutMax, and the search does not extend into the first leaveS2 of Seg2. For example, cutMax is 250 ms and leaveS2 is 50 ms.

That is, the power reference search means 331 and the pitch reference search means 332 of the speech deletion section search means 33 search back from the end time t_2e of Seg2 under two limits: the search time must not go earlier than the time (t_2s + leaveS2) set relative to the start time t_2s of Seg2, and the deletion section must not exceed the predetermined maximum time length (cutMax). Within those limits, the portion in which the change relative to the reference acoustic feature at the head of Seg1 is small becomes the deletion section.
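The two limits and the backward walk can be sketched as below. This is a minimal illustration, assuming frame times in milliseconds and a caller-supplied predicate `keep_going(t)` that stands in for whichever power or pitch condition is in use; both names are hypothetical.

```python
def search_deletion_start(times_ms, keep_going, t_2s, t_2e,
                          cut_max_ms=250, leave_s2_ms=50):
    """Walk back from t_2e and return the earliest time still inside the deletion section.

    times_ms   : frame times available in Seg2
    keep_going : predicate, True while the feature change at time t is small
    Returns None when Seg2 is not longer than leaveS2 (not a target section),
    and t_2e when nothing can be deleted.
    """
    if t_2e - t_2s <= leave_s2_ms:
        return None
    # Never enter the first leaveS2 of Seg2, never delete more than cutMax.
    earliest_allowed = max(t_2s + leave_s2_ms, t_2e - cut_max_ms)
    t_d = t_2e
    candidates = sorted((t for t in times_ms if earliest_allowed <= t < t_2e), reverse=True)
    for t in candidates:
        if not keep_going(t):
            break            # feature change exceeded the criterion: stop here
        t_d = t
    return t_d

frames = list(range(0, 1000, 10))                      # a 1 s speech section, 10 ms frames
t_d = search_deletion_start(frames, lambda t: t >= 900, 0, 1000)
```

When the predicate never fails, the result is bounded by cutMax (750 ms in this example); in a short section it is bounded by leaveS2 instead.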

The conditions under which the search continues in the target speech section Seg2, that is, the conditions under which the change in the acoustic features is considered smaller than the predetermined criterion (the deletion conditions), are described below with examples.

(When referring to power)
First, the conditions under which the power reference search means 331 searches for the deletion section in the target speech section Seg2 by referring to power are described. As noted above, even while the following conditions hold, the search time remains limited by the maximum deletion time length cutMax and the minimum time length leaveS2.

<Example 1>
In Example 1, let PW_BASE be the power value at the start time t_1s of the target non-speech section Seg1 (the instantaneous value of the smoothed waveform at time t_1s), and let PW_NOW be the power value at the current search time (the instantaneous value of the smoothed waveform at the search time). The power reference search means 331 continues the search while the condition of the following expression (2) is satisfied.

Here, th1 is a predetermined threshold, for example 10 (dB).
The power reference search means 331 takes the search time at which this condition is no longer satisfied as the search final time t_stop.
In Example 1, the power at the end of the speech section is compared with the power at the head of the non-speech section, and the section to delete is identified on the condition that the difference is small.
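Expression (2) does not appear in this text. Going by the description (a small difference between the two power values, with th1 given in dB), one plausible reading is an absolute-difference test, sketched below. The exact form, including the use of an absolute value, is an assumption.

```python
def power_condition_fixed(pw_now_db, pw_base_db, th1_db=10.0):
    """ASSUMED form of expression (2): |PW_NOW - PW_BASE| < th1 (values in dB)."""
    return abs(pw_now_db - pw_base_db) < th1_db

# Reference power at the head of the pause: -60 dB
still_quiet = power_condition_fixed(-52.0, -60.0)   # 8 dB away: search continues
too_loud = power_condition_fixed(-45.0, -60.0)      # 15 dB away: search stops
```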

<Example 2>
In Example 2, let PW_BASE be the power value at the start time t_1s of the target non-speech section Seg1 (the instantaneous value of the smoothed waveform at time t_1s), PW_NOW the power value at the current search time (the instantaneous value of the smoothed waveform at the search time), PW_MAX the maximum power value in the target speech section Seg2, and PW_MIN the minimum power value in Seg2. The power reference search means 331 continues the search while the condition of the following expression (3) is satisfied.

Here, th2 is a predetermined coefficient that adjusts the threshold, for example 0.1.
The power reference search means 331 takes the search time at which this condition is no longer satisfied as the search final time t_stop.
Example 2 is the same as Example 1 in requiring that the power at the end of the speech section differ little from the power at the head of the non-speech section. However, because that difference varies from speaker to speaker, the threshold is varied according to the power within the speech section. As a result, an appropriate deletion section can be identified even when the speaker changes within the audio content.
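Expression (3) is likewise absent from this text. Since th2 is described as a coefficient that scales the threshold with the power inside the speech section, the sketch below assumes the threshold is th2 times the power range of Seg2. That form is a guess consistent with the description, not a confirmed formula.

```python
def power_condition_relative(pw_now, pw_base, pw_max, pw_min, th2=0.1):
    """ASSUMED form of expression (3):
    |PW_NOW - PW_BASE| < th2 * (PW_MAX - PW_MIN)
    """
    return abs(pw_now - pw_base) < th2 * (pw_max - pw_min)

# Seg2 spans -60 dB to -10 dB, so the assumed threshold is 0.1 * 50 = 5 dB.
within = power_condition_relative(-57.0, -60.0, pw_max=-10.0, pw_min=-60.0)
beyond = power_condition_relative(-50.0, -60.0, pw_max=-10.0, pw_min=-60.0)
```

Under this reading a speaker with a wide dynamic range tolerates a larger absolute difference than a quiet, flat speaker, which is the adaptation the text describes.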

(When referring to pitch)
Next, the conditions under which the pitch reference search means 332 searches for the deletion section in the target speech section Seg2 by referring to pitch are described.
Let PT_BASE be the pitch value at the start time t_1s of the target non-speech section Seg1 (the smoothed frequency of the pitch waveform at time t_1s), and let PT_NOW be the pitch value at the current search time (the smoothed frequency of the pitch waveform at the search time). The pitch reference search means 332 continues the search while the condition of the following expression (4) is satisfied.
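Expression (4) does not appear in this text. The surrounding description says the pitch near the end of the speech section must be smaller than a predetermined multiple of the pitch at the head of the non-speech section, with a threshold th3 of, for example, 2. The sketch below assumes exactly that one-sided test; the form and the Hz unit are assumptions.

```python
def pitch_condition(pt_now_hz, pt_base_hz, th3=2.0):
    """ASSUMED form of expression (4): PT_NOW < th3 * PT_BASE."""
    return pt_now_hz < th3 * pt_base_hz

# Reference pitch 100 Hz: anything below 200 Hz keeps the search going.
continues = pitch_condition(150.0, 100.0)
stops = pitch_condition(250.0, 100.0)
```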

Here, th3 is a predetermined threshold, for example 2.
The pitch reference search means 332 takes the search time at which this condition is no longer satisfied as the search final time t_stop.
In this example, the section to delete is identified on the condition that the pitch at the end of the speech section is smaller than a predetermined multiple of the pitch at the head of the non-speech section.
As noted above, even while this condition holds, the search time remains limited by the maximum deletion time length cutMax and the minimum time length leaveS2.
Returning to FIG. 1, the description of the configuration of the speech speed conversion device 1 continues.

The output time length adjustment means 40 partially deletes the audio content based on the sections to delete from speech and non-speech sections found by the deletion section search means 30 (the deletion section information), and adjusts the output time length of the audio content so that the specified playback speed is achieved. Here, the output time length adjustment means 40 includes an expansion/contraction rate calculation means 41 and an output time length changing means 42.

The expansion/contraction rate calculation means 41 calculates the expansion/contraction rate for speech sections so that the playback time of the audio content with the deletion sections removed equals the time it would take to play the undeleted audio content at the specified playback speed (playback magnification). For non-speech sections the rate is fixed at 1, that is, no expansion or contraction is applied.

Specifically, let P_O be the total length of the speech sections in the audio content before deletion, Q_O the total length of the non-speech sections before deletion, R_O the specified playback speed (playback magnification), P_D the total length of the speech sections after the deletion sections are removed, and Q_D the total length of the non-speech sections after removal. The expansion/contraction rate calculation means 41 calculates the expansion/contraction rate R_D of the speech sections by the following equation (5).

The expansion/contraction rate calculation means 41 outputs the calculated expansion/contraction rate of the speech sections to the output time length changing means 42.
When audio content is played back at high speed, the expansion/contraction rate calculation means 41 basically calculates a rate that shortens the speech sections. However, when the playback magnification is small and the deletion sections within the speech sections are long, it may instead calculate a rate that lengthens the remaining speech sections.
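Equation (5) is not reproduced in this text, but the stated goal constrains it. If R_D is read as a factor applied to the duration of the remaining speech (an assumption; the text does not say whether the rate scales duration or speed), then matching the target playback time gives P_D * R_D + Q_D = (P_O + Q_O) / R_O, which the sketch below solves for R_D.

```python
def stretch_rate(p_o, q_o, r_o, p_d, q_d):
    """Duration factor R_D for speech sections, under the assumed reading
    P_D * R_D + Q_D = (P_O + Q_O) / R_O.
    """
    target = (p_o + q_o) / r_o     # playback time of the undeleted content at speed R_O
    return (target - q_d) / p_d    # non-speech keeps its length; speech absorbs the rest

# 60 s of speech and 20 s of pauses; deletion leaves 54 s and 8 s.
fast = stretch_rate(60.0, 20.0, 2.0, 54.0, 8.0)   # about 0.59, versus 0.5 for uniform 2x
slow = stretch_rate(60.0, 20.0, 1.2, 54.0, 8.0)   # above 1: remaining speech is lengthened
```

The first case shows the benefit claimed for the device: at the same overall 2x speed, the remaining speech is compressed less than under uniform compression. The second case shows the situation in which a small playback magnification leads to stretching.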

The output time length changing means 42 changes the output time length of the audio content from which the deletion sections found by the deletion section search means 30 have been removed, based on the expansion/contraction rate calculated by the expansion/contraction rate calculation means 41. That is, the output time length changing means 42 reads the audio data of the audio content section by section (speech section, non-speech section), based on the section information and deletion section information stored in the storage means 20, and adjusts the output time length.

For a speech section with a deletion section defined in the deletion section information, the output time length changing means 42 reads from the storage means 20 the audio data with that deletion section removed, and changes its time length by the expansion/contraction rate calculated by the expansion/contraction rate calculation means 41.
For a speech section with no deletion section defined, the output time length changing means 42 reads the audio data for the whole speech section from the storage means 20, and changes its time length by the same expansion/contraction rate.

To expand or contract the audio data according to the expansion/contraction rate, the speech waveform is thinned out or repeated in units of the pitch period, and the waveforms are overlapped and joined over a time length corresponding to the rate. A general speech speed conversion technique may be used for this; for example, the techniques of Japanese Patent No. 3327936 and Japanese Patent No. 2955247 can be used.

For a non-voice section in which a deletion section is defined by the deletion section information, the output time length changing means 42 reads from the storage means 20 the audio data (non-voice data) with that deletion section removed, and outputs it as it is without expansion or contraction.
For a non-voice section in which no deletion section is defined, the output time length changing means 42 reads the audio data (non-voice data) for the whole non-voice section from the storage means 20, and outputs it as it is without expansion or contraction.
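The four cases above reduce to one rule: remove any deletion section, then time-scale the section only if it is a voice section. The sketch below expresses that rule on durations alone. The `Section` record and its field names are hypothetical; the patent does not specify how the section information and deletion section information are laid out.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: float           # seconds
    end: float             # seconds
    is_voice: bool
    deleted: float = 0.0   # seconds removed from this section as a deletion section


def output_durations(sections, ratio):
    """Return the output duration of each section after deletion and scaling.

    `ratio` is output duration / input duration for voice audio.
    """
    result = []
    for s in sections:
        kept = (s.end - s.start) - s.deleted
        if kept < 0:
            raise ValueError("deleted time exceeds section length")
        # Voice sections are time-scaled; non-voice sections pass through unchanged.
        result.append(kept * ratio if s.is_voice else kept)
    return result
```

For example, with a ratio of 0.5, a 6 s voice section with 2 s deleted is output in 2 s, while a 3 s pause with 2 s deleted is output in 1 s.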

In this way, the speech speed conversion device 1 adjusts the output time length by providing deletion sections not only in non-voice sections but also in voice sections. Even at the same playback speed as conventional speech speed conversion, it can therefore allocate more time than before to the portion of speech that is actually played back, which makes the speech easier to hear during playback.

Here, the speech speed conversion processing of audio content in the speech speed conversion device 1 is described schematically with reference to FIG. 4 (and FIG. 1 as appropriate).
FIG. 4(a) shows the audio content data before speech speed conversion, containing a voice section and a non-voice section. The non-voice section is assumed to be a target non-voice section, that is, one of at least a predetermined time length that is subject to deletion. The voice section is assumed to contain, before the non-voice section, a section B in which the acoustic features change little.
That is, the speech speed conversion device 1 performs acoustic analysis of the audio content with the acoustic analysis means 10 to generate section information on voice sections and non-voice sections (including silent sections), and acoustic feature information, such as power and pitch, for identifying sections A and B.

FIG. 4(b) shows the audio content of FIG. 4(a) with deletion sections set. As shown in FIG. 4(b), the speech speed conversion device 1 uses the deletion section search means 30 to identify section B of FIG. 4(a), in which the acoustic features change little, as deletion section D1, and also identifies, in the non-voice section, a section of at least a predetermined time length as deletion section D2. Within the voice section, the speech speed conversion device 1 then treats only section A as the playback target.

FIG. 4(c) shows the data after the speech speed conversion device 1 has converted the audio content of FIG. 4(a). Here, as an example, the playback speed is tripled.
That is, the speech speed conversion device 1 uses the output time length adjusting means 40 to delete deletion sections D1 and D2 from the audio content, and adjusts the output time length of the voice section so that the total time length corresponds to 3x playback of the audio content of FIG. 4(a). Of the voice section in FIG. 4(a), only section A is converted, becoming section A1 in FIG. 4(c).

FIG. 4(d) shows the data after the audio content of FIG. 4(a) has been converted by conventional speech speed conversion. The conventional method is also assumed to delete part of the data (D2) from the non-voice section. In the conventional method, sections A and B of the voice section in FIG. 4(a) are converted into sections A2 and B2 in FIG. 4(d), respectively. That is, the conventional method also applies speech speed conversion to section B, the section with little change in acoustic features that the speech speed conversion device 1 deletes from the voice section.

As a comparison of FIG. 4(c) and FIG. 4(d) shows, even at the same playback speed, the time length given to the audio data of section A, the part of the voice section in FIG. 4(a) that is to be played back, is that of section A1 in (c) and that of section A2 in (d), and FIG. 4(c) secures the longer time length. Speech converted by the speech speed conversion device 1 is thus played back more slowly than speech converted by the conventional method, and is easier to hear.
The configuration of the speech speed conversion device 1 has been described above. The device can be operated by a program (a speech speed conversion program) that causes a general-purpose computer to function as each of the means described above. This program can also be distributed on a computer-readable recording medium such as a CD-ROM.

As described above, the speech speed conversion device 1 can generate audio content that can be played back at high speed by deleting speech in which the acoustic features change little. This allows the playback speed, which was conventionally limited to 3x, to be increased further, so that a listener can take in audio content in much the same way as a reader skims a text by eye.
Moreover, even when playing back at the same speed as before, the speech speed conversion device 1 allocates relatively more time to the speech that is played back, and can therefore convert the content into audio that is easier to hear than before.

[Operation of the speech speed conversion device]
Next, the operation of the speech speed conversion device 1 is described with reference to FIG. 5 (and FIG. 1 as appropriate for the configuration).
First, the speech speed conversion device 1 performs acoustic analysis on the input audio content with the acoustic analysis means 10 (step S1). That is, the power extraction means 11 of the acoustic analysis means 10 extracts power (the intensity or loudness of the sound), which is one of the acoustic features, and the pitch extraction means 12 extracts pitch (the height of the sound). In addition, the voice section detection means 13 of the acoustic analysis means 10 detects, from the audio content, voice sections that contain speech and non-voice sections that do not (including silent sections). These acoustic features and the section information are stored in the storage means 20. The input audio content is also stored in the storage means 20.

Next, the speech speed conversion device 1 uses the non-voice section search means 31 of the deletion section search means 30 to refer to the section information stored in the storage means 20 and search for a non-voice section of at least a predetermined time length (a target non-voice section) (step S2). The non-voice section partial deletion means 32 of the deletion section search means 30 then partially deletes the target non-voice section, leaving a section of the minimum time length to be retained, which is predetermined as a deletion condition, and writes the result to the storage means 20 as new section information (deletion section information) (step S3).
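Steps S2 and S3 can be sketched as a single pass over the section list. The sketch assumes sections are given as `(start, end, is_voice)` tuples in seconds, and that the retained part of a pause is its beginning; neither the data layout nor which part of the pause is kept is stated in this passage, and the function and parameter names are hypothetical.

```python
def plan_nonvoice_deletions(sections, min_target_len, keep_len):
    """Return (start, end) deletion spans for long non-voice sections.

    sections       : list of (start, end, is_voice) tuples, in seconds
    min_target_len : a pause must be at least this long to be a target (S2)
    keep_len       : minimum pause length left in place (S3)
    """
    deletions = []
    for start, end, is_voice in sections:
        length = end - start
        if is_voice or length < min_target_len:
            continue  # not a target non-voice section
        if length > keep_len:
            # Assumption: keep the first keep_len seconds, delete the rest.
            deletions.append((start + keep_len, end))
    return deletions
```

With `min_target_len=1.0` and `keep_len=0.5`, a pause from 5.0 s to 8.0 s yields the deletion span (5.5, 8.0), while a 0.25 s pause is left untouched.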

Then, in the voice section immediately preceding the target non-voice section found in step S2, the speech speed conversion device 1 uses the voice deletion section search means 33 to refer to the acoustic feature information stored in the storage means 20 and search, from the end of that voice section, for a section to be deleted in which the acoustic features change little (a deletion section) (step S4).

Specifically, the power reference search means 331 of the voice deletion section search means 33 searches the voice section immediately preceding the target non-voice section (the target voice section), going back from the end of that voice section, for a deletion section in which the change in power is smaller than a predetermined amount. Likewise, the pitch reference search means 332 of the voice deletion section search means 33 searches the same voice section, going back from its end, for a deletion section in which the change in pitch is smaller than a predetermined amount. The deletion section determination means 333 of the voice deletion section search means 33 then determines the deletion section from the deletion sections found independently on the basis of power and of pitch, according to predetermined weights for power and pitch.
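The backward search and the weighted decision can be sketched as follows. Two points are assumptions of this example rather than statements from the text: "change" is measured as deviation from the feature value at the last frame of the voice section, and the two candidates are combined as a weighted average of their start frames. The function names are hypothetical.

```python
def search_flat_tail(values, max_change):
    """Walk back from the last frame while the feature stays within
    `max_change` of its final value; return the index where that tail begins.
    """
    if not values:
        return 0
    ref = values[-1]
    start = len(values)
    for i in range(len(values) - 1, -1, -1):
        if abs(values[i] - ref) > max_change:
            break          # the feature changed too much: stop going back
        start = i
    return start


def decide_deletion_start(power_start, pitch_start, power_weight=0.5):
    """Blend the power-based and pitch-based candidates into one start frame.

    Assumption: a weighted average of the two start indices.
    """
    if not 0.0 <= power_weight <= 1.0:
        raise ValueError("power_weight must be between 0 and 1")
    return round(power_weight * power_start + (1.0 - power_weight) * pitch_start)
```

The deletion section then runs from the decided start frame to the end of the voice section. Setting `power_weight` to 1.0 or 0.0 reduces the decision to the power-only or pitch-only candidate.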

The speech speed conversion device 1 then uses the voice section partial deletion means 34 to partially delete from the voice section the deletion section found in step S4, and writes the result to the storage means 20 as new section information (deletion section information) (step S5).
If not all target non-voice sections in the section information have been searched (No in step S6), the speech speed conversion device 1 returns to step S2 and searches for the next target non-voice section.

If, on the other hand, all target non-voice sections have been searched (Yes in step S6), the speech speed conversion device 1 uses the output time length adjusting means 40 to partially delete the audio content based on the new section information (deletion section information) stored in the storage means 20, and to adjust the output time length of the audio content so that the designated playback speed is obtained.

That is, the expansion/contraction ratio calculation means 41 of the output time length adjusting means 40 calculates the expansion/contraction ratio of the voice sections so that the playback time length of the audio content with the deletion sections removed equals the time length obtained by playing the audio content before deletion at the designated playback speed (playback magnification) (step S7). The output time length changing means 42 of the output time length adjusting means 40 then reads the audio data of the audio content section by section (voice section, non-voice section), based on the section information and the deletion section information stored in the storage means 20, and adjusts the output time length based on the expansion/contraction ratio (step S8).
Through the above operation, the speech speed conversion device 1 can output audio content that can be played back at high speed by deleting audio data in voice sections as well.
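The condition in step S7 can be turned into a small calculation. The sketch below assumes, in line with the description of the output time length changing means 42, that only voice audio is time-scaled and that the retained non-voice audio is output unchanged. The function name, parameter names, and the closed-form expression are assumptions of this example, not the patent's own formula.

```python
def expansion_ratio(original_len, voice_kept, nonvoice_kept, speed):
    """Time-scale factor for the kept voice audio (output / input duration).

    original_len  : length of the content before any deletion, in seconds
    voice_kept    : voice audio remaining after deletion, in seconds
    nonvoice_kept : non-voice audio remaining after deletion, in seconds
    speed         : designated playback magnification (e.g. 3.0 for 3x)
    """
    if speed <= 0 or voice_kept <= 0:
        raise ValueError("speed and voice_kept must be positive")
    target_len = original_len / speed            # length of plain speed-times playback
    room_for_voice = target_len - nonvoice_kept  # non-voice audio is not scaled
    if room_for_voice <= 0:
        raise ValueError("kept non-voice audio alone exceeds the target length")
    return room_for_voice / voice_kept
```

As a worked case, take a 12 s clip at 3x with 1 s of pause retained. If 3 s of low-variation speech is deleted so that 6 s of voice remains, the ratio is 0.5, so the remaining speech plays at an effective 2x. If no speech is deleted and 9 s of voice remains, the ratio is about 0.33, an effective 3x, for the same overall output length of 4 s.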

An embodiment of the present invention has been described above, but the present invention is not limited to this embodiment.
For example, the speech speed conversion device 1 is configured here to include the acoustic analysis means 10. However, if the data corresponding to the audio content (acoustic feature information and section information) has already been produced by an external analysis device, the device may simply receive that data and store it in the storage means 20.

The present invention is characterized by partially deleting voice sections, and the non-voice section partial deletion means 32 is not an essential component. However, providing the non-voice section partial deletion means 32 allows the time of the deleted non-voice sections to be reallocated to speech playback, so including it is the more preferable form.

In the speech speed conversion device 1 described here, the power extraction means 11 and the pitch extraction means 12 of the acoustic analysis means 10 smooth their respective acoustic features before writing them to the storage means 20. However, the power extraction means 11 and the pitch extraction means 12 may instead write the acoustic features at each extracted time as they are.
In that case, the voice deletion section search means 33 smooths the acoustic features stored in the storage means 20 as it goes, and searches for the deletion section using the resulting value at each time.
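Whether smoothing happens at extraction time or at search time, the operation itself is the same. The text does not name a smoothing method, so the following sketch uses a centered moving average purely as an assumed example; the function name and the `window` parameter are hypothetical.

```python
def moving_average(values, window):
    """Centered moving average over `window` frames.

    Near the edges the window shrinks to the frames that exist, so the
    output has the same length as the input.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    half = window // 2
    out = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out
```

Smoothing of this kind suppresses frame-to-frame jitter in power or pitch, so that a threshold on "change" responds to the overall contour rather than to measurement noise.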

The speech speed conversion device 1 described here uses both power and pitch as acoustic features, but either one alone may be used.
For example, as shown in the configuration of the speech speed conversion device 1B in FIG. 6, the power extraction means 11, the power reference search means 331, and the deletion section determination means 333 may be omitted from the configuration of the speech speed conversion device 1 (FIG. 1), so that speech speed conversion is performed using only pitch as the acoustic feature.

Similarly, as shown in the configuration of the speech speed conversion device 1C in FIG. 7, the pitch extraction means 12, the pitch reference search means 332, and the deletion section determination means 333 may be omitted from the configuration of the speech speed conversion device 1 (FIG. 1), so that speech speed conversion is performed using only power as the acoustic feature.

DESCRIPTION OF SYMBOLS
1 Speech speed conversion device
10 Acoustic analysis means
11 Power extraction means (acoustic feature extraction means)
12 Pitch extraction means (acoustic feature extraction means)
13 Voice section detection means (section information detection means)
20 Storage means (audio content storage means, acoustic feature storage means, section information storage means)
30 Deletion section search means
31 Non-voice section search means
32 Non-voice section partial deletion means
33 Voice deletion section search means
331 Power reference search means
332 Pitch reference search means
333 Deletion section determination means
34 Voice section partial deletion means
40 Output time length adjusting means
41 Expansion/contraction ratio calculation means
42 Output time length changing means

Claims

A speech speed conversion device that partially deletes audio content and plays it back at a designated playback magnification, comprising:
audio content storage means for storing the audio content in advance;
acoustic feature storage means for storing in advance, in association with time, a pitch indicating the height of the voice and a power indicating the loudness of the voice at each time of the audio content, or either the pitch or the power, as acoustic features;
section information storage means for storing in advance the voice sections and the non-voice sections of the audio content in association with the time of the audio content;
deletion section search means for searching, in the voice section immediately preceding a non-voice section and going back from the end time of that voice section, for a section in which the change in the acoustic features is smaller than a predetermined reference, as a deletion section of the audio content; and
output time length adjusting means for expanding or contracting and outputting the audio content excluding the deletion sections, so that the output time length of the audio content excluding the plurality of deletion sections found by the deletion section search means equals the output time length at the designated playback magnification relative to the time length of the original audio content.

The speech speed conversion device according to claim 1, further comprising:
acoustic feature extraction means for extracting the acoustic features from the audio content and writing them to the acoustic feature storage means in association with time; and
section information detection means for detecting the voice sections and the non-voice sections in the audio content and writing them to the section information storage means in association with time.

The speech speed conversion device according to claim 1 or 2, wherein:
when the acoustic features are a pitch indicating the height of the voice and a power indicating the loudness of the voice, the deletion section search means comprises
pitch reference search means for searching, going back from the end time of the voice section, for a section in which the change in pitch is smaller than a predetermined reference, as the deletion section,
power reference search means for searching, going back from the end time of the voice section, for a section in which the change in power is smaller than a predetermined reference, as the deletion section, and
deletion section determination means for determining the deletion section in the voice section from the deletion section found by the pitch reference search means and the deletion section found by the power reference search means, according to a predetermined ratio of pitch and power weights;
when the acoustic feature is the pitch, the deletion section search means comprises
pitch reference search means for searching, going back from the end time of the voice section, for a section in which the change in pitch is smaller than a predetermined reference, as the deletion section; and
when the acoustic feature is the power, the deletion section search means comprises
power reference search means for searching, going back from the end time of the voice section, for a section in which the change in power is smaller than a predetermined reference, as the deletion section.

The speech speed conversion device according to claim 1 or 2, wherein the deletion section search means searches for the deletion section within a range in which the time reached by going back from the end time of the voice section does not pass a predetermined time measured from the start time of the voice section, and the time length of the deletion section searched for does not exceed a predetermined maximum time length.

The speech speed conversion device according to claim 1 or 2, wherein, in a non-voice section of at least a predetermined time length, the deletion section search means further sets, as a deletion section of the audio content, the section other than a section of a predetermined time length shorter than that time length.

A speech speed conversion program for causing a computer to function as the speech speed conversion device according to any one of claims 1 to 5.