JP2008058956A - Speech reproduction device

Info

Publication number: JP2008058956A
Application number: JP2007195708A
Authority: JP
Inventors: Meiko Maeda; 芽衣子前田; Masayuki Misaki; 正之三崎; Takeshi Kawamura; 岳河村
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2006-07-31
Filing date: 2007-07-27
Publication date: 2008-03-13
Anticipated expiration: 2027-07-27
Also published as: JP4965371B2

Abstract

PROBLEM TO BE SOLVED: To provide a speech reproduction device capable of performing speed conversion appropriate to the input audio signal while achieving the target time.

SOLUTION: The speech reproduction device includes: a discrimination means which discriminates, in an audio signal, between speech sections that include speech and non-speech sections that do not; a speech content calculation means which calculates, based on the result of the discrimination means, the speech content indicating the ratio of speech sections included in the audio signal; a speed ratio calculation means which calculates, based on the speech content, the speed ratio of the speech sections and the speed ratio of the non-speech sections, each relative to the reproduction speed preset for the audio signal, so that the reproduction time of the audio signal becomes a prescribed reproduction time; and a speed change means which receives the audio signal and changes the reproduction speeds of the speech sections and non-speech sections included in it based on the respective speed ratios.

COPYRIGHT: (C)2008, JPO&INPIT

Description

The present invention relates to an audio reproduction device, and more particularly to an audio reproduction device that reproduces an audio signal at a changed reproduction speed.

Conventionally, as an audio reproduction device that reproduces an audio signal at a changed reproduction speed, a device has been proposed that separately changes the reproduction speed of speech sections, which contain speech, and of non-speech sections, which do not (see, for example, Patent Document 1). A conventional audio reproduction device is described below with reference to FIG. 33. FIG. 33 is a block diagram showing the configuration of the conventional audio reproduction device.

In the conventional audio reproduction device shown in FIG. 33, the user sets a target time for the reproduction time of the entire audio signal. This target time is shorter than the reproduction time obtained when the entire audio signal is reproduced at normal (1x) speed. The acoustic analysis unit 91 separates the input audio signal into speech sections and non-speech sections. For the audio signal of a speech section sandwiched between non-speech sections of at least a certain length, the speed conversion unit 92 performs speed conversion so that the beginning of the section is slower than a predetermined reproduction speed and the speed gradually returns to the predetermined reproduction speed toward the end. This speed conversion in the speed conversion unit 92 lengthens the reproduction time of the speech sections, and as a result the reproduction time of the entire audio signal is delayed relative to the target time. The non-speech section length control unit 93 therefore refers to the delay time information output from the speed conversion unit 92 and processes the non-speech sections to shorten the delay. Specifically, the non-speech section length control unit 93 deletes or compresses non-speech sections to shorten the delay time. The audio signal of the speech sections converted by the speed conversion unit 92 and the audio signal of the non-speech sections processed by the non-speech section length control unit 93 are combined by the synthesis unit 94 and output from the synthesis unit 94.

Patent Document 1: JP 2001-222300 A (pages 1-2, FIG. 1)

The proportion of speech sections in an input audio signal differs from signal to signal. In the conventional audio reproduction device, however, the above speed conversion is applied uniformly to the speech sections regardless of that proportion, and the non-speech sections are deleted or compressed to achieve the target time. Consequently, when the input audio signal contains many speech sections, for example, the uniform speed conversion of the speech sections lengthens the delay arising in the speed conversion unit 92. A longer delay in turn requires more deletion or compression of the non-speech sections, so more information is lost and the reproduced sound becomes harder to follow. The conventional audio reproduction device thus could not perform speed conversion appropriate to the input audio signal while achieving the target time.

Accordingly, an object of the present invention is to provide an audio reproduction device capable of performing speed conversion appropriate to the input audio signal while achieving the target time.

A first invention is an audio reproduction device that changes the reproduction speed of the audio signal of input content and reproduces it in a predetermined reproduction time, the device comprising: discrimination means for discriminating, in the audio signal, between speech sections that contain speech and non-speech sections that do not; speech content rate calculation means for calculating, based on the discrimination result of the discrimination means, a speech content rate indicating the proportion of speech sections in the audio signal; speed ratio calculation means for calculating, based on the speech content rate, a speed ratio for the speech sections and a speed ratio for the non-speech sections, each relative to the reproduction speed preset for the audio signal, so that the reproduction time of the audio signal equals the predetermined reproduction time; and speed conversion means for receiving the audio signal and converting the reproduction speeds of the speech sections and non-speech sections in it based on the respective speed ratios.
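The relationship between the speech content rate, the target time, and the two speed ratios can be illustrated with a minimal sketch. It is not the claimed calculation: it only uses the time budget implied by the definitions, namely that if a fraction `rate` of the signal is speech, the companding ratio `c` (target time divided by the time at the preset speed) must satisfy `rate / v_s + (1 - rate) / v_n = c`. The names `Section`, `speech_content_rate`, and `non_speech_speed_ratio` are hypothetical, and choosing the speech speed ratio first and solving for the non-speech one is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Section:
    start: float      # seconds
    end: float        # seconds
    is_speech: bool   # result of the discrimination step


def speech_content_rate(sections):
    """Fraction of the total duration that was judged to be speech."""
    total = sum(s.end - s.start for s in sections)
    speech = sum(s.end - s.start for s in sections if s.is_speech)
    return speech / total if total > 0 else 0.0


def non_speech_speed_ratio(rate, companding_ratio, speech_speed_ratio):
    """Solve rate / v_s + (1 - rate) / v_n = companding_ratio for v_n.

    Assumption: the speech speed ratio v_s is chosen first and the
    non-speech sections absorb the rest of the time budget.
    """
    remaining = companding_ratio - rate / speech_speed_ratio
    if remaining <= 0:
        raise ValueError("speech alone already uses up the target time")
    return (1.0 - rate) / remaining


# 6 s of speech followed by 4 s of non-speech, to be played in 8 s.
sections = [Section(0.0, 6.0, True), Section(6.0, 10.0, False)]
rate = speech_content_rate(sections)
v_n = non_speech_speed_ratio(rate, companding_ratio=0.8, speech_speed_ratio=1.0)
```

With speech kept at the preset speed, the 4 s of non-speech must fit into the remaining 2 s, so `v_n` comes out at about 2. A signal with a higher speech content rate would leave less room, which is why the speech speed ratio also has to depend on the rate.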

In a second invention based on the first invention, the speed ratio calculation means includes: speed ratio condition setting means for calculating a speech average speed ratio and a non-speech average speed ratio and setting them as speed ratio conditions, using correspondence information that indicates the correspondence among a companding ratio, the speech content rate, and methods of calculating the speech average speed ratio (the average speed ratio of the speech sections) and the non-speech average speed ratio (the average speed ratio of the non-speech sections), the companding ratio indicating the ratio by which the reproduction time at the reproduction speed preset for the audio signal is compressed or expanded to the predetermined reproduction time; and speed ratio determination means for calculating the speed ratios of the speech sections and non-speech sections by determining the speed ratio in each subdivision of a speech section as a speed ratio based on the speech average speed ratio and the speed ratio in each subdivision of a non-speech section as a speed ratio based on the non-speech average speed ratio.

In a third invention based on the second invention, the audio reproduction device further comprises speech section length calculation means for calculating, as a speech section length, the time from the start time to the end time of each speech section discriminated by the discrimination means. The speed ratio condition setting means further sets, as a speed ratio condition, an end speed ratio at the end time of a speech section according to the speech average speed ratio, and the speed ratio determination means determines, for each speech section discriminated by the discrimination means, the speed ratio in each subdivision of the speech section based on the speech average speed ratio, the speech section length, and the end speed ratio.

In a fourth invention based on the second invention, the speed ratio determination means determines, for each speech section discriminated by the discrimination means, the speed ratio in each subdivision of the speech section according to an elapsed ratio obtained by dividing the time elapsed since the start time of the speech section by the speech section length.

In a fifth invention based on the second invention, the speed ratio determination means determines the speed ratio in each subdivision of a speech section so that the reproduction speed increases as time elapses from the start time of the speech section.
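One way to picture how an average speed ratio, an end speed ratio, and an elapsed ratio could combine into a speed that rises over a speech section is the sketch below. The text does not state the function used, so the linear ramp is purely an assumption: the reciprocal of the speed ratio (time spent per unit of input) falls linearly with the elapsed ratio, its mean is fixed at `1 / average_ratio`, and its final value is `1 / end_ratio`. The function name `speech_speed_ratio` is hypothetical.

```python
def speech_speed_ratio(elapsed, section_length, average_ratio, end_ratio):
    """Speed ratio at `elapsed` seconds into a speech section.

    Assumption: 1 / speed_ratio is linear in the elapsed ratio, so that
    the section as a whole plays in section_length / average_ratio and
    the speed ratio at the end of the section equals end_ratio.
    """
    # Elapsed ratio: elapsed time divided by the section length, in [0, 1].
    p = min(max(elapsed / section_length, 0.0), 1.0)
    slow_end = 1.0 / end_ratio
    # The mean of a linear ramp is the midpoint of its two end values.
    slow_start = 2.0 / average_ratio - slow_end
    if slow_start <= 0:
        raise ValueError("end_ratio is too small for this average_ratio")
    return 1.0 / (slow_start + (slow_end - slow_start) * p)


# A 10 s speech section with average speed ratio 1.6 and end speed ratio 2.0.
start_speed = speech_speed_ratio(0.0, 10.0, 1.6, 2.0)
end_speed = speech_speed_ratio(10.0, 10.0, 1.6, 2.0)
```

Because the function depends only on the elapsed ratio, the same curve applies to short and long sections. The speed rises from the start only when `end_ratio` is at least `average_ratio`; the sketch does not enforce that.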

In a sixth invention based on the second invention, the speed ratio condition setting means calculates the speech average speed ratio and the non-speech average speed ratio using correspondence information that includes at least one method of calculating those ratios and that indicates a different correspondence depending on the type of content constituted by the audio signal.

In a seventh invention based on the second invention, the speed ratio condition setting means creates the correspondence information so that the speech average speed ratio and the non-speech average speed ratio fall within a range specified by the user, and calculates the speech average speed ratio and the non-speech average speed ratio using that correspondence information.

In an eighth invention based on the second invention, the correspondence information indicates a correspondence between the speech content rate and the methods of calculating the speech average speed ratio and the non-speech average speed ratio that differs depending on the magnitude of the speech content rate.

In a ninth invention based on the second invention, the speed ratio condition setting means calculates the speech average speed ratio and the non-speech average speed ratio using correspondence information that includes at least one method of calculating those ratios and that indicates a different correspondence depending on the user's purpose of use.

In a tenth invention based on the second invention, the correspondence information indicates a correspondence between the companding ratio and the methods of calculating the speech average speed ratio and the non-speech average speed ratio that differs depending on the magnitude of the companding ratio.

In an eleventh invention based on the first invention, the audio reproduction device further comprises storage means for storing in advance the audio signal constituting the entire content and the discrimination result obtained by the discrimination means for that audio signal. The speech content rate calculation means calculates, based on the discrimination result stored in advance in the storage means, a speech content rate indicating the proportion of speech sections in the audio signal constituting the entire content.

In a twelfth invention based on the first invention, the speech content rate calculation means sequentially calculates the speech content rate used by the speed ratio calculation means at the time of its calculation, based on discrimination results obtained in the past by the discrimination means.

In a thirteenth invention based on the twelfth invention, the speech content rate calculation means sequentially calculates, at every second predetermined time that is no longer than a first predetermined time, the speech content rate used by the speed ratio calculation means, based on the discrimination results obtained during the first predetermined time preceding the calculation by the speed ratio calculation means. The audio reproduction device further comprises companding ratio calculation means for sequentially calculating a companding ratio for each second predetermined time based on the amount of data input to the speed conversion means, the amount of data output from the speed conversion means, and the companding ratio indicating the ratio by which the reproduction time at the reproduction speed preset for the audio signal is compressed or expanded to the predetermined reproduction time. The speed ratio calculation means includes: speed ratio condition setting means for calculating the speech average speed ratio and the non-speech average speed ratio using correspondence information that indicates the correspondence among the companding ratio for each second predetermined time, the speech content rate for each second predetermined time, and methods of calculating the speech average speed ratio (the average speed ratio of the speech sections) and the non-speech average speed ratio (the average speed ratio of the non-speech sections), and for sequentially setting the calculated ratios as speed ratio conditions at every second predetermined time; and speed ratio determination means for calculating the speed ratios of the speech sections and non-speech sections contained in the second predetermined time, based on the speed ratio conditions set sequentially at every second predetermined time, by determining the speed ratio in each subdivision of a speech section as a speed ratio based on the speech average speed ratio and the speed ratio in each subdivision of a non-speech section as a speed ratio based on the non-speech average speed ratio. The speed conversion means converts the reproduction speeds of the speech sections and non-speech sections in the audio signal based on the respective speed ratios calculated by the speed ratio determination means.
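The sequential calculation described in the twelfth and thirteenth inventions can be sketched roughly as follows. Everything here is an illustrative assumption: discrimination results are modelled as one boolean per fixed-length frame, the "first predetermined time" is the window length in frames, and the caller is expected to read `rate()` once per "second predetermined time". The companding ratio update simply divides the output time still available by the input time still to be played; the text names the inputs to that calculation but not its formula. `RunningSpeechContent` and `updated_companding_ratio` are hypothetical names.

```python
from collections import deque


class RunningSpeechContent:
    """Speech content rate over the most recent `window_frames` decisions."""

    def __init__(self, window_frames):
        # deque(maxlen=...) silently drops the oldest decision when full.
        self.window = deque(maxlen=window_frames)

    def push(self, is_speech):
        self.window.append(bool(is_speech))

    def rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0


def updated_companding_ratio(total_in, target_out, consumed_in, produced_out):
    """Companding ratio needed for the part of the signal not yet played.

    Assumption: all four arguments are durations in seconds, derived from
    the data amounts entering and leaving the speed conversion step.
    """
    remaining_in = total_in - consumed_in
    if remaining_in <= 0:
        raise ValueError("no input left to play")
    return (target_out - produced_out) / remaining_in


tracker = RunningSpeechContent(window_frames=4)
for decision in [True, True, False, False, True]:
    tracker.push(decision)
window_rate = tracker.rate()   # computed over the last four decisions only

# 100 s of content to be played in 80 s; 50 s consumed, 45 s produced so far.
ratio = updated_companding_ratio(100.0, 80.0, 50.0, 45.0)
```

In this example playback has fallen behind the original 0.8 schedule, so the ratio for the remainder tightens to 0.7. A real device would presumably also clamp the result to a usable range.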

In a fourteenth invention based on the thirteenth invention, the audio reproduction device further comprises statistic calculation means for calculating a statistic for suppressing the change, from one second predetermined time to the next, in the speed ratio in each subdivision of a speech section determined by the speed ratio determination means. The speed ratio condition setting means calculates a speech average speed, and an end speed ratio of the speech section based on that speech average speed, from the statistic, the speech content rate for each second predetermined time, and the companding ratio for each second predetermined time, and sequentially sets the calculated speech average speed and end speed ratio as speed ratio conditions at every second predetermined time.

In a fifteenth invention based on the fourteenth invention, the statistic calculation means calculates, as the statistic, a speech content rate indicating the proportion of speech sections contained between the start time of the content and the time of the calculation by the speed ratio calculation means, based on the discrimination results from the start time of the content up to that time.

In a sixteenth invention based on the thirteenth invention, the audio reproduction device further comprises speech section length calculation means for sequentially calculating, based on discrimination results obtained in the past by the discrimination means, the speech section length (the time from the start time to the end time of a speech section) used by the speed ratio calculation means at the time of its calculation. The speed ratio condition setting means further sets sequentially, at every second predetermined time, an end speed ratio at the end time of a speech section according to the speech average speed ratio as a speed ratio condition, and the speed ratio determination means determines, for each speech section discriminated by the discrimination means, the speed ratio in each subdivision of the speech section based on the speech average speed ratio, the speech section length, and the end speed ratio.

In a seventeenth invention based on the sixteenth invention, the speech section length used by the speed ratio calculation means is calculated based only on those speech section lengths, among the speech section lengths calculated from the start and end times of speech sections discriminated in the past by the discrimination means, that are at least a predetermined section length.

In an eighteenth invention based on the sixteenth invention, the speech section length used by the speed ratio calculation means is calculated based on the maximum of the speech section lengths calculated from the start and end times of speech sections discriminated in the past by the discrimination means.
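When the signal is processed sequentially, the length of the speech section currently being played is not yet known, so the seventeenth and eighteenth inventions derive it from past sections. The sketch below shows both selection rules. The function name `estimate_section_length` is hypothetical, and taking the mean of the retained lengths is an assumption; the text only says the estimate is "based on" the lengths that meet the threshold, or on their maximum.

```python
def estimate_section_length(past_lengths, min_length=None, use_max=False):
    """Estimate the current speech section length from past section lengths.

    min_length: if given, ignore past sections shorter than this
                (the threshold rule of the seventeenth invention).
    use_max:    if True, return the longest retained length
                (the maximum rule of the eighteenth invention);
                otherwise return the mean, which is an assumption.
    """
    candidates = [length for length in past_lengths
                  if min_length is None or length >= min_length]
    if not candidates:
        return None   # nothing usable yet; the caller needs a default
    if use_max:
        return max(candidates)
    return sum(candidates) / len(candidates)


past = [0.4, 3.0, 5.0]   # seconds; the 0.4 s section is likely an interjection
typical = estimate_section_length(past, min_length=1.0)
longest = estimate_section_length(past, use_max=True)
```

Dropping very short sections keeps brief interjections from dragging the estimate down, while the maximum rule gives a conservative estimate.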

In a nineteenth invention based on the first invention, the audio reproduction device further comprises storage means for storing in advance a predetermined time's worth of the audio signal and the discrimination result obtained by the discrimination means for that portion of the audio signal. The speech content rate calculation means calculates, based on the discrimination result stored in advance in the storage means, a speech content rate indicating the proportion of speech sections in the predetermined time's worth of the audio signal.

In a twentieth invention based on the first invention, the discrimination means discriminates, in the audio signal, among specific event sections that contain a specific event sound and the speech sections and non-speech sections other than the specific event sections. The audio reproduction device further comprises specific event content rate calculation means for calculating, based on the discrimination result of the discrimination means, a specific event content rate indicating the proportion of specific event sections in the audio signal. The speed ratio calculation means calculates the speed ratio of the specific event sections relative to the reproduction speed preset for the audio signal based on the specific event content rate, and calculates, based on the speech content rate, the speed ratios of the speech sections and non-speech sections other than the specific event sections relative to that preset reproduction speed so that the reproduction time of the audio signal equals the predetermined reproduction time. The speed conversion means converts the reproduction speeds of the specific event sections and of the speech sections and non-speech sections other than the specific event sections based on the speed ratios.

A twenty-first invention is an audio reproduction method that changes the reproduction speed of the audio signal of input content and reproduces it in a predetermined reproduction time, the method comprising: a discrimination step of discriminating, in the audio signal, between speech sections that contain speech and non-speech sections that do not; a speech content rate calculation step of calculating, based on the discrimination result of the discrimination step, a speech content rate indicating the proportion of speech sections in the audio signal; a speed ratio calculation step of calculating, based on the speech content rate, a speed ratio for the speech sections and a speed ratio for the non-speech sections, each relative to the reproduction speed preset for the audio signal, so that the reproduction time of the audio signal equals the predetermined reproduction time; and a speed conversion step of receiving the audio signal and converting the reproduction speeds of the speech sections and non-speech sections in it based on the respective speed ratios.

A twenty-second invention is a program to be executed by a computer of an audio reproduction device that changes the reproduction speed of the audio signal of input content and reproduces it in a predetermined reproduction time, the program causing the computer to execute: a discrimination step of discriminating, in the audio signal, between speech sections that contain speech and non-speech sections that do not; a speech content rate calculation step of calculating, based on the discrimination result of the discrimination step, a speech content rate indicating the proportion of speech sections in the audio signal; a speed ratio calculation step of calculating, based on the speech content rate, a speed ratio for the speech sections and a speed ratio for the non-speech sections, each relative to the reproduction speed preset for the audio signal, so that the reproduction time of the audio signal equals the predetermined reproduction time; and a speed conversion step of receiving the audio signal and converting the reproduction speeds of the speech sections and non-speech sections in it based on the respective speed ratios.

A twenty-third invention is a computer-readable recording medium on which the program of the twenty-second invention is recorded.

A twenty-fourth invention is an integrated circuit that changes the reproduction speed of the audio signal of input content and reproduces it in a predetermined reproduction time, the integrated circuit comprising: discrimination means for discriminating, in the audio signal, between speech sections that contain speech and non-speech sections that do not; speech content rate calculation means for calculating, based on the discrimination result of the discrimination means, a speech content rate indicating the proportion of speech sections in the audio signal; speed ratio calculation means for calculating, based on the speech content rate, a speed ratio for the speech sections and a speed ratio for the non-speech sections, each relative to the reproduction speed preset for the audio signal, so that the reproduction time of the audio signal equals the predetermined reproduction time; and speed conversion means for receiving the audio signal and converting the reproduction speeds of the speech sections and non-speech sections in it based on the respective speed ratios.

According to the first invention, calculating the speech content rate of the input audio signal makes it possible to calculate optimal speed ratios for the speech sections and non-speech sections, based on that speech content rate, while achieving the predetermined reproduction time. That is, calculating the speech content rate reveals the proportion of speech sections in the input audio signal, so that optimal speed ratios for achieving the predetermined reproduction time can be calculated for both the speech sections and the non-speech sections. As a result, whatever audio signal is input, easy-to-follow reproduction can be achieved with less of the discomfort caused by discontinuity of the reproduced content or loss of information.

According to the second invention, the speech average speed ratio and the non-speech average speed ratio are calculated based on the speech content rate and the companding ratio, which indicates the ratio of compression or expansion to the predetermined reproduction time, and the speed ratio in each subdivision of the speech sections and non-speech sections is determined from them. Optimal speed ratios for the speech sections and non-speech sections, based on the speech content rate of the input audio signal, can thus be calculated while achieving the predetermined reproduction time.

According to the third invention, the speed ratio in each subdivision of a speech section is determined based on the speech average speed ratio, the speech section length, and the end speed ratio. This keeps the speed ratio at the end time of each speech section fixed at the end speed ratio while allowing a speed ratio suited to the speech section length to be determined from the beginning of a sentence to its end, which reduces the difficulty of listening and the unnaturalness caused by faster reproduction at the end of a speech section.

According to the fourth invention, determining the speed ratio in each subdivision of a speech section according to the elapsed ratio allows that speed ratio to be determined with a simple function regardless of how long or short the speech section is.

According to the fifth invention, the reproduction speed at the beginning of a speech section is slow relative to the rest of the section, which prevents comprehension of the reproduced content from dropping because the beginning was missed.

According to the sixth aspect, different speech and non-speech average speed ratios can be calculated depending on the type of content, so the speed ratio in each speech section and non-speech section can be matched more accurately to the content.

According to the seventh aspect, the speed ratio in each speech section and non-speech section is based on speech and non-speech average speed ratios that lie within a range specified by the user, so speed conversion can be performed in accordance with the user's listening ability and preferences.

According to the eighth aspect, optimal speed ratios for the speech and non-speech sections can be calculated based on the speech content rate of the input audio signal while achieving the companding ratio that represents the predetermined reproduction time.

According to the ninth aspect, the method of calculating the speech and non-speech average speed ratios can be changed according to the user's purpose. Beyond fast and slow listening, this allows various kinds of control over the output time of the audio signal, including insertion and pausing. As a result, content need not be created separately for viewing, for grasping an overview, for language learning, for transcription, and so on; the same content can be used for all of these purposes.

According to the tenth aspect, the method used to calculate the speech and non-speech average speed ratios differs depending on the magnitude of the companding ratio. A speed ratio suited to the user's purpose can therefore be determined, for example for grasping an overview when the companding ratio is low and for learning when it is high. The user can thus use the device without having to think about switching devices for each purpose, and speed conversion can be performed in line with the user's viewing needs.

According to the eleventh aspect, calculating the speech content rate over the entire content makes it possible to calculate highly accurate speed ratios for the speech and non-speech sections.

According to the twelfth aspect, processing is possible without providing storage means, so there is no need to wait for the audio signal to be accumulated in storage means, and speed conversion can be performed in real time.

According to the thirteenth aspect, the speech content rate over the first predetermined time is reflected in the second predetermined time, so the speed ratios of the speech and non-speech sections can be calculated in a way that promptly reflects changes in the speech content rate. In addition, even if calculating the speech content rate from the first predetermined time introduces an error in the reproduction time, speed conversion is performed using the companding ratio for each second predetermined time, so the error can be eliminated in subsequent speed conversion.

According to the fourteenth aspect, a statistic is used to suppress the change, at every second predetermined time, in the speed ratio in each part of a speech section. Even when the speech content rate over the first predetermined time becomes locally high, the speed ratio in the speech section is prevented from rising too far. As a result, speed conversion can handle a variety of content with different speech content rates.

According to the fifteenth aspect, a speech content rate whose tendency to change over time differs from that of the speech content rate over the first predetermined time is used as the statistic. Even when the speech content rate over the first predetermined time fluctuates locally, the fluctuation is suppressed, enabling speed conversion that is easy to listen to.

According to the sixteenth aspect, the speech section length is calculated sequentially, so appropriate speed conversion can be applied to a speech section even when its end time is not yet known, and more accurate real-time processing can be realized.

According to the seventeenth aspect, the speech section length is calculated based only on speech section lengths equal to or longer than a predetermined section length, so an average speech section length can be calculated that excludes backchannel responses such as "yes" and "uh-huh" and fillers such as "um".
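The filtering described for the seventeenth aspect can be sketched as follows. This is only an illustration: the function name `average_section_length` and the choice to return `None` when no section qualifies are assumptions introduced here, not details given in the text.

```python
def average_section_length(lengths, min_length):
    """Average of the speech section lengths that are at least min_length.

    lengths: speech section lengths in seconds.
    min_length: the predetermined section length; shorter sections
        (backchannel responses, fillers) are ignored.
    Returns None when no section is long enough (an assumption of this sketch).
    """
    long_enough = [x for x in lengths if x >= min_length]
    if not long_enough:
        return None
    return sum(long_enough) / len(long_enough)
```

With section lengths of 0.3, 4.0, 0.5, and 6.0 seconds and a minimum of 1.0 second, only the 4.0 and 6.0 second sections contribute to the average.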

According to the eighteenth aspect, the speech section length used by the speed ratio calculation means is calculated based on the maximum of the past speech section lengths. This reduces the proportion converted at the end speed ratio at the end time of a speech section, which further lowers the average speed ratio of the speech sections and realizes speed conversion that is easy to listen to.

According to the nineteenth aspect, speed conversion can be performed in units of a predetermined time's worth of audio signal. Speed conversion can therefore be performed even while content is being recorded, without waiting for the whole recording to finish. In addition, because the results of discriminating speech and non-speech sections are accumulated in the storage means, a measured value of the speech content rate can be calculated, and speed conversion can be performed with a more suitable speed ratio.

According to the twentieth aspect, a specific-event content rate is calculated and used to calculate the speed ratio of specific-event sections. For example, when the specific-event sound is music, music sections in an audio signal such as a music program can be reproduced at a slower reproduction speed than the speech and non-speech sections. As a result, speed conversion that gives priority to music can be performed.

Embodiments of the present invention are described below with reference to the drawings.

(First Embodiment)
First, a speech reproduction device according to the first embodiment of the present invention is described with reference to FIG. 1. FIG. 1 is a block diagram showing a configuration example of the speech reproduction device according to the first embodiment. In FIG. 1, the speech reproduction device comprises a speech/non-speech discrimination unit 11, a storage unit 12, a speech content rate calculation unit 13, a speed ratio condition setting unit 14, a speech section length calculation unit 15, a speed ratio determination unit 16, and a speed conversion unit 17. This embodiment describes a speech reproduction device that stores the audio signal to be speed-converted in the storage unit 12 in advance, in units of content, and uses the stored audio signal to perform reproduction at a changed reproduction speed. In the following description, a section that contains speech is called a speech section, and any other section, that is, a section that does not contain speech, is called a non-speech section.

The speech/non-speech discrimination unit 11 receives an audio signal and discriminates between speech sections and non-speech sections. Together with this discrimination result, it outputs the start and end times of each speech section. The input audio signal is an audio signal recorded on a CD, DVD, memory, hard disk, or the like. The audio signal may also be one distributed over a communication line such as the Internet or one received by broadcast, or it may be generated on the spot (for example by speech synthesis), recorded with a microphone, or output through a communication device such as a telephone. One method of discriminating speech and non-speech sections is to calculate the power of the audio signal and compare it with a threshold. Another is to measure the degree of change of the cepstrum, as described in "Segmentation of Speech and Music Using Cepstrum Flux" (Takayuki Uchida, Masami Yamashita, and Masahide Sugiyama, IEICE Technical Report SP2000-17). The method based on the degree of cepstrum change can discriminate speech even when background music is superimposed on it.
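The power-threshold method mentioned above can be sketched as follows. This is a minimal illustration, not the discrimination unit's actual implementation: the function names, the fixed frame length, the threshold, and the use of mean squared amplitude as the power measure are all assumptions introduced here.

```python
def frame_power(samples, frame_len):
    """Mean squared amplitude of each full frame of frame_len samples."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def speech_sections(samples, sample_rate, frame_len, threshold):
    """Return (start_s, end_s) pairs for runs of frames whose power exceeds threshold."""
    powers = frame_power(samples, frame_len)
    frame_s = frame_len / sample_rate  # duration of one frame in seconds
    sections = []
    start = None  # index of the first frame of the current speech run
    for idx, power in enumerate(powers):
        is_speech = power > threshold
        if is_speech and start is None:
            start = idx
        elif not is_speech and start is not None:
            sections.append((start * frame_s, idx * frame_s))
            start = None
    if start is not None:  # signal ended while still inside a speech run
        sections.append((start * frame_s, len(powers) * frame_s))
    return sections
```

The returned pairs correspond to the start and end times that the discrimination unit outputs together with its discrimination result. A plain power threshold would also classify loud music as speech, which is why the text mentions the cepstrum-based method for signals with background music.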

The storage unit 12 is a readable and writable recording medium such as a hard disk, a DVD, or a memory medium (for example, an SD card). The storage unit 12 stores, in units of content, the same audio signal that is input to the speech/non-speech discrimination unit 11. It also stores the discrimination results and the start and end times of the speech sections output from the speech/non-speech discrimination unit 11. Consider, for example, recording a TV broadcast. The audio signal and video signal that make up the TV broadcast are stored in the storage unit 12. In parallel with this storing, the speech/non-speech discrimination unit 11 performs discrimination, and the discrimination results and the start and end times of the speech sections are stored in the storage unit 12. For each piece of content, the storage unit 12 stores the audio signal, the video signal, the discrimination results, and the start and end times of the speech sections in association with one another. The audio and video signals may be in any format.

The speech content rate calculation unit 13 calculates the speech content rate, which indicates the proportion of speech sections in the audio signal of the content. Specifically, for each piece of content stored in the storage unit 12, it calculates the speech content rate using the corresponding discrimination results and speech section start and end times. The speech content rate is the sum of the lengths of the speech sections contained in a predetermined time of audio signal, divided by that predetermined time. In this embodiment, the speech content rate is the sum of the lengths of the speech sections in the audio signal of the content divided by the content length. Here, content means the whole of one program to be speed-converted. The content length is therefore normally equal to the program length, typically 30 minutes or one hour. When the user designates part of a program as the target of speed conversion, that part may be treated as the content.
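The definition above amounts to a single division. The sketch below assumes the speech sections are available as (start, end) pairs in seconds; the function name `speech_content_rate` is illustrative and not taken from the text.

```python
def speech_content_rate(sections, content_length):
    """Sum of speech section lengths divided by the content length.

    sections: list of (start, end) times of speech sections, in seconds.
    content_length: length of the content (for example, one program), in seconds.
    """
    if content_length <= 0:
        raise ValueError("content_length must be positive")
    speech_total = sum(end - start for start, end in sections)
    return speech_total / content_length
```

For a 30-minute program (1800 seconds) containing 900 seconds of speech in total, the speech content rate is 0.5.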

The speech content rate varies from content to content. For example, when content is viewed by genre as in FIG. 2, the speech content rate differs between genres. FIG. 2 shows the speech content rate by genre; the horizontal axis indicates the genre and the vertical axis the speech content rate. The values in FIG. 2 were obtained by taking, from the programs broadcast in one week, the six highest-rated programs in each genre and averaging the speech content rates of those programs within each genre. News has a speech content rate of about 60%, the highest of the five genres. Sports and music have a speech content rate of about 40%, a gap of nearly 20 percentage points from news. Even within one genre, the speech content rate shows some variation, as shown in FIG. 3. FIG. 3 shows the mean and standard deviation of the speech content rate for each genre. For drama and animation the standard deviation is 16.2, higher than for the other genres. The speech content rate thus varies from content to content.

Consequently, with the conventional technique, which does not take the speech content rate into account, the delay behind the target time becomes long for content with a high speech content rate, such as news programs, as described above. To cancel that delay, speech sections are partly reproduced at high speed or deleted, which makes the reproduced audio signal hard to follow. In this embodiment, by contrast, the speech content rate of the content is calculated. This makes it possible to calculate speed ratios for the speech and non-speech sections that suit the content without falling behind the target time, and to realize reproduction that is easy to listen to and not locally uneven. The method of calculating the speed ratios from the speech content rate is described in detail later.

The speed ratio condition setting unit 14 takes the speech content rate and the target companding ratio as inputs, calculates the average speed ratio of the speech sections, the average speed ratio of the non-speech sections, and the terminal speed ratio of the speech sections, and sets these as the speed ratio conditions.

The companding ratio is the reproduction time length after speed conversion divided by the reproduction time length before speed conversion. For reproduction at normal speed the companding ratio is 1; for reproduction at double speed it is 0.5. When the companding ratio is between 0 and 1, the reproduction time is compressed and the content is reproduced faster than normal speed. When the companding ratio is greater than 1, the reproduction time is expanded and the content is reproduced slower than normal speed. The target companding ratio indicates how much the reproduction time of the content to be speed-converted is to be compressed or expanded. It takes a value between 0 and 1 for compression and a value of 1 or more for expansion. The target companding ratio may be input by the user or set in the device in advance. The user need not input the target companding ratio directly; the user may instead input a target time for reproducing the content, in which case the target companding ratio is obtained by dividing the target time by the reproduction time length before speed conversion. The speed ratio is the ratio of a speed to normal speed and is the reciprocal of the companding ratio. The terminal speed ratio of a speech section is the speed ratio at the end time of that speech section.
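The two relations defined above, the target companding ratio obtained from a target time and the speed ratio as the reciprocal of the companding ratio, can be written out as a short sketch. The function names are illustrative and not taken from the text.

```python
def target_companding_ratio(target_time, original_time):
    """Target reproduction time divided by the reproduction time before conversion."""
    if original_time <= 0 or target_time <= 0:
        raise ValueError("times must be positive")
    return target_time / original_time


def speed_ratio(companding_ratio):
    """Speed relative to normal speed; the reciprocal of the companding ratio."""
    return 1.0 / companding_ratio
```

A user who wants to watch a 30-minute program (1800 seconds) in 15 minutes (900 seconds) obtains a target companding ratio of 0.5, which corresponds to an overall speed ratio of 2, that is, double speed.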

Next, the method of calculating the average speed ratios of the speech and non-speech sections is described. The speed ratio condition setting unit 14 calculates the average speed ratios using a preset speed ratio calculation distribution. The speed ratio calculation distribution indicates which calculation pattern is used to calculate the average speed ratios for a given speech content rate and target companding ratio. In other words, it is correspondence information that associates the speech content rate and the target companding ratio with a method of calculating the average speed ratios of the speech and non-speech sections.

The calculation patterns are described below. The average speed ratios of the speech and non-speech sections are calculated so that the target companding ratio is achieved; specifically, they are calculated so that Expression (1) is satisfied.

Here, S is the speech content rate, Vm1 is the average speed ratio of the speech sections, Vm2 is the average speed ratio of the non-speech sections, and E is the target companding ratio. Five calculation patterns, a to e, are possible, as shown in FIG. 4. FIG. 4 shows the five calculation patterns. Their conditions are as follows.
a: With the average speed ratio of the non-speech sections fixed at Vm2 = An, the average speed ratio Vm1 of the speech sections is calculated from the given speech content rate S and target companding ratio E so as to satisfy Expression (1), subject to the condition Vm1 ≤ An.
b: With the average speed ratio of the speech sections fixed at Vm1 = Bs, the average speed ratio Vm2 of the non-speech sections is calculated from the given speech content rate S and target companding ratio E so as to satisfy Expression (1), subject to the condition Vm2 ≥ Bs.
c: With the average speed ratios of the speech and non-speech sections set equal, Vm1 = Vm2, both are calculated from the given speech content rate S and target companding ratio E so as to satisfy Expression (1).
d: With the average speed ratio of the non-speech sections fixed at Vm2 = Dn, the average speed ratio Vm1 of the speech sections is calculated from the given speech content rate S and target companding ratio E so as to satisfy Expression (1), subject to the condition Vm1 ≥ Dn.
e: With the average speed ratio of the speech sections fixed at Vm1 = Es, the average speed ratio Vm2 of the non-speech sections is calculated from the given speech content rate S and target companding ratio E so as to satisfy Expression (1), subject to the condition Vm2 ≤ Es.
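Expression (1) itself is not reproduced in this text. The following is a reconstruction inferred from the definitions given earlier (the companding ratio is the reproduction time after conversion divided by the time before conversion, and the speed ratio is its reciprocal); it should be read as an assumption, not as the published expression. A fraction S of the content is reproduced at speed ratio Vm1 and the remaining fraction 1 − S at Vm2, so the overall companding ratio would be:

```latex
E = \frac{S}{V_{m1}} + \frac{1 - S}{V_{m2}}
```

Under this assumed form, each pattern leaves one unknown. Writing F for the fixed value (An, Bs, Dn, or Es), the patterns solve to:

```latex
\begin{aligned}
\text{patterns a, d } (V_{m2} = F):&\quad V_{m1} = \frac{S}{E - \dfrac{1 - S}{F}} \\
\text{patterns b, e } (V_{m1} = F):&\quad V_{m2} = \frac{1 - S}{E - \dfrac{S}{F}} \\
\text{pattern c } (V_{m1} = V_{m2}):&\quad V_{m1} = V_{m2} = \frac{1}{E}
\end{aligned}
```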

Thus, when the calculation pattern differs, the combination of the speech average speed ratio and the non-speech average speed ratio differs even for the same speech content rate and target companding ratio. The speed ratio calculation distribution is therefore used to decide which calculation pattern to select for a given speech content rate and target companding ratio. The speed ratio calculation distribution is described below.

FIG. 5 shows an example of the speed ratio calculation distribution. In FIG. 5, the vertical axis is the speech content rate and the horizontal axis is the target companding ratio, and the regions of calculation patterns a to c are shown. A speed ratio calculation distribution is obtained by selecting particular calculation patterns from those described above and, for each selected pattern, setting the range of values that the speech and non-speech average speed ratios can take while satisfying the pattern's condition. In the distribution of FIG. 5, calculation patterns a to c are selected. In calculation pattern a, the average speed ratio of the non-speech sections is set to its maximum value, An = 4, and the average speed ratio Vm1 of the speech sections can take values in the range 1.3 ≤ Vm1 ≤ 2. This range satisfies the condition of pattern a (Vm1 ≤ An). In calculation pattern b, the average speed ratio of the speech sections is set to Bs = 1.3, and the average speed ratio Vm2 of the non-speech sections can take values in the range 1.3 ≤ Vm2 ≤ 4. This range satisfies the condition of pattern b (Vm2 ≥ Bs). In calculation pattern c, the average speed ratios of the speech and non-speech sections are set to Vm1 = Vm2, with 1 ≤ Vm1 ≤ 1.3. Selecting calculation patterns and setting the possible values of the average speed ratios in this way yields the speed ratio calculation distribution of FIG. 5. When the speech content rate and the target companding ratio fall within the region of calculation pattern a, the speech and non-speech average speed ratios are calculated by pattern a. The same applies to patterns b and c. The speed ratio calculation distribution thus determines which calculation pattern is used for a given speech content rate and target companding ratio. The unprocessable region at the far left of FIG. 5 is the region in which the target companding ratio is extremely small relative to the speech content rate, so that the target companding ratio cannot be achieved even when the speech and non-speech average speed ratios are raised to the maximum at which the user can still follow the audio.
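The pattern selection described for FIG. 5 can be sketched as follows. The constants An = 4, Bs = 1.3, and the upper limit of 2 for Vm1 come from the text. Two things are assumptions of this sketch: Expression (1) is taken to have the form E = S/Vm1 + (1 − S)/Vm2, which is inferred from the definitions of companding ratio and speed ratio because the expression is not reproduced in this text, and the region boundaries are derived from that form instead of being read from FIG. 5. The function name and return format are illustrative.

```python
AN = 4.0        # non-speech average speed ratio fixed in pattern a (FIG. 5)
BS = 1.3        # speech average speed ratio fixed in pattern b (FIG. 5)
VM1_MAX = 2.0   # upper limit of the speech average speed ratio in pattern a


def average_speed_ratios(s, e):
    """Return (pattern, vm1, vm2), or None in the unprocessable region.

    s: speech content rate (0..1); e: target companding ratio (0 < e <= 1).
    Assumes Expression (1) is e = s / vm1 + (1 - s) / vm2.
    """
    if not 0.0 <= s <= 1.0 or not 0.0 < e <= 1.0:
        raise ValueError("expected 0 <= s <= 1 and 0 < e <= 1")
    e_min = s / VM1_MAX + (1.0 - s) / AN   # fastest allowed: vm1 = 2, vm2 = 4
    e_ab = s / BS + (1.0 - s) / AN         # boundary between patterns a and b
    e_bc = 1.0 / BS                        # boundary between patterns b and c
    if e < e_min:
        return None                        # target cannot be achieved
    if e < e_ab:
        return ("a", s / (e - (1.0 - s) / AN), AN)
    if e < e_bc:
        return ("b", BS, (1.0 - s) / (e - s / BS))
    return ("c", 1.0 / e, 1.0 / e)
```

For a speech content rate of 0.5 and a target companding ratio of 0.6, this selects pattern b and gives a non-speech average speed ratio of about 2.32 with the speech average speed ratio held at 1.3. Target companding ratios above 1 (expansion) are outside the three patterns shown in FIG. 5 and are rejected by this sketch.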

In the speed ratio calculation distribution of FIG. 5, the higher the speech content rate, the larger the share of target companding ratios for which calculation pattern a is used. In pattern a of FIG. 5, the average speed ratio of the non-speech sections is set to the maximum value (An = 4). This allows the average speed ratio Vm1 of the speech sections to be made as slow as possible while still achieving the target companding ratio.

Likewise, in the speed ratio calculation distribution of FIG. 5, the lower the speech content rate, the larger the share of target companding ratios for which calculation pattern b is used. In pattern b of FIG. 5, the average speed ratio of the speech sections is set to the fixed value Bs = 1.3. This allows the average speed ratio Vm2 of the non-speech sections to be made as slow as possible while still achieving the target companding ratio. In this way, in the distribution of FIG. 5 the correspondence between the speech content rate and the calculation method differs according to the magnitude of the speech content rate.

When the target companding ratio is large, calculation pattern c is selected; that is, the speech and non-speech average speed ratios are made equal. When the target companding ratio is large, reproduction sounds more natural if speech and non-speech have the same average speed ratio. In this way, in the distribution of FIG. 5 the correspondence between the target companding ratio and the calculation method differs according to the magnitude of the target companding ratio.

In the speed ratio calculation distribution of FIG. 5, the possible values of the speech and non-speech average speed ratios are set so that the region is uniquely determined by the speech content rate and the target companding ratio. That is, they are set so that the average speed ratios are continuous across the boundary between adjacent calculation patterns. Specifically, the lower limit of the speech average speed ratio Vm1 in pattern a is 1.3, and the speech average speed ratio Bs in the adjacent pattern b is 1.3, so the average speed ratios are continuous at the boundary between patterns a and b. Likewise, the speech average speed ratio Bs in pattern b is 1.3, and the upper limit of the speech average speed ratio Vm1 in the adjacent pattern c is 1.3, so the average speed ratios are continuous at the boundary between patterns b and c.

Viewed in terms of how the speech and non-speech average speed ratios change continuously as the target companding ratio increases, the behavior is as follows. While the target companding ratio lies in the unprocessable region, no average speed ratios are calculated. While it lies in the region of calculation pattern a, the speech average speed ratio decreases from 2 to 1.3 as the target companding ratio increases, and the non-speech average speed ratio stays constant at 4. While it lies in the region of pattern b, the speech average speed ratio is constant at 1.3, and the non-speech average speed ratio decreases from 4 to 1.3 as the target companding ratio increases. While it lies in the region of pattern c, the speech and non-speech average speed ratios take the same value and decrease together from 1.3 to 1.

Because the speed ratio calculation distribution is set so that the average speed ratio is continuous at the boundaries where the calculation pattern switches, it avoids the problem of an abrupt speed change, and the resulting sense of unnaturalness, that would occur if the average speed ratio took discontinuous values.

FIG. 6 shows the target companding ratio, the average speed ratio of the voice section, and the average speed ratio of the non-voice section when the voice content rate is 0.5. As shown in FIG. 5, calculation pattern a is selected when the target companding ratio is between 0.375 and 0.501, calculation pattern b when it is between 0.501 and 0.769, and calculation pattern c when it is between 0.769 and 1. As described above, possible values of the average speed ratio are set for each calculation pattern in FIG. 5. The average speed ratios shown in FIG. 6 are therefore calculated from these calculation patterns and Expression (1).

When the target companding ratio is 0.1 or 0.3, it falls in the unprocessable region, as can be seen from the speed ratio calculation distribution in FIG. 5, so the speed ratios of the voice and non-voice sections are not calculated. When the target companding ratio is 0.4 or 0.5, the speed ratios are calculated with calculation pattern a; in this pattern the voice-section average speed ratio decreases as the target companding ratio increases. When the target companding ratio is 0.6 or 0.7, the speed ratios are calculated with calculation pattern b; in this pattern the non-voice-section average speed ratio decreases as the target companding ratio increases. When the target companding ratio is 0.9 or 1.0, the average speed ratios of the voice and non-voice sections are equal, and both decrease as the target companding ratio increases.
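The pattern selection described for FIG. 5 and FIG. 6 can be sketched in Python. Expression (1) is not reproduced in this section, so the sketch assumes it has the form R = r/Vm1 + (1 - r)/Vm2, where R is the target companding ratio and r is the voice content rate. For r = 0.5 this assumed form reproduces the 0.375 and 0.769 boundaries quoted above and places the a/b boundary at about 0.51. The function name and parameters are illustrative, not taken from the specification.

```python
def average_speed_ratios(r, R, v1_max=2.0, v1_knee=1.3, v2_max=4.0):
    """Return (pattern, Vm1, Vm2) for voice content rate r and target
    companding ratio R, or None in the unprocessable region.

    Assumption: Expression (1) is R = r/Vm1 + (1 - r)/Vm2, with 0 < r < 1.
    The limits 2.0, 1.3 and 4.0 are the FIG. 5 values quoted in the text.
    """
    r_min = r / v1_max + (1 - r) / v2_max    # fastest setting of pattern a
    r_ab = r / v1_knee + (1 - r) / v2_max    # boundary between patterns a and b
    r_bc = 1.0 / v1_knee                     # boundary between patterns b and c
    if R < r_min or R > 1.0:
        return None                          # outside the distribution
    if R < r_ab:                             # pattern a: Vm2 fixed, solve for Vm1
        return ("a", r / (R - (1 - r) / v2_max), v2_max)
    if R < r_bc:                             # pattern b: Vm1 fixed, solve for Vm2
        return ("b", v1_knee, (1 - r) / (R - r / v1_knee))
    return ("c", 1.0 / R, 1.0 / R)           # pattern c: Vm1 equals Vm2
```

Calling `average_speed_ratios(0.5, 0.3)` returns `None`, which corresponds to the unprocessable region described for FIG. 6.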

Next, the method of calculating the terminal speed ratio of the voice section is described. The speed ratio condition setting unit 14 calculates the terminal speed ratio of the voice section from the calculated average speed ratio of the voice section. FIG. 7 is a schematic diagram showing how the speed ratio changes over voice and non-voice sections. Voice section 1 is shorter than voice section 2. The vertical axis is the conversion speed ratio, and the horizontal axis is the elapsed time. The conversion speed ratio is the speed ratio used in the speed conversion processing of the speed conversion unit 17, and is given as the speed ratio in each of the sub-sections into which the voice and non-voice sections are subdivided. How the conversion speed ratio is determined is described later. As shown in FIG. 7, the speed ratio at the end time of a voice section is the same even when the voice section lengths differ. The speed ratio condition setting unit 14 calculates this speed ratio at the end time as the terminal speed ratio of the voice section. As is clear from FIG. 7, the speed ratios set within a voice section are slower than the terminal speed ratio. The terminal speed ratio Vend of the voice section is the voice-section average speed ratio Vm1 plus α; that is, Vend = Vm1 + α. Listening experiments showed that α = 0.2 is preferable and that Vend preferably does not exceed 2.0. In FIG. 7, the speed ratio of the non-voice section is constant at the average speed ratio Vm2end.
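As a minimal sketch of the terminal speed ratio calculation, the following Python function applies Vend = Vm1 + α with α = 0.2. Clamping the result at 2.0 is only one possible reading of the statement that Vend preferably does not exceed 2.0; the function name and default arguments are illustrative.

```python
def terminal_speed_ratio(vm1, alpha=0.2, vend_max=2.0):
    """Terminal speed ratio Vend = Vm1 + alpha for a voice section.

    Assumption: the preference that Vend not exceed 2.0 is implemented
    here as a simple clamp; the text does not say how the limit is enforced.
    """
    return min(vm1 + alpha, vend_max)
```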

As described above, the speed ratio condition setting unit 14 receives the voice content rate and the target companding ratio, calculates the average speed ratio of the voice section, the average speed ratio of the non-voice section, and the terminal speed ratio of the voice section, and sets these as the speed ratio conditions.

The voice section length calculation unit 15 receives the start and end times of each voice section and calculates the voice section length. The speed ratio determination unit 16 determines the conversion speed ratios of the voice and non-voice sections based on the speed ratio conditions set by the speed ratio condition setting unit 14 and on the voice section length. As described above, the conversion speed ratio is the speed ratio used in the speed conversion processing of the speed conversion unit 17, and is given as the speed ratio in each of the sub-sections into which the voice and non-voice sections are subdivided. However, when the speed ratio within a voice section is kept constant, or when the speed ratio is switched at fixed time intervals, the voice section length need not be calculated, so the voice section length calculation unit 15 may be omitted. Even without it, playback is easier to listen to than with the conventional method, because the average speed ratio of the voice section is set according to the voice content rate. Providing the voice section length calculation unit 15, calculating the voice section length, and setting a speed ratio for each sub-section of the voice section makes playback easier still to listen to.

The speed conversion unit 17 receives the audio signal and performs speed conversion according to the conversion speed ratio determined by the speed ratio determination unit 16. As the speed conversion method, a known method is used, for example one described in "Realization of high-quality voice speed conversion system by DSP" (Suzuki and Misaki, IEICE Speech Research Group Material SP90-34, 1990.8.23) or in Japanese Patent No. 3189587. Such methods allow playback at slow speed ratios of 1x or less and at fast speed ratios of 1x or more. The speed conversion method is not limited to these. Any method may be used, such as synthesizing sound or deleting or inserting sections, as long as the processing satisfies the conversion speed ratio determined by the speed ratio determination unit 16. For example, if the conversion speed ratio is 0.5, it suffices that the output playback time for a given input section is doubled; the sound may be stretched, a silent section may be added, or new sound may be synthesized. In short, any processing in which the input and output for a given section are associated with each other and the conversion speed ratio is satisfied is included as a speed conversion method.
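The relationship between the conversion speed ratio and the output playback time can be illustrated with a short Python sketch. It only computes durations and does not perform any audio processing; the function name and the representation of sub-sections as (duration, ratio) pairs are assumptions made for illustration.

```python
def output_duration(sections):
    """Total output playback time for a list of (input_duration, ratio) pairs.

    Each sub-section of input_duration seconds converted at conversion
    speed ratio `ratio` plays for input_duration / ratio seconds, so a
    ratio of 0.5 doubles the playback time of that sub-section.
    """
    total = 0.0
    for input_duration, ratio in sections:
        if ratio <= 0:
            raise ValueError("conversion speed ratio must be positive")
        total += input_duration / ratio
    return total
```

For instance, `output_duration([(10.0, 0.5)])` corresponds to the case in the text where a ratio of 0.5 doubles the output playback time of a section.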

The processing of the audio reproduction device according to the first embodiment is described below with reference to FIG. 8. FIG. 8 is a flowchart showing the flow of processing of the audio reproduction device according to the first embodiment.

First, when the user gives an instruction on an input device (not shown) to record content, the audio signal and video signal of the content are stored in the storage unit 12. At this time, the voice/non-voice discrimination unit 11 discriminates between voice sections and non-voice sections in the audio signal of the content (step S101). The discrimination result obtained in step S101 and the start and end times of the voice sections are also stored in the storage unit 12.

After step S101, it is determined whether the user has given an instruction on the input device to play back desired content (step S102). If the user has given such an instruction (Yes in step S102), the voice content rate calculation unit 13 calculates the voice content rate of the specified content (step S103).

After step S103, the user sets the target companding ratio on the input device (not shown) (step S104). From the voice content rate calculated in step S103 and the preset speed ratio calculation distribution, the speed ratio condition setting unit 14 determines whether the target companding ratio set in step S104 lies in the unprocessable region (step S105). If the target companding ratio has been set within the unprocessable region (No in step S105), processing returns to step S104, where the speed ratio condition setting unit 14 resets the target companding ratio to a processable value. In the case of FIG. 5, the minimum processable target companding ratio for a voice content rate of 0.5 is 0.375. The speed ratio condition setting unit 14 therefore resets the target companding ratio to 0.375, the value at the nearest region boundary. Instead of being reset automatically by the speed ratio condition setting unit 14, the target companding ratio may be requested from the user again.
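The automatic reset in steps S104 and S105 can be sketched as follows. The sketch assumes Expression (1), which is not reproduced in this section, has the form R = r/Vm1 + (1 - r)/Vm2, so that the smallest processable target companding ratio is obtained at the maximum average speed ratios of FIG. 5 (2 for voice, 4 for non-voice). The function name is hypothetical.

```python
def reset_target_ratio(r, R, v1_max=2.0, v2_max=4.0):
    """Return R unchanged if it is processable, else the nearest boundary.

    Assumption: the minimum processable target companding ratio is
    r/v1_max + (1 - r)/v2_max, following the assumed form of Expression (1).
    Only the lower boundary described in the text is handled here.
    """
    r_min = r / v1_max + (1 - r) / v2_max
    return max(R, r_min)
```

With a voice content rate of 0.5, a requested ratio of 0.3 is reset to 0.375, matching the FIG. 5 example above.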

After step S105, the speed ratio condition setting unit 14 calculates the average speed ratio of the voice section, the average speed ratio of the non-voice section, and the terminal speed ratio of the voice section based on the voice content rate calculated in step S103 and the target companding ratio set in steps S104 and S105, and sets the speed ratio conditions (step S106). The method of calculating the speed ratio conditions is as described above.

After step S106, the voice section length calculation unit 15 receives the start and end times of the voice sections and calculates the voice section lengths (step S107). As shown in FIG. 9 and FIG. 10, voice section lengths vary from short to long even within the same content, and they also differ greatly by genre. FIG. 9 shows the voice section lengths contained in a news program and their frequencies. FIG. 10 shows the voice section lengths contained in a baseball program and their frequencies. In FIG. 9 and FIG. 10, the horizontal axis is the voice section length and the vertical axis is the frequency of occurrence within the program.

In the prior art described above, the speed ratio is set only from the elapsed time since the start of the voice section, without considering the voice section length. For long voice sections, the speed ratio therefore keeps increasing with elapsed time and the speech becomes hard to follow. In the present embodiment, by contrast, the voice section length is taken into account so that, as shown in FIG. 7, the speed ratio at the end of a voice section is the same regardless of the section length. The speed ratio of a voice section thus increases gradually from the start, but only up to a speed ratio at which the end of the section can still be understood, so ease of listening is greatly improved over the prior art.

After step S107, the speed ratio determination unit 16 refers to the start and end times of the voice sections stored in the storage unit 12 and determines, for each predetermined unit time in order from the start of the content, whether that time belongs to a voice section (step S108). If it does, the speed ratio determination unit 16 calculates the elapsed rate within the voice section based on the start and end times of the voice section and the voice section length calculated in step S107 (step S109). The elapsed rate of a voice section is the elapsed time from the start of the section divided by the voice section length, so that it is 0 at the start and 1 at the end of the section.
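The elapsed rate defined in step S109 can be written directly as a small Python function. The argument names and the use of seconds are illustrative assumptions.

```python
def elapsed_rate(t, start, end):
    """Elapsed rate x of time t within a voice section [start, end].

    x is the elapsed time from the start divided by the voice section
    length, so x is 0 at the start and 1 at the end of the section.
    """
    if end <= start:
        raise ValueError("voice section length must be positive")
    if not start <= t <= end:
        raise ValueError("t lies outside the voice section")
    return (t - start) / (end - start)
```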

After step S109, the speed ratio determination unit 16 determines the conversion speed ratio of the voice section from the elapsed rate of the voice section (step S110). A specific processing example of step S110 is described below. One example of calculating the conversion speed ratio is to calculate it so that the sum of the companding ratio change amounts relative to the average companding ratio of the voice section is zero. FIG. 11 shows the change in the companding ratio within a voice section. In FIG. 11, x is the elapsed rate, vx is the conversion companding ratio at elapsed rate x, vs is the start companding ratio, ve is the terminal companding ratio, and vm1 is the average companding ratio. The curve of the companding ratio connecting the start companding ratio vs and the terminal companding ratio ve is expressed by Expression (2).

The average companding ratio vm1 is the reciprocal of the voice-section average speed ratio Vm1. The terminal companding ratio ve is the reciprocal of the terminal speed ratio Vend. The companding ratio change amount means the amount of increase or decrease relative to vm1, taking the average companding ratio vm1 of the voice section as the zero reference (the area of the shaded portions in FIG. 11). For the sum of these amounts to be zero, the conversion companding ratio vx should equal the average companding ratio vm1 at x = 0.5, as shown in FIG. 11. The average companding ratio vm1 is obtained from the voice-section average speed ratio Vm1, and the terminal companding ratio ve is obtained from the terminal speed ratio Vend. The start companding ratio vs is therefore set so as to satisfy Expression (3).

Expressing the conversion companding ratio vx at elapsed rate x, the start companding ratio vs, the terminal companding ratio ve, and the average companding ratio vm1 in terms of speed ratios gives Expression (4). Here, Vx is the conversion speed ratio at elapsed rate x, Vs is the start speed ratio, Vend is the terminal speed ratio described above, and Vm1 is the average speed ratio described above.

Substituting Expressions (3) and (4) into Expression (2) yields Expression (5).

Because the voice section length after speed conversion equals the length obtained by converting the section uniformly at the average speed ratio Vm1, Expression (6) holds.

In step S110, the speed ratio determination unit 16 can thus determine the conversion speed ratio Vx of the voice section by substituting the elapsed rate x of the voice section into Expression (5). The conversion speed ratio calculated in step S110 changes as shown in FIG. 7. That is, the conversion speed ratio can be varied according to the voice section length so that the beginning of the voice section is slow and the speed increases gradually toward the end.
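Expressions (2) to (6) are not reproduced in this text. The Python sketch below is a reconstruction from the surrounding description, not the specification's own formulas: it assumes the companding ratio changes linearly with the elapsed rate, vx = vs + (ve - vs)x, that vx equals vm1 at x = 0.5 so that vs = 2·vm1 - ve, and that each companding ratio is the reciprocal of the corresponding speed ratio. The function name is illustrative.

```python
def conversion_speed_ratio(x, vm1_speed, vend_speed):
    """Conversion speed ratio Vx at elapsed rate x (0 <= x <= 1).

    Assumed reconstruction: companding ratios are reciprocals of speed
    ratios and vary linearly in x, passing through the average
    companding ratio at x = 0.5 and the terminal one at x = 1.
    """
    vm1 = 1.0 / vm1_speed          # average companding ratio
    ve = 1.0 / vend_speed          # terminal companding ratio
    vs = 2.0 * vm1 - ve            # start companding ratio (vx = vm1 at x = 0.5)
    vx = vs + (ve - vs) * x        # linear change from vs to ve
    return 1.0 / vx
```

Under these assumptions the mean of the companding ratio over the section equals vm1, so the converted section length matches the length obtained by uniform conversion at Vm1, which is the property stated for Expression (6).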

If the time is determined in step S108 to belong to a non-voice section, the speed ratio determination unit 16 sets the conversion speed ratio from the start to the end of the non-voice section to the non-voice-section average speed ratio set by the speed ratio condition setting unit 14. That is, as shown in FIG. 7, the speed ratio determination unit 16 determines the conversion speed ratio so that it is constant at the average speed ratio from the start to the end of the non-voice section.

After steps S110 and S111, the speed ratio determination unit 16 determines whether the conversion speed ratio has been calculated up to the end of the content (step S112). If the end has not been reached, processing returns to step S108. The speed ratio determination unit 16 thus repeats steps S108 to S112 until the conversion speed ratio has been calculated up to the end of the content. When it is determined in step S112 that the conversion speed ratio has been calculated up to the end of the content, the speed conversion unit 17 converts the speed of the audio signal according to the conversion speed ratio, and playback of the speed-converted audio signal starts (step S113). The input device (not shown) then accepts an instruction as to whether to end the processing of the device (step S114). If the user wants to perform speed conversion on other content (No in step S114), processing returns to step S102.
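The loop over steps S108 to S112 can be sketched as below. This is an illustrative outline only: the voice-section branch uses an assumed linear companding ratio curve reconstructed from the description of FIG. 11 (the expressions themselves are not reproduced here), the non-voice branch is presumed to be step S111, and the function and argument names are hypothetical.

```python
def conversion_schedule(voice_sections, content_end, unit, vm1, vm2, vend):
    """Return a list of (time, conversion speed ratio) per unit time.

    voice_sections: non-overlapping (start, end) pairs in seconds.
    vm1, vm2: average speed ratios of voice and non-voice sections.
    vend: terminal speed ratio of the voice section.
    """
    comp_m, comp_e = 1.0 / vm1, 1.0 / vend
    comp_s = 2.0 * comp_m - comp_e            # assumed start companding ratio
    schedule = []
    n_units = int(round(content_end / unit))
    for i in range(n_units):                  # repeat until the content end (S112)
        t = i * unit
        ratio = vm2                           # non-voice: constant average ratio
        for start, end in voice_sections:
            if start <= t < end:              # voice section? (S108)
                x = (t - start) / (end - start)                  # elapsed rate (S109)
                ratio = 1.0 / (comp_s + (comp_e - comp_s) * x)   # S110, assumed form
                break
        schedule.append((t, ratio))
    return schedule
```

The resulting list has the shape sketched in FIG. 7: a constant ratio in non-voice sections and a ratio that rises from the start toward the end of each voice section.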

As described above, the audio reproduction device according to the present embodiment calculates the voice content rate of the content and can therefore set speed ratio conditions suited to that content. The speed ratios of the voice and non-voice sections can thus each be set to values suited to the content while still achieving the target companding ratio, that is, the target time. As a result, whatever content the input audio signal comes from, it can be played back at a speed that is easy to follow, with less of the discomfort caused by discontinuities in the playback or by missing information.

In the audio reproduction device according to the present embodiment, the speed ratio determination unit 16 uses a linear function as the function describing the change in the companding ratio, as shown in FIG. 11. That is, the speed ratio for the audio signal input to the device is set along a straight line. Japanese is said to be a mora-timed language, in which speakers tend to give each mora the same length. A mora is a unit of length in speech; in Japanese it corresponds to the beats counted in haiku and tanka, and to one character of kana. Although it would be desirable to change the speed ratio mora by mora, calculating the speed ratio along a straight line for the input audio signal gives sufficiently natural playback without doing so. Furthermore, the speed ratio from the start to the end of a voice section is switched in fine steps according to the linear function. The speed therefore changes at intervals shorter than can be perceived, which gives natural playback with little sense of unnaturalness.

In addition, the audio reproduction device according to the present embodiment sets the speed ratio from the start to the end of a voice section according to the voice section length. The speed ratio at the end time of the voice section therefore never exceeds the preset terminal speed ratio, which prevents the speed from becoming so fast near the end of the voice section that the speech is hard to follow.

Although the change in the companding ratio is represented above by a linear function, as shown in FIG. 11, it may be represented by another function. For example, it may be an exponential function that is convex upward or convex downward. As another example, when the speed ratios that can be used are limited in advance, the conversion may use two or a few speed ratio steps; as long as the speed conversion follows the elapsed rate of the voice section, playback with reduced unnaturalness can still be provided. FIG. 12 shows a case where a two-step conversion speed ratio is calculated. In FIG. 12, the first conversion speed ratio more preferably occupies the first 20 to 30% of the voice section. Listening experiments showed that this gives more natural playback. As yet another example, the speed ratio of the voice section may be constant, like that of the non-voice section. Even in this case, an appropriate speed ratio is set from the voice content rate, so the deletion of non-voice sections and the extreme speed-up used in the prior art are avoided and playback that is easy to listen to can be provided.
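One way to choose the two steps of FIG. 12 is sketched below. The text does not state how the two ratios are calculated, so this sketch makes its own assumption: the second step uses a given ratio (for example the terminal speed ratio), and the first step is solved so that the converted section length equals the length obtained by uniform conversion at the average speed ratio Vm1. The function name and the default fraction of 0.25 are illustrative.

```python
def two_step_ratios(vm1, v_second, first_fraction=0.25):
    """Return (V1, V2) for a two-step conversion of a voice section.

    Assumption: first_fraction/V1 + (1 - first_fraction)/V2 = 1/Vm1,
    i.e. the total converted length matches uniform conversion at Vm1.
    first_fraction is the share of the section covered by the first step
    (the text prefers 0.2 to 0.3).
    """
    if not 0.0 < first_fraction < 1.0:
        raise ValueError("first_fraction must be between 0 and 1")
    remaining = 1.0 / vm1 - (1.0 - first_fraction) / v_second
    if remaining <= 0.0:
        raise ValueError("second-step ratio too slow to reach the average")
    return first_fraction / remaining, v_second
```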

Although the speed ratio condition setting unit 14 is described above as using the speed ratio calculation distribution shown in FIG. 5, it is not limited to this. The user may select desired calculation patterns, set the possible values of the average speed ratios of the voice and non-voice sections to desired values, and so create a speed ratio calculation distribution. That is, the speed ratio calculation distribution used by the speed ratio condition setting unit 14 need not be preset and may be set by the user. For example, in FIG. 5 the voice-section average speed ratio can be as high as 2.0. Elderly listeners and language learners, however, may want to listen at an average speed ratio slower than 2.0. In that case, setting the possible values of the average speed ratios of the voice and non-voice sections so that they do not exceed the average speed ratio the user wants allows playback matched to the user's listening ability. Elderly listeners and language learners may also want to listen at a speed slower than the normal playback speed. To meet requests such as slowing the voice sections below the normal average speed ratio of 1.0 while speeding up the non-voice sections above 1.0 so that playback fits within the normal playback time, or within an even shorter time, the speed ratio calculation distribution of the speed ratio condition setting unit 14 can be switched according to the application.

The speed ratio calculation distribution shown in FIG. 5 may also be prepared in advance for each genre. In this case, which speed ratio calculation distribution to use is selected according to genre information such as the EPG or according to a user instruction. Besides the voice content rate, characteristics such as the intensity of motion in the images differ by genre. For example, in a genre with many still images, such as documentaries, little information is lost through faster images even when the non-voice sections are played back at high speed. Because the non-voice sections can be played back at high speed, the voice sections can be kept close to 1x speed, which makes the program content easier to understand. FIG. 13 shows an example of a speed ratio calculation distribution for a genre with many still images, such as documentaries. As shown in FIG. 13, the speed ratio calculation distribution consists of the regions of calculation patterns a and b. In calculation pattern a, the non-voice-section average speed ratio is set to its maximum value An = 4, and the possible values of the voice-section average speed ratio Vm1 are set to 1 ≤ Vm1 ≤ 2. In calculation pattern b, the voice-section average speed ratio is set to Bs = 1, and the possible values of the non-voice-section average speed ratio Vm2 are set to 1 ≤ Vm2 ≤ 4.

FIG. 14 shows the target companding ratio, the average speed ratio of the voice section, and the average speed ratio of the non-voice section for a voice content rate of 0.5 in the speed ratio calculation distribution shown in FIG. 13. When the target companding ratio is 0.1 or 0.3, it falls in the unprocessable region, as can be seen from the distribution in FIG. 13, so the speed ratios of the voice and non-voice sections are not calculated. When the target companding ratio is 0.4, 0.5, or 0.6, the speed ratios are calculated with calculation pattern a. When the target companding ratio is 0.7, 0.9, or 1, they are calculated with calculation pattern b. As FIG. 14 shows, with the speed ratio calculation distribution of FIG. 13 the calculation consists of only two patterns, so the difference between the voice and non-voice average speed ratios becomes large. In other words, the non-voice sections can be sped up while the voice sections are kept at 1x speed, giving playback in which the program content is easier to understand.
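The two-pattern distribution of FIG. 13 can be sketched in the same way as the FIG. 5 case. The sketch again assumes that Expression (1), which is not reproduced here, has the form R = r/Vm1 + (1 - r)/Vm2; under that assumption the a/b boundary for r = 0.5 falls at 0.625, which is consistent with the pattern assignments listed for FIG. 14. The function name is hypothetical.

```python
def documentary_speed_ratios(r, R, v1_max=2.0, v2_max=4.0):
    """Return (pattern, Vm1, Vm2) for the FIG. 13 style distribution.

    Pattern a: Vm2 fixed at v2_max (An = 4), 1 <= Vm1 <= v1_max.
    Pattern b: Vm1 fixed at 1 (Bs = 1), 1 <= Vm2 <= v2_max.
    Assumption: R = r/Vm1 + (1 - r)/Vm2 with 0 < r < 1.
    """
    if not 0.0 < r < 1.0:
        raise ValueError("voice content rate must be between 0 and 1")
    r_min = r / v1_max + (1 - r) / v2_max    # fastest processable setting
    r_ab = r + (1 - r) / v2_max              # boundary between patterns a and b
    if R < r_min or R > 1.0:
        return None                          # unprocessable region
    if R < r_ab:
        return ("a", r / (R - (1 - r) / v2_max), v2_max)
    return ("b", 1.0, (1 - r) / (R - r))
```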

In a genre with many fast-moving scenes, such as sports, it is better not to make the average speed ratios of speech and non-speech differ greatly. In such a genre, the parts other than speech affect understanding of the program more than they do in a genre with little motion, so the non-speech sections need to stay easy to hear and easy to watch. FIG. 15 shows an example of a speed ratio calculation distribution for a genre with many fast-moving scenes. As shown in FIG. 15, the distribution consists of two regions of calculation pattern a, two regions of calculation pattern b, and one region of calculation pattern c. In the leftmost calculation pattern a, the average speed ratio of the non-speech sections is set to An = 3, and the average speed ratio Vm1 of the speech sections may take values in 1.8 ≦ Vm1 ≦ 2.5. In the calculation pattern b adjacent to it, the average speed ratio of the speech sections is set to Bs = 1.8, and the average speed ratio Vm2 of the non-speech sections may take values in 2.5 ≦ Vm2 ≦ 3. In the calculation pattern a adjacent to that pattern b, An = 2.5 and 1.5 ≦ Vm1 ≦ 1.8. In the calculation pattern b adjacent to that pattern a, Bs = 1.5 and 1.5 ≦ Vm2 ≦ 2.5. In the calculation pattern c adjacent to that pattern b, the average speed ratios of the speech and non-speech sections are set equal, Vm1 = Vm2 (1 ≦ Vm1 ≦ 1.5).

FIG. 16 shows, for the speed ratio calculation distribution of FIG. 15, the target companding ratio, the average speed ratio of the speech sections, and the average speed ratio of the non-speech sections when the speech content rate is 0.5. When the target companding ratio is 0.1 or 0.3, the operating point falls in the unprocessable region of FIG. 15, so no speed ratios are calculated for the speech and non-speech sections. When the target companding ratio is 0.4 or 0.5, the speed ratios are calculated by calculation pattern a. When it is 0.6, they are calculated by calculation pattern b. When it is 0.7, 0.9, or 1, they are calculated by calculation pattern c. As FIG. 15 shows, the calculation pattern switches many times as the target companding ratio changes. Consequently, with the distribution of FIG. 15 no large difference arises between the average speed ratios of speech and non-speech, and the non-speech sections become easier to hear and to watch. In addition, the speed ratio of the non-speech sections is set somewhat lower than in the speed ratio calculation distribution shown in FIG. 6. This further improves how easy the non-speech sections are to hear and to watch in genres where many fast-moving scenes occur in those sections. Preparing a speed ratio calculation distribution for each genre in this way enables more appropriate speed conversion.
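The region boundaries quoted for FIG. 13 and FIG. 15 can be checked numerically. The sketch below assumes that the target companding ratio E, the speech content rate S, and the two average speed ratios satisfy E = S/Vm1 + (1 - S)/Vm2. That relation is not stated in this passage; it is used here because it reproduces the pattern assignments reported for FIG. 14 and FIG. 16 at S = 0.5. The names `solve_speed_ratios`, `FIG13_DOCUMENTARY`, and `FIG15_SPORTS` are illustrative and do not come from the source.

```python
from typing import List, Optional, Tuple

EPS = 1e-9  # tolerance for comparisons at region boundaries

# One tuple per region: (pattern, fixed speed ratio, lower bound, upper bound).
# Pattern "a": the fixed value is An and the bounds apply to Vm1.
# Pattern "b": the fixed value is Bs and the bounds apply to Vm2.
# Pattern "c": nothing is fixed and the bounds apply to Vm1 = Vm2.
Region = Tuple[str, Optional[float], float, float]

FIG13_DOCUMENTARY: List[Region] = [
    ("a", 4.0, 1.0, 2.0),
    ("b", 1.0, 1.0, 4.0),
]
FIG15_SPORTS: List[Region] = [
    ("a", 3.0, 1.8, 2.5),
    ("b", 1.8, 2.5, 3.0),
    ("a", 2.5, 1.5, 1.8),
    ("b", 1.5, 1.5, 2.5),
    ("c", None, 1.0, 1.5),
]


def solve_speed_ratios(
    regions: List[Region], s: float, e: float
) -> Optional[Tuple[str, float, float]]:
    """Return (pattern, Vm1, Vm2), or None if (s, e) is unprocessable.

    Assumes e = s / Vm1 + (1 - s) / Vm2 (see the lead-in paragraph).
    """
    if e <= 0.0:
        return None
    for pattern, fixed, lo, hi in regions:
        if pattern == "a":
            # Non-speech ratio fixed at An; solve the relation for Vm1.
            rest = e - (1.0 - s) / fixed
            if rest <= EPS:
                continue
            vm1, vm2 = s / rest, fixed
            free = vm1
        elif pattern == "b":
            # Speech ratio fixed at Bs; solve the relation for Vm2.
            rest = e - s / fixed
            if rest <= EPS:
                continue
            vm1, vm2 = fixed, (1.0 - s) / rest
            free = vm2
        else:
            # Pattern c: both ratios equal, so each is 1 / e.
            vm1 = vm2 = 1.0 / e
            free = vm1
        if lo - EPS <= free <= hi + EPS:
            return pattern, vm1, vm2
    return None


if __name__ == "__main__":
    for e in (0.1, 0.3, 0.4, 0.5, 0.6, 0.7, 0.9, 1.0):
        print(e, solve_speed_ratios(FIG13_DOCUMENTARY, 0.5, e),
              solve_speed_ratios(FIG15_SPORTS, 0.5, e))
```

Under this assumed relation, the sports distribution at S = 0.5 selects pattern a for E = 0.4 and 0.5, pattern b for E = 0.6, and pattern c for E = 0.7 and above, which matches the description of FIG. 16.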

Speed ratio calculation distributions may be prepared in advance not only for each genre but also according to image information, such as the intensity of motion, or according to acoustic features. Such distributions are useful, for example, when the user wants to control the speed individually for a sound of particular interest, such as music or the voice of a specific person.

The description above uses calculation patterns a to c as the regions that make up a speed ratio calculation distribution, but calculation patterns d and e may also be used depending on the purpose. When a music program is to be played back with emphasis on the music, the music belongs to the non-speech sections, so the speed ratio of the non-speech sections should be kept as low as possible. In exchange, the speed ratio of the speech sections has to be raised. Calculation pattern d described above is therefore suitable for genres such as music programs. This applies not only to music but to any sound the user wants to focus on. When the user searches for non-speech sections in the content, the speech sections should be played back as fast as possible. In that case, calculation pattern e described above is suitable.

The speed ratio condition setting unit 14 may also derive a statistical distribution such as those shown in FIG. 9 and FIG. 10 from the speech section lengths obtained by the speech section length calculation unit 15, and use a speed ratio calculation distribution based on that statistical distribution. Speech section lengths and their frequencies of occurrence characterize the content, so a speed ratio calculation distribution based on their statistics enables speed conversion suited to the attributes of the content. For example, even at the same speech content rate, some content has many short speech sections that occur frequently, while other content has many long speech sections that occur infrequently. In the latter, each speech section carries more information, and the load on the user to understand it is expected to be higher. For such content, a speed ratio calculation distribution is used that preferentially assigns slower speed ratios to the speech sections.

In this way, the speed ratio condition setting unit 14 may set the speed ratio condition using a speed ratio calculation distribution based on the statistical distribution of speech section lengths. This is particularly effective for private content. Unlike broadcast content, private content has often not been edited, so the speech content rate and the speech section lengths vary widely from one item to another. Preparing a variety of speed ratio calculation distributions makes it possible to set an appropriate speed ratio condition even for content, such as private content, whose speech section lengths and speech content rates vary widely.

The description above covers only target companding ratios between 0 and 1. A speed ratio condition can be set in the same way for target companding ratios of 1 or more, as in slow listening or slow viewing where the playback time after speed control is equal to or longer than the normal playback time, by preparing a speed ratio calculation distribution in advance. Output control for a pronunciation practice mode is also possible, in which each speech section is followed by a non-speech section of the same length to prompt the user to practice pronunciation. For example, if the average speed ratio of a non-speech section in the pronunciation practice mode is determined by the length of the immediately preceding speech section, the speed ratios of the speech and non-speech sections can be calculated from Equations (7) and (8).

Here, S is the speech content rate, Vm1 is the average speed ratio of the speech sections, Vm2 is the average speed ratio of the non-speech sections, and E is the target companding ratio. M(n) is the length of the n-th speech section, and N(n) is the length of the non-speech section that follows the n-th speech section. Because each speech section is followed by a non-speech section of the same length, the average speed ratio Vm1 of the speech sections is given by Equation (7). Pronunciation practice requires a non-speech section as long as the speech section, so the speed ratio of the following non-speech section has to be calculated from the speech section length, which gives Equation (8). Thus, even in a pronunciation practice mode that provides non-speech sections equal in length to the speech sections and finishes playback within a fixed time, an appropriate speed ratio can be set by using the speech content rate and the speech section lengths. Although the non-speech section is made as long as the speech section here, its length may be changed according to the learner's progress. In this way, speech can be presented at a speed suited to pronunciation practice by using the speech content rate and the speech section lengths, without creating new content for language learning. Furthermore, by specifying at the outset the time to be spent on learning, the companding ratio can be calculated from the content length and the speed can be controlled so that playback fits within the learning time. The speed ratio can also be changed according to the learner's level.
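Equations (7) and (8) are not reproduced in this text. The derivation below is one reading that is consistent with the definitions above, offered as a sketch rather than as the published equations. It assumes that Q denotes the total content length, so that the total speech length is SQ and the total playback time is EQ, and that each non-speech section is played back for exactly as long as the preceding speech section.

```latex
% Assumed reconstruction, not the published Equations (7) and (8).
% Total playback time = speech time + equally long practice time:
\begin{align}
  \frac{SQ}{V_{m1}} + \frac{SQ}{V_{m1}} = EQ
  \quad &\Longrightarrow \quad
  V_{m1} = \frac{2S}{E} \\
% The n-th non-speech section lasts as long as the n-th speech section:
  \frac{N(n)}{V_{m2}(n)} = \frac{M(n)}{V_{m1}}
  \quad &\Longrightarrow \quad
  V_{m2}(n) = V_{m1}\,\frac{N(n)}{M(n)}
\end{align}
```

Under this reading, Vm1 depends only on the speech content rate and the target companding ratio, while the non-speech speed ratio varies from section to section with N(n)/M(n).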

Another possible use is a transcription mode for writing down speech. This can likewise be handled by changing the speed ratio calculation distribution of the speed ratio condition setting unit 14. To transcribe speech, the playback speed must be matched to each user's writing ability, for example how many characters the user can write on paper in a given time, or how many keystrokes the user can type in a given time. If playback is faster than the user can write, the playback position soon overtakes the writing position and material ahead of it is played, so operations such as pause and rewind become necessary. Those operations also interrupt the writing itself, so time is wasted twice over. If playback is slower than the user can write, the user is never overtaken, but has to wait after writing until the next speech begins, which also wastes time. Speech should therefore be played back at a speed matched to the user's writing ability. With conventional methods the speech section lengths and the speech content rate are unknown, so non-speech sections are played back at the same speed as speech sections, and it cannot be known in advance how long playback will take when the speed is reduced. If the speed ratio condition setting unit 14 performs the following processing, playback is possible in which the speech sections are easy to follow and the non-speech sections are omitted. The average speed ratio of the speech sections is given by Equation (9).

Here, S is the speech content rate, Vm1 is the average speed ratio of the speech sections, Vm2 is the average speed ratio of the non-speech sections, and E is the target companding ratio. U is the total number of speech sections in the content, and Q is the total length of the content. P is a fixed time provided after each speech section is played, for transcription tasks such as moving to a new line. Depending on the task, P may be 0. Because the non-speech sections are not played back, their speed ratio Vm2 is given by Equation (10). If the average speed ratio Vm1 of the speech sections is set to match the user's writing speed, transcription can proceed without device operations such as pause, fast-forward, or rewind. By setting the average speed ratios of the speech and non-speech sections in the speed ratio condition setting unit 14 in this way, a transcription mode becomes possible that plays back only the speech sections, at a speed ratio matched to the user's writing ability.
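Equations (9) and (10) are likewise not reproduced in this text. The following is a hedged reconstruction from the definitions above, not the published equations. It assumes that the total playback time EQ consists of the speech played at Vm1 plus a fixed time P after each of the U speech sections, and that skipping the non-speech sections corresponds to an unbounded speed ratio.

```latex
% Assumed reconstruction, not the published Equations (9) and (10).
\begin{align}
  \frac{SQ}{V_{m1}} + UP = EQ
  \quad &\Longrightarrow \quad
  V_{m1} = \frac{SQ}{EQ - UP}, \qquad EQ > UP \\
  V_{m2} &\to \infty \quad \text{(non-speech sections are skipped)}
\end{align}
```

Read the other way around, fixing Vm1 to the user's writing speed gives the total time EQ = SQ/Vm1 + UP that the transcription session would take.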

Control for listening practice is also possible: speech sections are played back slowly at the start of learning and the speed ratio is raised gradually as learning time elapses, while the speeds of the speech and non-speech sections are controlled so that the total learning time equals a predetermined time. For example, within a single learning session, the speech sections can be played back more slowly at the start of the session than at the end, with the playback speed raised gradually. Over long-term learning, the playback speed of the speech sections in the current session may be controlled based on the time elapsed since learning first began, the previous operation history, and the like. The duration of the whole content can thus be adjusted by changing the playback speeds of the speech and non-speech sections, by adding pauses, silence, or audio segments immediately before or after speech sections, or by inserting screens.

(Second Embodiment)

An audio reproduction device according to a second embodiment of the present invention will be described with reference to FIG. 17. FIG. 17 is a block diagram showing a configuration example of the audio reproduction device according to the second embodiment. In FIG. 17, the device comprises a speech/non-speech discrimination unit 11, a speech content rate prediction unit 18, a speed ratio condition setting unit 14, a speech section length prediction unit 19, a speed ratio determination unit 16, a speed conversion unit 17, and a companding ratio calculation unit 20. This embodiment differs from the audio reproduction device of the first embodiment in that playback with a changed reproduction speed is performed in real time. The description below focuses on the differences.

The speech content rate prediction unit 18 calculates the speech content rate over the past several minutes up to the present from the frame-by-frame discrimination results output by the speech/non-speech discrimination unit 11. It then uses the calculated rate to predict the speech content rate of the section one ahead of the present. Let X(z-1) be the speech content rate over the past several minutes up to the present. The predicted value Y(z) of the speech content rate is then given by Equation (11), where Y(z-1) is the predicted value for the preceding section. The predicted value of the speech content rate is referred to below as the predicted speech content rate. α is a value between 0 and 1, and its optimum value was determined by simulation.

In Equation (11), the initial value X(0) = Y(0) is the average speech content rate of typical content. Until the speech content rate X(z) has been calculated, the predicted speech content rate Y(z) is held at Y(0). FIG. 18 schematically shows how the predicted speech content rate Y(z) of Equation (11) is obtained. As shown in FIG. 18, the speech content rate over the past several minutes is calculated while the window advances frame by frame, and the predicted speech content rates Y(z) and Y(z-1) are predicted for each section. As FIG. 18 also shows, the speech content rate prediction unit 18 makes its prediction from the predicted speech content rate Y(z-1) of the preceding section and the speech content rate X(z-1) over the past several minutes.
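Equation (11) itself is not reproduced in this text. The description (a coefficient α between 0 and 1, a prediction Y(z) formed from X(z-1) and Y(z-1), and an initial value Y(0) held until a measurement exists) fits a first-order exponential smoothing update, so the sketch below assumes Y(z) = α·X(z-1) + (1 - α)·Y(z-1). Which of the two terms α weights is an assumption, and the class name `SpeechContentPredictor` is hypothetical.

```python
from typing import Optional


class SpeechContentPredictor:
    """Section-by-section predictor for the speech content rate.

    Assumed update (not confirmed by the source text):
        Y(z) = alpha * X(z-1) + (1 - alpha) * Y(z-1)
    """

    def __init__(self, alpha: float, initial_rate: float) -> None:
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
        self.alpha = alpha
        self.predicted = initial_rate  # Y(0): average rate of typical content

    def update(self, measured: Optional[float]) -> float:
        """Advance one section and return the new prediction Y(z).

        `measured` is X(z-1), the speech content rate over the past
        several minutes, or None while that window is not yet available;
        in that case the prediction is held at its current value.
        """
        if measured is not None:
            self.predicted = (
                self.alpha * measured + (1.0 - self.alpha) * self.predicted
            )
        return self.predicted


if __name__ == "__main__":
    predictor = SpeechContentPredictor(alpha=0.5, initial_rate=0.6)
    for x in (None, 0.8, 0.8, 0.4):
        print(predictor.update(x))
```

A larger α makes the prediction follow recent measurements more closely; a smaller α keeps it nearer to the long-run value.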

If the window over which the speech content rate X(z) is calculated is too short, the rate becomes 1 whenever a long speech section fills the window. Conversely, the window may capture only a short pause between utterances, giving a rate close to 0. The rate can thus take extreme values. If the window is too long, the rate is smoothed out and is no longer useful for prediction. The window therefore has to be long enough to reflect how the speech sections of the content cluster, and experiments showed that 1 minute or more is desirable. This is why the past several minutes are used above. FIG. 19 shows an example of the calculated speech content rate X(z) and predicted speech content rate Y(z). The vertical axis of the graph is the speech content rate and the horizontal axis is the elapsed time of the program. FIG. 19 plots, for a 30-minute news program, the speech content rate X(z) and the predicted speech content rate Y(z) calculated by Equation (11). As FIG. 19 shows, the predicted speech content rate Y(z) tracks the actual speech content rate X(z) closely.

The companding ratio calculation unit 20 calculates the companding ratio section by section. Specifically, it first calculates the companding ratio at the current time t, which is obtained by dividing the amount of data output from the speed conversion unit 17 up to time t by the amount of data input to the speed conversion unit 17. Next, the companding ratio calculation unit 20 determines whether the current time t has reached a section boundary. When a section boundary is reached, it calculates the companding ratio for the next section (referred to below as the section companding ratio) so that the speed ratio condition for the next section can be set. The section companding ratio specifies the companding ratio at which the next section is to be converted. The companding ratio calculation unit 20 calculates it with Equation (12) from the target companding ratio set in advance by the user or the device and the companding ratio at the current time t. In Equation (12), Rt is the target companding ratio, R(t) is the companding ratio at the current time t, T(z) is the time length of the next section, and Rs(z) is the companding ratio of the next section.
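Equation (12) is not reproduced in this text. As an illustration only, the sketch below assumes the simplest rule that is consistent with the listed symbols: choose Rs(z) so that, if the next section of input length T(z) is converted at exactly that ratio, the cumulative companding ratio returns to the target Rt. The function name and the use of elapsed input time as the measure of "data amount" are assumptions.

```python
def section_companding_ratio(
    target: float, current: float, elapsed: float, next_length: float
) -> float:
    """Return an assumed section companding ratio Rs(z).

    target      -- Rt, the overall target companding ratio
    current     -- R(t), output amount / input amount up to time t
    elapsed     -- input time consumed so far (t)
    next_length -- T(z), input time length of the next section

    Assumed rule (not the published Equation (12)):
        Rs(z) = (Rt * (t + T(z)) - R(t) * t) / T(z)
    """
    if next_length <= 0.0:
        raise ValueError("next_length must be positive")
    planned_output = target * (elapsed + next_length)  # output due by section end
    produced_output = current * elapsed                # output produced so far
    return (planned_output - produced_output) / next_length


if __name__ == "__main__":
    # Playback has run slow so far (0.6 instead of 0.5), so the next
    # section is asked to run faster than the target.
    print(section_companding_ratio(0.5, 0.6, elapsed=100.0, next_length=50.0))
```

This rule tries to remove the whole accumulated error in one section; the returned value can fall outside the range a speed ratio calculation distribution can handle, in which case some limiting step, not described here, would be needed and the remaining error would carry over to later sections.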

The speed ratio condition setting unit 14 receives the predicted speech content rate calculated by the speech content rate prediction unit 18 and the section companding ratio calculated by the companding ratio calculation unit 20. It sets the speed ratio condition for each section by the same method as the speed ratio condition setting unit 14 described in the first embodiment.

When such sequential processing is performed with the conventional technique, a local concentration of speech sections causes the device to try to meet the playback time at the very place where the concentration occurs, so extreme deletion or acceleration of non-speech sections occurs locally. In this embodiment, by contrast, the section companding ratio is calculated section by section, and the speed ratio condition setting unit 14 sets the speed ratio condition section by section. That is, the speed ratio condition is updated for each section according to the section companding ratio. As a result, even if the playback time is not met where the concentration occurs, the shortfall can be carried over to subsequent sections, so sequential processing can achieve the target time while keeping playback easy to hear.

If the predicted speech content rate is used as it is, increases and decreases in the predicted speech content rate translate directly into increases and decreases in the average speed ratio. A higher average speed ratio in portions with a high predicted speech content rate may impair intelligibility, because, assuming the same speaking rate, a higher speech content rate means more information and therefore text that is harder to follow. The predicted speech content rate may therefore be adjusted as follows, so that the average speed ratio falls when the predicted speech content rate rises and rises when the predicted speech content rate falls.

W(z) is the adjusted predicted speech content rate in section Z, referred to below as the adjusted speech content rate. W(z-1) is the adjusted speech content rate in the section immediately before section Z. Y(z) is the predicted speech content rate in section Z, and Y(z-1) is the predicted speech content rate in the preceding section. γ is the coefficient applied in the addition.

When the speed ratio condition is set using this adjusted speech content rate, the average speed ratio of the speech sections falls if the predicted speech content rate is higher than in the preceding section. If the predicted speech content rate is the same as in the preceding section, the average speed ratio of the speech sections does not change.

Using the adjusted speech content rate makes it possible to slow down sections with a high speech content rate, so the speed condition can be set according to the amount of information. This adjustment may prevent the section companding ratio from being met and so introduce an error, but, as Equation (12) shows, the error can be removed in subsequent sections, and the target companding ratio can still be achieved. For example, as shown in FIG. 19, the predicted speech content rate is not constant; it is high in some places and low in others. Even if playback slows down in a section with a high predicted speech content rate and the gap from the target companding ratio widens, the gap can be closed in later sections where the predicted speech content rate is low.

The speech section length prediction unit 19 calculates speech section lengths from the past discrimination results of the speech/non-speech discrimination unit 11 and predicts the speech section length. The predicted value is not a prediction of each individual speech section length; instead, the average speech section length for a spoken sentence is predicted as a representative value. Individual speech section lengths are hard to predict because they depend on many factors, such as speaker changes and the content of the conversation. The average speech section length is therefore used as the predicted value to control the speed ratio of speech sections in a way suited to the content. As shown in FIG. 9 and FIG. 10, the distribution of speech section lengths differs greatly between genres. FIG. 9 shows the frequency of each speech section length in a news program: it peaks at 500 ms and declines gradually out to around 4000 ms. FIG. 10 shows the same for a baseball program: it also peaks at 500 ms, but the frequency falls off sharply.

Because of this difference in how the frequency declines, a news program gives the impression of a succession of longer speech sections and a baseball program that of a succession of short ones. If the speed ratio of speech sections is controlled with a fixed length instead of a predicted speech section length, the speed ratio may become slower or faster than necessary. This is because the speech/non-speech discrimination unit 11 discriminates speech from non-speech sequentially, so the end time of a speech section cannot be known in advance. Moreover, differences in speech section length arise from differences in speakers and conversation content, and so vary from one item of content to another. For these reasons, the speech section length has to be predicted.

Accordingly, the predicted value L(n) of the length of the n-th speech segment counted from the start of processing (n is a natural number) is expressed by Equation (14), using the predicted value L(n-1) and the measured value M(n-1) of the length of the preceding (n-1)-th speech segment.

Speech segments include backchannel responses such as "hai" and "un" (roughly "yes" and "uh-huh") and fillers such as "ee" (roughly "er"). These are linguistically easy to understand, so they remain easy to hear regardless of the speed ratio. For this reason, only speech segments whose length is equal to or greater than a predetermined threshold are used to calculate the predicted value L(n) in Equation (14). Here, 1000 ms is adopted as an example of the threshold.

FIG. 20 shows the speech segment lengths predicted with Equation (14). In FIG. 20, the vertical axis represents the speech segment length and the horizontal axis represents the elapsed time. Each speech segment length is plotted at the start time of its segment on the horizontal axis. FIG. 20 shows the measured speech segment length M(n), the measured length of the immediately preceding segment M(n-1), and the predicted speech segment length L(n). By calculating the predicted length from speech segment lengths that exclude backchannel responses and fillers, a speed ratio can be set that suits long speech segments, which become hard to follow at high speed ratios. In addition, because the speech segment length is calculated sequentially according to the content, the speed ratio can be set to reflect differences in the distribution of speech segment lengths. The speed ratio determination unit 16 determines the conversion speed ratios of speech and non-speech segments based on the sequential speed ratio conditions and the predicted speech segment length.
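The sequential prediction described above can be sketched in Python. Equation (14) itself is not reproduced in this text, so the update rule below is only an assumption: a weighted average of L(n-1) and M(n-1) with a hypothetical weight `alpha`. The 1000 ms threshold and the inputs L(n-1) and M(n-1) come from the description; the function names and the starting value `initial_ms` are illustrative.

```python
MIN_SEGMENT_MS = 1000.0  # example threshold from the description


def predict_segment_length(prev_prediction_ms, prev_measured_ms, alpha=0.9):
    """Return L(n) from L(n-1) and M(n-1).

    ASSUMPTION: Equation (14) is not shown in the text; a weighted average
    with a hypothetical weight `alpha` stands in for it here.
    """
    if prev_measured_ms < MIN_SEGMENT_MS:
        # Short segments (backchannels, fillers) do not update the prediction.
        return prev_prediction_ms
    return alpha * prev_prediction_ms + (1.0 - alpha) * prev_measured_ms


def run_predictions(measured_ms, initial_ms=2000.0):
    """Return [L(1), L(2), ...] for measured lengths [M(1), M(2), ...].

    `initial_ms` is a hypothetical starting value for L(1).
    """
    prediction = initial_ms
    predictions = []
    for measured in measured_ms:
        predictions.append(prediction)  # L(n) is fixed before M(n) is known
        prediction = predict_segment_length(prediction, measured)
    return predictions
```

Under this assumed rule, a 300 ms backchannel leaves the prediction unchanged, while a 3000 ms segment pulls the next prediction toward 3000 ms.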

The processing of the audio reproduction device according to the second embodiment is described below with reference to FIG. 21. FIG. 21 is a flowchart showing the flow of this processing.

First, it is determined whether the input device (not shown) has received a user instruction to reproduce content (step S201). When the user gives the instruction, an audio signal is input to the speech/non-speech discrimination unit 11, which discriminates speech segments from non-speech segments frame by frame in the audio signal of the input content (step S202).

After step S202, the speech content rate prediction unit 18 uses the frame-by-frame discrimination results output by the speech/non-speech discrimination unit 11 to calculate the speech content rate over the past several minutes up to the present, and uses the calculated value to predict the speech content rate of the section one ahead of the present (step S203).

After step S203, the speed ratio condition setting unit 14 receives the predicted speech content rate calculated by the speech content rate prediction unit 18 and the section companding ratio calculated by the companding ratio calculation unit 20, and sets the speed ratio conditions for each section (step S204). The speech segment length prediction unit 19 calculates speech segment lengths from the past discrimination results of the speech/non-speech discrimination unit 11 and predicts the speech segment length (step S205).

After step S205, the speed ratio determination unit 16 refers to the start and end times of speech segments output by the speech/non-speech discrimination unit 11 and determines, for each predetermined unit of time, whether the current position lies in a speech segment (step S206). If it does, the speed ratio determination unit 16 calculates the elapsed ratio within the speech segment (step S207). The elapsed ratio is the time elapsed since the start of the segment divided by the speech segment length, so that the start of the segment corresponds to 0 and the end to 1. In this embodiment, speech/non-speech discrimination is performed sequentially, so the start time of a speech segment is known but its end time is not yet known. The predicted speech segment length from the speech segment length prediction unit 19 is therefore used as the speech segment length, which allows the speed ratio determination unit 16 to calculate the elapsed ratio.
Because the predicted length is used rather than the actual one, the calculated elapsed ratio does not necessarily match the true elapsed ratio of the segment: the segment may end while the elapsed ratio is still 1 or less, and the elapsed ratio may exceed 1 while the segment is still continuing. The speed ratio determination unit 16 therefore checks whether the elapsed ratio exceeds 1 (step S208). If it does not, the unit determines the conversion speed ratio of the speech segment from the elapsed ratio (step S209). Step S209 is the same as the processing described for the speed ratio determination unit 16 in the first embodiment, so its description is omitted. If the elapsed ratio exceeds 1 in step S208, the process proceeds to step S210, where the terminal speed ratio given by the speed ratio conditions is used as the conversion speed ratio. In this case the predicted speech segment length has been exceeded, so the terminal speed ratio of the speech segment must be used as the conversion speed ratio.
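Steps S207 to S210 can be sketched in Python as follows. The division by the predicted length and the switch to the terminal speed ratio once the elapsed ratio exceeds 1 follow the description. The mapping used in step S209 belongs to the first embodiment and is not reproduced here, so a linear ramp from a hypothetical `start_ratio` to `terminal_ratio` is assumed purely for illustration; all names are hypothetical.

```python
def elapsed_ratio(now_ms, start_ms, predicted_length_ms):
    """Time elapsed since the segment start divided by the predicted length."""
    return (now_ms - start_ms) / predicted_length_ms


def speech_speed_ratio(now_ms, start_ms, predicted_length_ms,
                       start_ratio, terminal_ratio):
    """Conversion speed ratio inside a speech segment (steps S207 to S210).

    ASSUMPTION: the step S209 mapping is not given in this text; a linear
    ramp from `start_ratio` to `terminal_ratio` stands in for it.
    """
    ratio = elapsed_ratio(now_ms, start_ms, predicted_length_ms)
    if ratio > 1.0:
        # Step S210: the predicted length has been exceeded, so hold the
        # terminal speed ratio until the segment actually ends.
        return terminal_ratio
    # Step S209 (stand-in): interpolate along the elapsed ratio.
    return start_ratio + (terminal_ratio - start_ratio) * ratio
```

With a predicted length of 1000 ms, a position 1500 ms into the segment gives an elapsed ratio of 1.5, so the terminal speed ratio is returned unchanged.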

If step S206 determines that the current position lies in a non-speech segment, the speed ratio determination unit 16 determines the conversion speed ratio of the non-speech segment (step S211). After steps S209, S210, and S211, the speed ratio determination unit 16 determines whether conversion speed ratios have been determined up to the end time of the content to be speed-converted (step S212). If the end time has not been reached, the process returns to step S202. In this way, the processing of steps S202 to S212 is repeated section by section until conversion speed ratios have been calculated up to the end time of the content. When step S212 determines that conversion speed ratios have been calculated up to the end time of the content, the input device (not shown) accepts an instruction indicating whether to end the processing of the device (step S213). If the user performs speed conversion on other content (No in step S213), the process returns to step S202.

As described above, the audio reproduction device according to this embodiment can perform speed conversion while processing in real time. Unlike the first embodiment, it uses predicted values for the speech content rate and the speech segment length. This introduces errors relative to the measured values, but these errors are compensated for by the section companding ratio set by the companding ratio setting unit 28. As a result, the device can perform speed conversion in real time and achieve the target companding ratio without deleting segments or speeding up speech segments excessively.

(Third embodiment)
An audio reproduction device according to the third embodiment of the present invention is described with reference to FIG. 22. FIG. 22 is a block diagram showing a configuration example of the audio reproduction device according to the third embodiment. In FIG. 22, the device comprises the speech/non-speech discrimination unit 11, the speech content rate prediction unit 18, the speed ratio condition setting unit 14, the speech segment length prediction unit 19, the speed ratio determination unit 16, the speed conversion unit 17, the companding ratio calculation unit 20, and a statistic calculation unit 21. Compared with the device of the second embodiment, this embodiment adds the statistic calculation unit 21 and changes the processing of the speed ratio condition setting unit 14. The description below focuses on the statistic calculation unit 21 and the speed ratio condition setting unit 14.

The statistic calculation unit 21 calculates a statistic used to correct the upper-limit speed ratio of speech segments. For example, it uses the speech content rate from the start of the content up to the present, referred to below as the long-term speech content rate. FIG. 23 shows how the long-term speech content rate of each content changes over time. In FIG. 23, the vertical axis represents the long-term speech content rate and the horizontal axis represents the elapsed time from the start (0 minutes). FIG. 24 shows the predicted speech content rate of each content. The predicted speech content rate is the same as that described in the second embodiment, and the calculation interval here is 1 minute. The graph of the predicted speech content rate reflects where speech segments are dense or sparse, and so has clear peaks and valleys. The graph of the long-term speech content rate fluctuates somewhat near the start but is otherwise nearly flat, and is close to the speech content rate of the entire content used in the first embodiment.
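As an illustration of why the long-term speech content rate flattens out, the following Python sketch computes it cumulatively from frame-by-frame discrimination results. It assumes equal-length frames and boolean per-frame decisions, neither of which is specified in the description; the function name is hypothetical.

```python
def long_term_content_rates(frame_is_speech):
    """Cumulative speech content rate after each frame.

    ASSUMPTION: frames have equal length, so the rate is simply the number
    of speech frames so far divided by the number of frames so far.
    """
    rates = []
    speech_frames = 0
    for frame_count, is_speech in enumerate(frame_is_speech, start=1):
        if is_speech:
            speech_frames += 1
        rates.append(speech_frames / frame_count)
    return rates
```

Each new frame changes the rate by at most 1/frame_count, which matches the described behaviour: some fluctuation near the start of the content, then a nearly flat curve.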

The long-term speech content rate is therefore used to correct the upper-limit speed ratio of speech segments. By correcting the upper-limit speed ratio sequentially, the speed ratio of speech segments can be kept from rising too far even when the predicted speech content rate becomes locally high.

The speed ratio condition setting unit 14 receives as input the predicted speech content rate calculated by the speech content rate prediction unit 18, the long-term speech content rate calculated by the statistic calculation unit 21, and the section companding ratio calculated by the companding ratio calculation unit 20. In step S204 of FIG. 21 described above, the speed ratio condition setting unit 14 performs the processing shown in FIG. 25. FIG. 25 is a flowchart showing the processing of the speed ratio condition setting unit 14 according to the third embodiment.

In FIG. 25, the speed ratio condition setting unit 14 determines whether the input speech content rate is the predicted speech content rate (step S301). If it is, the process proceeds to step S302, where the unit calculates the average speed ratios of speech and non-speech segments from the predicted speech content rate. The unit then calculates the terminal speed ratio from the calculated average speed ratio of speech segments (step S303). The terminal speed ratio based on the predicted speech content rate is denoted Vend1. Steps S302 and S303 are the same as the corresponding processing in the first embodiment.

If the predicted speech content rate is not the input in step S301, that is, if the long-term speech content rate is input, the speed ratio condition setting unit 14 calculates the average speed ratios of speech and non-speech segments from the long-term speech content rate (step S304). The unit then calculates the terminal speed ratio from the calculated average speed ratio of speech segments (step S305). The terminal speed ratio based on the long-term speech content rate is denoted Vend2. Steps S304 and S305 are the same as the corresponding processing in the first embodiment.

After step S305, the speed ratio condition setting unit 14 sets the terminal speed ratio Vend2 of speech segments, calculated from the long-term speech content rate, as the upper-limit speed ratio (step S306). After steps S303 and S306, the unit compares the terminal speed ratio Vend1 based on the predicted speech content rate with the upper-limit speed ratio Vend2 based on the long-term speech content rate (step S307). If Vend1 exceeds Vend2 (Vend1 > Vend2), the unit corrects the terminal speed ratio Vend1 to the upper-limit speed ratio Vend2 (step S308). Together with this correction, the unit also corrects the average speed ratio of speech segments to the value calculated from the long-term speech content rate. In other words, of the three speed ratios that make up the speed ratio conditions (the average speed ratios of speech and non-speech segments and the terminal speed ratio of speech segments), only the average speed ratio of non-speech segments keeps the value calculated from the predicted speech content rate. The average speed ratio and terminal speed ratio of speech segments take the values obtained from the long-term speech content rate.
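The correction in steps S307 and S308 can be sketched in Python as follows. The sketch takes the two sets of speed ratio conditions as already calculated (the first-embodiment formulas behind steps S302 to S305 are not reproduced here); the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SpeedRatioCondition:
    speech_avg: float       # average speed ratio of speech segments
    nonspeech_avg: float    # average speed ratio of non-speech segments
    speech_terminal: float  # terminal speed ratio of speech segments (Vend)


def apply_upper_limit(predicted, long_term):
    """Steps S307 and S308: cap Vend1 by the upper-limit speed ratio Vend2.

    `predicted` holds the conditions from the predicted speech content rate,
    `long_term` those from the long-term speech content rate.
    """
    if predicted.speech_terminal > long_term.speech_terminal:
        # Both speech-segment ratios come from the long-term rate; only the
        # non-speech average keeps the value from the predicted rate.
        return SpeedRatioCondition(
            speech_avg=long_term.speech_avg,
            nonspeech_avg=predicted.nonspeech_avg,
            speech_terminal=long_term.speech_terminal,
        )
    return predicted
```

Returning the predicted conditions unchanged when Vend1 does not exceed Vend2 reflects that the description specifies a correction only for the Vend1 > Vend2 case.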

By sequentially correcting the average and terminal speed ratios of speech segments in this way, playback can be kept at an easy-to-follow speed ratio even when a locally high predicted speech content rate would otherwise set the average speed ratio of speech segments too high.

The speech segment length prediction unit 19 calculates the predicted speech segment length from the past discrimination results of the speech/non-speech discrimination unit 11. In this embodiment, the maximum speech segment length is used as the predicted speech segment length. This places priority on intelligibility: speed ratio control can then be applied through to the end of every speech segment, however long it is.

As shown in FIG. 26, the distribution of speech segment lengths varies greatly with elapsed time, so the speech segment length must be predicted. FIG. 26 shows the distribution of the measured speech segment length, the measured length of the immediately preceding segment, and the predicted speech segment length. In FIG. 26, the vertical axis represents the speech segment length and the horizontal axis represents the elapsed time from the start of the content. Each speech segment length is plotted at the start time of its segment. Let M(n) be the measured length of the n-th speech segment. The predicted speech segment length Lm(n) is expressed by Equations (15) to (17).

In Equation (15), max is the value obtained by taking the longest speech segment length in each content and averaging it over a plurality of contents. If genre information is available in advance, this average is calculated for each genre and held in a table.

β is a value set so that the predicted speech segment length Lm decreases with the time that elapses until the next speech segment. Taking the start time of the (n-1)-th speech segment as 0 and the start time of the n-th speech segment as t, β is expressed by Equation (18), where k is a positive value.

β may instead be an exponential function; any decreasing function of the elapsed time t will do.

The predicted speech segment lengths obtained with Equations (15) to (18) are shown in FIG. 26. As FIG. 26 shows, the predicted length is longer than the measured length for many segments. Using this value when calculating the speed ratio reduces the proportion of each speech segment that is converted at the terminal speed ratio near its end, which further lowers the average speed ratio of speech segments. As a result, playback that is easy to follow can be provided.

(Fourth embodiment)
An audio reproduction device according to the fourth embodiment of the present invention is described with reference to FIG. 27. FIG. 27 is a block diagram showing a configuration example of the audio reproduction device according to the fourth embodiment. In FIG. 27, the device comprises the speech/non-speech discrimination unit 11, a temporary storage unit 22, the speech content rate calculation unit 13, the speed ratio condition setting unit 14, the speech segment length calculation unit 15, the speed ratio determination unit 16, the speed conversion unit 17, and the companding ratio calculation unit 20. This embodiment differs from the device of the first embodiment in that it includes the temporary storage unit 22, which stores less than the storage unit 12, and the companding ratio calculation unit 20 described in the second embodiment. The description below focuses on these differences.

The temporary storage unit 22 is a readable and writable recording medium such as a hard disk, a DVD, or a memory medium (for example, an SD card). It stores one section or several sections of the same audio signal that is input to the speech/non-speech discrimination unit 11. After the stored audio signal has been speed-converted, the audio signal of the converted section is erased and the audio signal of a new section is stored. In this embodiment, a section need not be an interval delimited at a predetermined spacing; it may also be an interval delimited by a predetermined event. For example, if the event is a commercial (CM), there are two types of section: commercial sections and program sections lying between commercials. If the event is music, there are two types of section: music sections and sections lying between pieces of music. A section may also be an interval designated by the user.

As in the first embodiment, when the audio and video signals that make up one section are stored in the temporary storage unit 22, the speech/non-speech discrimination unit 11 performs its discrimination processing, and the discrimination results and the start and end times of the speech segments in that section are also stored in the temporary storage unit 22. The temporary storage unit 22 stores the audio and video signals in association with the discrimination results and the start and end times of the speech segments. The audio and video signals may be in any format. The audio reproduction device according to this embodiment may additionally include the storage unit 12 described above. In that case, the audio signals and speech/non-speech discrimination results stored per content in the storage unit 12 are read out section by section and stored in the temporary storage unit 22.

Providing the temporary storage unit 22 makes it possible to calculate the actual (measured) speech segment lengths within the stored section, which allows the speed to be controlled to match the actual speech segments. The actual speech content rate of the stored section can also be obtained. Compared with the predicted speech content rate described in the second embodiment, this value shows less local variation and is closer to the speech content rate of the entire content.

The speech content rate calculation unit 13 calculates the speech content rate of a section from the discrimination results and the start and end times of speech segments stored in the temporary storage unit 22. The speech content rate in this embodiment is the sum of the lengths of the speech segments contained in the section divided by the time length of the whole section (referred to below as the section length).
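The section speech content rate defined above can be written as a short Python sketch. Segment boundaries are given as (start, end) pairs in milliseconds; clipping segments that straddle a section boundary is an assumption of this sketch, since the description does not say how such segments are handled, and the function name is hypothetical.

```python
def section_content_rate(segments_ms, section_start_ms, section_end_ms):
    """Sum of speech segment lengths in the section divided by section length."""
    section_length = section_end_ms - section_start_ms
    if section_length <= 0:
        raise ValueError("section must have positive length")
    speech_total = 0.0
    for start, end in segments_ms:
        # ASSUMPTION: count only the part of each segment inside the section.
        clipped = min(end, section_end_ms) - max(start, section_start_ms)
        if clipped > 0:
            speech_total += clipped
    return speech_total / section_length
```

For a 10-second section containing speech from 0 to 2 s and from 5 to 8 s, the speech total is 5 s and the rate is 0.5.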

The processing of the audio reproduction device according to the fourth embodiment is described below with reference to FIG. 28. FIG. 28 is a flowchart showing the flow of this processing.

First, it is determined whether the user has instructed the input device (not shown) to reproduce the desired content (step S401). When the user gives the instruction, one section of the audio and video signals of the content is stored in the temporary storage unit 22, and the speech/non-speech discrimination unit 11 discriminates speech segments from non-speech segments in the audio signal of the section (step S402). The discrimination results of step S402 and the start and end times of the speech segments are also stored in the temporary storage unit 22.

After step S402, the speech content rate calculation unit 13 calculates the speech content rate of the section (step S403). Based on the speech content rate calculated in step S403 and the section companding ratio calculated by the companding ratio calculation unit 20, the speed ratio condition setting unit 14 calculates the average speed ratio of speech segments, the average speed ratio of non-speech segments, and the terminal speed ratio of speech segments (steps S404 and S405). This processing is the same as in the first embodiment. Next, in step S406, the speed ratio condition setting unit 14 compares the terminal speed ratio Vend1 calculated in step S405 with the upper-limit speed ratio Vend2 for the terminal speed ratio, which is either specified by the user or set in the device in advance. If step S406 determines that Vend1 is larger than Vend2, the speed ratio condition setting unit 14 corrects the terminal speed ratio Vend1 to the upper-limit speed ratio Vend2 (step S407). If step S406 determines that Vend1 is smaller than Vend2, the process proceeds to step S408.

The section speech content rate is a value within a section that forms only part of the content. If speech segments are locally concentrated in a section, the section speech content rate may therefore become locally large. Performing steps S406 and S407 prevents the terminal speed ratio of speech segments from becoming too large when the section speech content rate becomes large. Alternatively, the statistic calculation unit 21 described in the third embodiment may calculate the long-term speech content rate and use it to correct the upper-limit speed ratio.

After step S407, the speech segment length calculation unit 15 receives the start and end times of the speech segments and calculates the speech segment lengths (step S408). The speed ratio determination unit 16 refers to the start and end times of speech segments stored in the temporary storage unit 22 and determines, for each predetermined unit of time in order from the start of the section, whether the position lies in a speech segment (step S409). If it does, the speed ratio determination unit 16 calculates the elapsed ratio within the speech segment from the start and end times of the segment and the speech segment length calculated in step S408 (step S410).

After step S410, the speed ratio determination unit 16 determines the conversion speed ratio of the speech segment from its elapsed ratio (step S411). Step S411 is the same as in the first embodiment. If step S409 determines that the position lies in a non-speech segment, the speed ratio determination unit 16 uses the average speed ratio of non-speech segments set by the speed ratio condition setting unit 14 as the conversion speed ratio from the start to the end of the non-speech segment (step S412).

After steps S411 and S412, the speed ratio determination unit 16 determines whether the conversion speed ratio has been calculated up to the end of the section (step S413). If not, the process returns to step S409. In this way, the speed ratio determination unit 16 repeats steps S409 to S413 until the conversion speed ratio has been calculated up to the end of the section. When it is determined in step S413 that the conversion speed ratio has been calculated up to the end of the section, the speed conversion unit 17 converts the speed of the audio signal according to the conversion speed ratio, and reproduction of the speed-converted audio signal starts (step S414). The speed ratio determination unit 16 then determines whether the content subject to speed conversion has been reproduced up to its end time (step S415). If not, the audio signal for the next section is stored in the temporary storage unit 22, and the process returns to step S402. When it is determined in step S415 that the content has been reproduced up to its end time, the input device (not shown) accepts an instruction as to whether to end the processing of the device (step S416). If the user is to perform speed conversion on other content (No in step S416), the process returns to step S401.

As described above, the audio reproduction device according to this embodiment can perform speed conversion section by section. Suppose, for example, that the user has been recording a broadcast program from its beginning but only becomes able to watch partway through the broadcast. Having missed the beginning, the user wants to watch it with speed conversion and therefore designates the beginning of the program as the start point of the speed conversion process. The end point of the speed conversion process is the point of the most recent recording. In other words, the span from the beginning to the most recently recorded point forms one section. The user can thus watch the span from the beginning to the most recently recorded point with speed conversion and then continue watching at normal speed. Because the audio reproduction device according to this embodiment performs speed conversion section by section, it can do so even while the content is still being recorded, without waiting for the whole recording to finish.

In addition, because the audio reproduction device according to this embodiment includes the temporary storage unit 22, it can calculate a measured value of the speech content rate, and can therefore perform speed conversion at a more suitable speed ratio than the second and third embodiments, which use a predicted value of the speech content rate. The temporary storage unit 22 also makes it possible to calculate a measured value of the speech section length. As long as a measured speech section length is used, the end speed ratio does not rise excessively because the speech section length is unknown, so speed conversion can be performed at a more suitable speed ratio than in the second and third embodiments, which use a predicted speech section length.

(Fifth embodiment)
With reference to FIG. 29, an audio reproduction device according to the fifth embodiment of the present invention will be described. FIG. 29 is a block diagram showing a configuration example of the audio reproduction device according to the fifth embodiment. In FIG. 29, the audio reproduction device comprises a sound discrimination unit 23, a storage unit 12, a speech content rate calculation unit 13, a speed ratio condition setting unit 14, a speech section length calculation unit 15, a speed ratio determination unit 16, a speed conversion unit 17, and a specific event content rate calculation unit 24.

In the first embodiment described above, speed ratios were calculated for speech sections and non-speech sections. This embodiment describes an audio reproduction device that can additionally calculate a separate speed ratio for specific event sections contained in the content. The audio reproduction device according to this embodiment differs from that of the first embodiment mainly in that it includes the sound discrimination unit 23 in place of the speech/non-speech discrimination unit 11 and further includes the specific event content rate calculation unit 24.

The sound discrimination unit 23 receives the audio signal and discriminates specific event sections, which contain a specific event sound, from speech sections and non-speech sections outside the specific event sections. A specific event sound may be a sound from an individual sound source or a group of sounds from multiple sound sources treated as one. Examples of sounds from individual sources include the voice of speaker A, the sound of instrument B, and a particular sound from device C. Examples of grouped sounds include the voices of multiple speakers taken together, music, and noise. There may be more than one specific event sound. In that case, the sound discrimination unit 23 discriminates multiple kinds of specific event section, as well as the speech sections and non-speech sections outside all of them. When the specific event sound is the voice of speaker A, speaker B, or the like, the speech sections discriminated by the sound discrimination unit 23 are the speech sections other than the specific event sections. The following description assumes that the specific event sound is music.

The sound discrimination unit 23 receives the audio signal and discriminates music sections, which are the specific event sections, from speech sections and non-speech sections outside the music sections. One known method for discriminating these sections is described in "Audio indexing from MPEG encoded data" (Yasuyuki Nakajima, Yang Lu, Masaru Sugano, Hiromasa Yanagihara, and Akio Yoneyama (KDDI R&D Laboratories), 2000, IEICE Transactions D-II, Vol. J83-D-II, No. 5, pp. 1361-1371). This method first classifies the audio signal into sounded parts and silent parts. It then uses Bayesian estimation to classify the sounded parts into three categories: speech, music, and cheering. Using such a method, the sound discrimination unit 23 discriminates music sections and the speech sections and non-speech sections outside them. Cheering is treated as belonging to the non-speech sections outside the music sections. The discrimination results, the start and end times of the speech sections, and the start and end times of the music sections are stored in the storage unit 12.
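As an illustration only, the step from per-frame classification results to sections with start and end times can be sketched in Python as follows. The classifier itself is outside the scope of this sketch: the four frame labels, the fixed frame length, and the function name are assumptions. Only the mapping of cheering and silence to non-speech follows the text.

```python
from itertools import groupby

# Mapping from the assumed classifier outputs to the three section types.
# Cheering and silence both count as non-speech.
SECTION_TYPE = {
    "speech": "speech",
    "music": "music",
    "cheer": "non-speech",
    "silence": "non-speech",
}


def frames_to_sections(frame_labels, frame_sec):
    """Merge consecutive frames of the same section type into (type, start, end) tuples."""
    sections = []
    index = 0  # index of the first frame of the current run
    for kind, run in groupby(SECTION_TYPE[label] for label in frame_labels):
        count = sum(1 for _ in run)
        sections.append((kind, index * frame_sec, (index + count) * frame_sec))
        index += count
    return sections


labels = ["silence", "cheer", "speech", "speech", "music", "music", "music", "silence"]
sections = frames_to_sections(labels, frame_sec=0.5)
```

The resulting tuples carry the start and end times that the later content rate and speed ratio calculations rely on.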

The specific event content rate calculation unit 24 calculates the content rate of the specific event from the start and end times of the specific event sections stored in the storage unit 12. The specific event content rate is the proportion of the content's audio signal occupied by specific event sections (here, music sections). In the following description, the specific event content rate is referred to as the music content rate. Specifically, the music content rate is the sum of the lengths of the music sections contained in an audio signal of a predetermined duration, divided by that duration. Here, it is the sum of the lengths of the music sections in the entire content divided by the content length.
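As an illustration only, the music content rate defined above can be sketched in Python as follows. The function name and the representation of sections as (start, end) pairs in seconds are assumptions made for this sketch.

```python
def content_rate(sections, total_length):
    """Sum of section lengths divided by the length of the audio considered."""
    if total_length <= 0:
        raise ValueError("total_length must be positive")
    return sum(end - start for start, end in sections) / total_length


# Two 30-second music sections in a 600-second piece of content.
music_sections = [(10.0, 40.0), (100.0, 130.0)]
music_content_rate = content_rate(music_sections, total_length=600.0)
```

Here the music sections add up to 60 seconds, so the music content rate is 10%.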

The speed ratio condition setting unit 14 first calculates the average speed ratio of the music sections (hereinafter, the music speed ratio) from the target companding ratio and the music content rate calculated by the specific event content rate calculation unit 24. The target companding ratio may be set by the user or preset in the device. Specifically, the speed ratio condition setting unit 14 calculates the music speed ratio from a table or correspondence function that relates the music content rate to the music speed ratio and that differs according to the target companding ratio. The table or correspondence function is assumed to be prepared in advance. FIG. 30 shows an example of a table relating the music content rate to the music speed ratio. The table in FIG. 30 shows the correspondence for a target companding ratio of 0.5.
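As an illustration only, a table lookup of this kind can be sketched in Python as follows. The contents of FIG. 30 are not reproduced in the text, so the table below is hypothetical: only the entry mapping a music content rate of 10% to a music speed ratio of 1.0 at a target companding ratio of 0.5 is taken from the description, and the remaining rows, the bucket boundaries, and the names are placeholders.

```python
from bisect import bisect_left

# Hypothetical tables, one per target companding ratio.
# Each row is (upper bound of music content rate, music speed ratio).
MUSIC_SPEED_TABLES = {
    0.5: [(0.10, 1.0), (0.30, 1.2), (0.50, 1.5), (1.00, 2.0)],
}


def music_speed_ratio(target_companding_ratio, music_content_rate):
    """Look up the music speed ratio for the given target and music content rate."""
    table = MUSIC_SPEED_TABLES[target_companding_ratio]
    bounds = [upper for upper, _ in table]
    # First row whose upper bound is not below the music content rate.
    row = bisect_left(bounds, music_content_rate)
    if row == len(table):
        raise ValueError("music content rate must not exceed 1.0")
    return table[row][1]


ratio = music_speed_ratio(0.5, 0.10)
```

A correspondence function could replace the table without changing the caller, which is why the lookup is wrapped in a single function here.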

Based on the calculated music speed ratio, the speed ratio condition setting unit 14 calculates the companding ratio for the speech sections and non-speech sections outside the music sections. Let Sm be the music content rate, F the music speed ratio, E the target companding ratio, and G the average speed ratio of the speech and non-speech sections outside the music sections. The music sections then take Sm / F of the original duration and the remaining sections take (1 - Sm) / G, and the two together must equal E. The average speed ratio G is therefore given by Equation (19):

G = (1 - Sm) / (E - Sm / F)   (19)

For example, when the target companding ratio is 0.5 and the music content rate is 10%, FIG. 30 gives a music speed ratio of 1x. Substituting E = 0.5, Sm = 0.1, and F = 1 into Equation (19) gives G = 2.25.

The companding ratio is the reciprocal of the speed ratio. Since the average speed ratio G is 2.25, the companding ratio for the speech and non-speech sections outside the music sections is its reciprocal, 0.44. Taking this companding ratio (0.44) as the target companding ratio for the speech and non-speech sections, the average speed ratio of the speech sections, the average speed ratio of the non-speech sections, and the end speed ratio of the speech sections can be calculated in the same way as in the first embodiment. The speed ratio calculation distribution used by the speed ratio condition setting unit 14 may be set according to the music content rate.
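As an illustration only, the worked example can be reproduced in Python as follows. The sketch assumes that the music sections consume Sm / F of the original duration and that the remaining sections must fit into what is left of the target E, which is consistent with the values E = 0.5, Sm = 0.1, F = 1, and G = 2.25 given above. The function name and the error handling are hypothetical.

```python
def remaining_average_speed_ratio(target_companding_ratio, music_content_rate, music_speed_ratio):
    """Average speed ratio G for the speech and non-speech sections outside the music sections."""
    E, Sm, F = target_companding_ratio, music_content_rate, music_speed_ratio
    remaining_budget = E - Sm / F   # share of the original duration left after the music
    if remaining_budget <= 0:
        raise ValueError("no playback time is left for the non-music sections")
    return (1.0 - Sm) / remaining_budget


G = remaining_average_speed_ratio(0.5, 0.1, 1.0)
print(round(G, 2), round(1.0 / G, 2))   # 2.25 0.44
```

The second printed value is the companding ratio that is then used as the target for the speech and non-speech sections.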

The processing of the audio reproduction device according to the fifth embodiment is described below with reference to FIG. 31. FIG. 31 is a flowchart showing the flow of processing of the audio reproduction device according to the fifth embodiment.

First, when the user instructs the device through the input device (not shown) to record content, the audio signal and video signal of the content are stored in the storage unit 12. At this time, the sound discrimination unit 23 discriminates music sections and the speech sections and non-speech sections outside them (step S501). The discrimination results, the start and end times of the speech sections, and the start and end times of the music sections obtained in step S501 are also stored in the storage unit 12.

After step S501, the input device determines whether the user has given an instruction to reproduce desired content (step S502). If so (Yes in step S502), the speech content rate calculation unit 13 calculates the speech content rate of the designated content (step S503), and the specific event content rate calculation unit 24 calculates its music content rate (step S504).

After step S504, the speed ratio condition setting unit 14 calculates the music speed ratio from the target companding ratio and the music content rate calculated by the specific event content rate calculation unit 24 (step S505). Based on the calculated music speed ratio, the speed ratio condition setting unit 14 uses Equation (19) to calculate the companding ratio for the speech and non-speech sections outside the music sections. Using this companding ratio, it sets the average speed ratio of the speech sections, the average speed ratio of the non-speech sections, and the end speed ratio of the speech sections as the speed ratio conditions (step S506). The speech section length calculation unit 15 receives the start and end times of each speech section and calculates the speech section length (step S507).

After step S507, the speed ratio determination unit 16 refers to the start and end times of the music sections stored in the storage unit 12 and determines, for each predetermined unit time in order from the start of the content, whether that unit time belongs to a music section (step S508). If it does, the speed ratio determination unit 16 sets the conversion speed ratio to the music speed ratio calculated in step S505 (step S509). That is, the conversion speed ratio is constant at the music speed ratio from the start to the end of a music section.

If the unit time is determined in step S508 not to belong to a music section, the speed ratio determination unit 16 refers to the start and end times of the speech sections and determines whether it belongs to a speech section (step S510). If it does, the speed ratio determination unit 16 calculates the elapsed ratio within the speech section based on the average speed ratio of the speech sections set in step S506, the start and end times of the speech section, and the speech section length calculated in step S507 (step S511). The speed ratio determination unit 16 then determines the conversion speed ratio of the speech section from the elapsed ratio (step S512). If the unit time does not belong to a speech section, the speed ratio determination unit 16 sets the conversion speed ratio to the average speed ratio of the non-speech sections set in step S506 (step S513). That is, the conversion speed ratio is constant at that average speed ratio from the start to the end of a non-speech section. The processing in steps S511 to S513 is the same as in the first embodiment.

After steps S509, S512, and S513, the speed ratio determination unit 16 determines whether the conversion speed ratio has been calculated up to the end of the content (step S514). If not, the process returns to step S508. In this way, the speed ratio determination unit 16 repeats steps S508 to S514 until the conversion speed ratio has been calculated up to the end of the content. When it is determined in step S514 that the conversion speed ratio has been calculated up to the end of the content, the speed conversion unit 17 converts the speed of the audio signal according to the conversion speed ratio, and reproduction of the speed-converted audio signal starts (step S515). The input device (not shown) then accepts an instruction as to whether to end the processing of the device (step S516). If the user is to perform speed conversion on other content (No in step S516), the process returns to step S502.
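As an illustration only, the loop over steps S508 to S514 can be sketched in Python as follows. Sections are assumed to be (start, end) pairs in seconds, and the speech curve is passed in as a function of the elapsed ratio because its actual shape is defined in the first embodiment. All names are hypothetical.

```python
def section_containing(t, sections):
    """Return the (start, end) section containing time t, or None."""
    for start, end in sections:
        if start <= t < end:
            return (start, end)
    return None


def conversion_speed_ratios(content_length, unit, music, speech,
                            music_ratio, nonspeech_ratio, speech_ratio_at):
    """Walk the content in unit-time steps and choose one speed ratio per step."""
    ratios = []
    steps = int(round(content_length / unit))
    for i in range(steps):                                  # loop ends at S514
        t = i * unit
        if section_containing(t, music) is not None:        # S508 -> S509
            ratios.append(music_ratio)
            continue
        section = section_containing(t, speech)             # S510
        if section is not None:                             # S511 -> S512
            start, end = section
            ratios.append(speech_ratio_at((t - start) / (end - start)))
        else:                                               # S513
            ratios.append(nonspeech_ratio)
    return ratios


ratios = conversion_speed_ratios(
    content_length=4.0, unit=1.0,
    music=[(0.0, 1.0)], speech=[(1.0, 3.0)],
    music_ratio=1.0, nonspeech_ratio=3.0,
    speech_ratio_at=lambda r: 1.0 + r,   # placeholder curve for the example
)
```

Checking the music sections first mirrors the order of the flowchart, so a unit time that lies in a music section never reaches the speech test.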

As described above, the audio reproduction device according to this embodiment calculates the specific event content rate and sets a speed ratio for the specific event sections. When the user watches a music program, for example, the music sections, which are the specific event sections, can be reproduced more slowly than the other speech and non-speech sections. Speed conversion can thus give priority to the music. If the specific event sound is not music but the voice of a particular speaker A appearing in the content, priority is given to speaker A's voice, which is speed-converted at a slower speed than the speech sections outside the specific event sections. For example, when the user wants to check what speaker A said in content that has already been watched many times, it is useful to reproduce only speaker A's remarks at a slow speed. For long continuous recordings such as those from a security camera, noise portions can be identified as the specific event sound and reproduced at high speed, reducing the time spent watching redundant scenes. In this way, setting an individual speed for a specific event sound allows the speed to be chosen to suit the application: slow reproduction to draw attention to that portion, or fast reproduction to cut down redundant portions.

The audio reproduction devices according to the first to fifth embodiments described above may be realized by having a general-purpose computer system 50 execute an audio reproduction program. FIG. 32 is a block diagram showing a configuration example in which the audio reproduction device is realized by the computer system 50.

In FIG. 32, the computer system 50 comprises a CPU 51, a memory 52, a hard disk 53, a disk drive device 54, a monitor 55, a speaker 56, and an input device 57. By executing the audio reproduction program, the CPU 51 realizes the same functions as the components of the audio reproduction devices according to the first to fifth embodiments, other than the storage unit 12 and the temporary storage unit 22. When the audio reproduction program is executed, the memory 52 and the hard disk 53 realize the same functions as the storage unit 12 and the temporary storage unit 22.

The disk drive device 54 reads the audio reproduction program from a recording medium 58 that stores the audio reproduction program for causing the computer system 50 to function as an audio reproduction device. Installing the audio reproduction program on any computer system 50 allows that computer system 50 to function as the audio reproduction device described above.

The recording medium 58 is a recording medium in a format readable by the disk drive device 54, such as a flexible disk or an optical disc. The audio reproduction program may be installed on the computer system 50 in advance, or it may be provided over a telecommunication line such as the Internet. The audio reproduction processing may also be performed wholly or partly in hardware.

The monitor 55 displays, for example, video signals recorded on the recording medium 58 and read through the disk drive device 54, and video signals recorded on the hard disk 53. The speaker 56 converts into sound and reproduces audio signals recorded on the recording medium 58 and read through the disk drive device 54, audio signals recorded on the hard disk 53, and audio signals after speed conversion. The input device 57 consists of, for example, a keyboard and a mouse, and accepts input such as the target companding ratio.

In this way, the audio reproduction devices according to the first to fifth embodiments described above can be realized by having a general-purpose computer system 50 execute an audio reproduction program.

The audio reproduction devices according to the first to fifth embodiments described above may also be realized as a single chip using an integrated circuit such as an LSI or a dedicated signal processing circuit. Alternatively, they may be realized as separate chips, each corresponding to the function of one component of the audio reproduction device. Although the term LSI is used here, the terms IC, system LSI, super LSI, and ultra LSI are also used depending on the degree of integration. The method of circuit integration is not limited to LSI; a dedicated circuit or a general-purpose processor may be used instead. An FPGA (Field Programmable Gate Array), which can be programmed after the LSI is manufactured, or a reconfigurable processor, in which the connections and settings of circuit cells inside the LSI can be reconfigured, may also be used. Furthermore, if advances in semiconductor technology or a derivative technology give rise to an integrated circuit technology that replaces LSI, the functional blocks may naturally be integrated using that technology.

The audio reproduction device according to the present invention can perform speed conversion suited to the input audio signal while achieving the target time, and is useful for AV content viewing equipment such as hard disk recorders and DVD recorders, and for applications running on personal computers and on mobile devices such as mobile phones. Beyond viewing, it is also useful in learning content reproduction systems and the like for making content easier to understand, and for making it easier to grasp the outline of long content such as video captured by security cameras.

FIG. 1: Block diagram showing a configuration example of the audio reproduction device according to the first embodiment.
FIG. 2: Diagram showing the speech content rate by genre.
FIG. 3: Diagram showing the mean and standard deviation of the speech content rate for each genre.
FIG. 4: Diagram showing five calculation patterns.
FIG. 5: Diagram showing an example of a speed ratio calculation distribution.
FIG. 6: Diagram showing the target companding ratio, the average speed ratio of speech sections, and the average speed ratio of non-speech sections when the speech content rate is 0.5.
FIG. 7: Schematic diagram showing the change in speed ratio in speech and non-speech sections.
FIG. 8: Flowchart showing the flow of processing of the audio reproduction device according to the first embodiment.
FIG. 9: Diagram showing the speech section lengths in a news program and their frequencies.
FIG. 10: Diagram showing the speech section lengths in a baseball program and their frequencies.
FIG. 11: Diagram showing the change in the companding ratio of a speech section.
FIG. 12: Diagram showing the case where a two-stage conversion speed ratio is calculated.
FIG. 13: Diagram showing an example of a speed ratio calculation distribution for genres with many still images, such as documentaries.
FIG. 14: Diagram showing the target companding ratio, the average speed ratio of speech sections, and the average speed ratio of non-speech sections when the speech content rate is 0.5 in the speed ratio calculation distribution of FIG. 13.
FIG. 15: Diagram showing an example of a speed ratio calculation distribution for genres with many fast-moving scenes, such as sports.
FIG. 16: Diagram showing the target companding ratio, the average speed ratio of speech sections, and the average speed ratio of non-speech sections when the speech content rate is 0.5 in the speed ratio calculation distribution of FIG. 15.
FIG. 17: Block diagram showing a configuration example of the audio reproduction device according to the second embodiment.
FIG. 18: Diagram schematically showing the method of obtaining the predicted speech content rate Y(z) given by Equation (11).
FIG. 19: Diagram showing an example of calculated results for the speech content rate X(z) and the predicted speech content rate Y(z).
FIG. 20: Diagram showing the speech section length predicted based on Equation (14).
FIG. 21: Flowchart showing the flow of processing of the audio reproduction device according to the second embodiment.
FIG. 22: Block diagram showing a configuration example of the audio reproduction device according to the third embodiment.
FIG. 23: Diagram showing the change over time of the long-term speech content rate for each piece of content.
FIG. 24: Diagram showing the predicted speech content rate for each piece of content.
FIG. 25: Flowchart showing the processing of the speed ratio condition setting unit 14 according to the third embodiment.
FIG. 26: Diagram showing the distributions of the measured speech section length, the measured length of the immediately preceding speech section, and the predicted speech section length.
FIG. 27: Block diagram showing a configuration example of the audio reproduction device according to the fourth embodiment.
FIG. 28: Flowchart showing the flow of processing of the audio reproduction device according to the fourth embodiment.
FIG. 29: Block diagram showing a configuration example of the audio reproduction device according to the fifth embodiment.
FIG. 30: Diagram showing an example of a table relating the music content rate to the music speed ratio.
FIG. 31: Flowchart showing the flow of processing of the audio reproduction device according to the fifth embodiment.
FIG. 32: Block diagram showing a configuration example in which the audio reproduction device is realized by the computer system 50.
FIG. 33: Block diagram showing the configuration of a conventional audio reproduction device.

Explanation of symbols

11 Voice/non-voice discrimination unit
12 Storage unit
13 Voice content rate calculation unit
14 Speed ratio condition setting unit
15 Voice section length calculation unit
16 Speed ratio determination unit
17 Speed conversion unit
18 Voice content rate prediction unit
19 Voice section length prediction unit
20 Companding ratio calculation unit
21 Statistic calculation unit
22 Temporary storage unit
23 Sound discrimination unit
24 Specific event content rate calculation unit
50 Computer system
51 CPU
52 Memory
53 Hard disk
54 Disk drive device
55 Monitor
56 Speaker
57 Input device

Claims

An audio playback device that changes the playback speed of an audio signal of input content so that the audio signal is played back in a predetermined playback time, the device comprising:
discrimination means for discriminating, in the audio signal, between a voice section that contains voice and a non-voice section that does not contain voice;
voice content rate calculating means for calculating, based on the discrimination result of the discrimination means, a voice content rate indicating the proportion of voice sections in the audio signal;
speed ratio calculating means for calculating, based on the voice content rate, the respective speed ratios of the voice section and the non-voice section relative to the playback speed preset for the audio signal, such that the playback time of the audio signal becomes the predetermined playback time; and
speed conversion means for receiving the audio signal as input and converting the playback speeds of the voice section and the non-voice section included in the audio signal based on the respective speed ratios.
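
The relationship among the voice content rate, the two speed ratios, and the predetermined playback time can be illustrated with a minimal Python sketch. It is not the patented calculation: the function names, the use of seconds, the choice to fix the voice speed ratio first, and the definition of `overall_speedup` as original duration divided by target duration are all assumptions made for illustration.

```python
def voice_content_rate(segments):
    """Return the fraction of the total duration that is labelled as voice.

    segments: iterable of (duration_seconds, is_voice) pairs (assumed format).
    """
    segments = list(segments)
    total = sum(duration for duration, _ in segments)
    if total <= 0:
        raise ValueError("segments must have a positive total duration")
    voice = sum(duration for duration, is_voice in segments if is_voice)
    return voice / total


def non_voice_speed_ratio(rate, overall_speedup, voice_speed):
    """Solve rate/voice_speed + (1 - rate)/sn = 1/overall_speedup for sn.

    The left side is the output time per unit of input time, so meeting the
    equation means the whole signal plays in the target time.
    """
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    budget = 1.0 / overall_speedup - rate / voice_speed
    if budget <= 0.0:
        raise ValueError("voice_speed is too low to reach the target time")
    return (1.0 - rate) / budget


# Hypothetical 100 s signal, half voice, to be played in 50 s.
segments = [(30.0, True), (10.0, False), (20.0, True), (40.0, False)]
rate = voice_content_rate(segments)
sn = non_voice_speed_ratio(rate, overall_speedup=2.0, voice_speed=1.5)
print(rate, sn)
```

In this sketch the voice sections are slowed relative to the overall target (1.5x instead of 2x) and the non-voice sections absorb the difference, which is one way to keep speech intelligible while still meeting the target time.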

The audio playback device according to claim 1, wherein the speed ratio calculating means includes:
speed ratio condition setting means for calculating a voice average speed ratio, which is the average speed ratio of the voice section, and a non-voice average speed ratio, which is the average speed ratio of the non-voice section, and setting them as speed ratio conditions, using correspondence information that associates a companding ratio and the voice content rate with a method of calculating the voice average speed ratio and the non-voice average speed ratio, the companding ratio indicating the ratio by which the playback time at the playback speed preset for the audio signal is compressed or expanded to the predetermined playback time; and
speed ratio determining means for calculating the speed ratios of the voice section and the non-voice section by determining the speed ratio in each of the sub-sections into which the voice section is subdivided based on the voice average speed ratio, and determining the speed ratio in each of the sub-sections into which the non-voice section is subdivided based on the non-voice average speed ratio.

The audio playback device according to claim 2, further comprising voice section length calculating means for calculating, as a voice section length, the time from the start time to the end time of each voice section discriminated by the discrimination means, wherein
the speed ratio condition setting means further sets, as a speed ratio condition, an end speed ratio at the end time of the voice section in accordance with the voice average speed ratio, and
the speed ratio determining means determines, for each voice section discriminated by the discrimination means, the speed ratio in each sub-section of the voice section based on the voice average speed ratio, the voice section length, and the end speed ratio.

The audio playback device according to claim 3, wherein the speed ratio determining means determines, for each voice section discriminated by the discrimination means, the speed ratio in each sub-section of the voice section based on an elapsed ratio obtained by dividing the time elapsed from the start time of the voice section by the voice section length.

The audio playback device according to claim 2, wherein the speed ratio determining means determines the speed ratio in each sub-section of the voice section so that the playback speed increases as time elapses from the start time of the voice section.
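
One way to picture a speed ratio that depends on the elapsed ratio and rises toward an end speed ratio is a linear ramp. The following Python sketch is only an illustration under stated assumptions: the linear shape, the function name, the evaluation at sub-section midpoints, and the use of an arithmetic mean for the "average" are not taken from the claims.

```python
def voice_section_speed_ratios(avg_speed, end_speed, section_length, n_subsections):
    """Return one speed ratio per sub-section of a voice section.

    The ratio rises linearly with the elapsed ratio (elapsed time divided by
    section length) so that the arithmetic mean over the sub-sections equals
    avg_speed and the ramp heads toward end_speed at the section end.
    """
    if n_subsections < 1 or section_length <= 0:
        raise ValueError("need a positive length and at least one sub-section")
    if not avg_speed <= end_speed < 2.0 * avg_speed:
        raise ValueError("end_speed must lie in [avg_speed, 2 * avg_speed)")
    # For a linear ramp the mean is (start + end) / 2, which fixes the start.
    start_speed = 2.0 * avg_speed - end_speed
    step = section_length / n_subsections
    ratios = []
    for k in range(n_subsections):
        elapsed = (k + 0.5) * step              # midpoint of sub-section k
        elapsed_ratio = elapsed / section_length
        ratios.append(start_speed + (end_speed - start_speed) * elapsed_ratio)
    return ratios


# Hypothetical 4 s voice section split into 4 sub-sections.
print(voice_section_speed_ratios(avg_speed=1.5, end_speed=1.8,
                                 section_length=4.0, n_subsections=4))
```

Matching the arithmetic mean keeps the sketch simple, but it does not by itself guarantee that the output duration of the section equals the section length divided by `avg_speed`; an implementation that must hit the target time exactly would need to account for that difference.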

The audio playback device according to claim 2, wherein the speed ratio condition setting means calculates the voice average speed ratio and the non-voice average speed ratio using correspondence information that includes at least one type of method of calculating the voice average speed ratio and the non-voice average speed ratio and that indicates a different correspondence depending on the type of content constituted by the audio signal.

The audio playback device according to claim 2, wherein the speed ratio condition setting means creates the correspondence information so that the voice average speed ratio and the non-voice average speed ratio fall within a range specified by a user, and calculates the voice average speed ratio and the non-voice average speed ratio using that correspondence information.

The audio playback device, wherein the correspondence information indicates a correspondence between the voice content rate and the method of calculating the voice average speed ratio and the non-voice average speed ratio that differs according to the magnitude of the voice content rate.

The audio playback device according to claim 2, wherein the speed ratio condition setting means calculates the voice average speed ratio and the non-voice average speed ratio using correspondence information that includes at least one type of method of calculating the voice average speed ratio and the non-voice average speed ratio and that indicates a different correspondence depending on the user's purpose of use.

The audio playback device, wherein the correspondence information indicates a correspondence between the companding ratio and the method of calculating the voice average speed ratio and the non-voice average speed ratio that differs according to the magnitude of the companding ratio.

The audio playback device further comprising storage means for storing in advance the audio signal constituting the entire content and the discrimination result of the discrimination means for the audio signal constituting the entire content, wherein
the voice content rate calculating means calculates, based on the discrimination result stored in advance in the storage means, a voice content rate indicating the proportion of voice sections in the audio signal constituting the entire content.

The audio playback device according to claim 1, wherein the voice content rate calculating means sequentially calculates the voice content rate used in the calculation by the speed ratio calculating means, based on discrimination results obtained in the past by the discrimination means.
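
Sequential calculation from past discrimination results can be sketched as a sliding window over per-frame voice/non-voice labels. The class below is a hypothetical illustration in Python; the frame-based representation, the class name, and the fixed window size are assumptions and do not come from the claims.

```python
from collections import deque


class SlidingVoiceContentRate:
    """Running voice content rate over the most recent `window_frames` labels."""

    def __init__(self, window_frames):
        if window_frames < 1:
            raise ValueError("window_frames must be at least 1")
        self._frames = deque(maxlen=window_frames)
        self._voice_count = 0

    def push(self, is_voice):
        """Record the discrimination result for the newest frame."""
        if len(self._frames) == self._frames.maxlen:
            # The oldest label is about to be dropped by the bounded deque.
            self._voice_count -= self._frames[0]
        label = 1 if is_voice else 0
        self._frames.append(label)
        self._voice_count += label

    def rate(self):
        """Voice content rate of the frames currently in the window."""
        if not self._frames:
            return 0.0
        return self._voice_count / len(self._frames)


tracker = SlidingVoiceContentRate(window_frames=4)
for label in [True, True, False, False, False, True]:
    tracker.push(label)
print(tracker.rate())
```

Keeping a running count avoids rescanning the window on every update, which matters when the rate is refreshed frequently during playback.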

The audio playback device according to claim 1, wherein
the voice content rate calculating means sequentially calculates, at every second predetermined time that is equal to or shorter than a first predetermined time, the voice content rate used in the calculation by the speed ratio calculating means, based on the discrimination results obtained during the first predetermined time preceding the time of that calculation,
the audio playback device further comprises companding ratio calculating means for sequentially calculating a companding ratio at every second predetermined time based on the amount of data input to the speed conversion means, the amount of data output from the speed conversion means, and a companding ratio indicating the ratio by which the playback time at the playback speed preset for the audio signal is compressed or expanded to the predetermined playback time,
the speed ratio calculating means includes:
speed ratio condition setting means for calculating a voice average speed ratio, which is the average speed ratio of the voice section, and a non-voice average speed ratio, which is the average speed ratio of the non-voice section, using correspondence information that associates the companding ratio at every second predetermined time and the voice content rate at every second predetermined time with a method of calculating the voice average speed ratio and the non-voice average speed ratio, and for sequentially setting the calculated voice average speed ratio and non-voice average speed ratio as speed ratio conditions at every second predetermined time; and
speed ratio determining means for calculating, for the voice section and the non-voice section included in the second predetermined time, the speed ratios of the voice section and the non-voice section based on the speed ratio conditions sequentially set at every second predetermined time, by determining the speed ratio in each sub-section of the voice section based on the voice average speed ratio and the speed ratio in each sub-section of the non-voice section based on the non-voice average speed ratio, and
the speed conversion means converts the playback speeds of the voice section and the non-voice section included in the audio signal based on the respective speed ratios of the voice section and the non-voice section calculated by the speed ratio determining means.
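
One plausible reading of recalculating the companding ratio from the amounts of data already input to and output from the speed conversion is a "remaining budget" update: whatever has not yet been consumed must fit into whatever output time is left. The Python sketch below follows that reading only as an assumption; the function name, the use of seconds rather than data amounts, and the expression of the result as a speed-up factor are illustrative choices, not the patented formula.

```python
def remaining_speedup(total_in, target_out, consumed_in, produced_out):
    """Speed-up factor needed for the rest of the signal to finish on time.

    total_in:     original duration of the whole signal (seconds)
    target_out:   predetermined playback time (seconds)
    consumed_in:  original-time seconds already fed to the speed conversion
    produced_out: seconds of output already produced by the speed conversion
    """
    remaining_in = total_in - consumed_in
    remaining_out = target_out - produced_out
    if remaining_in <= 0:
        raise ValueError("no input remains")
    if remaining_out <= 0:
        raise ValueError("the target playback time is already used up")
    return remaining_in / remaining_out


# Hypothetical one-hour programme to be played in 30 minutes.
on_schedule = remaining_speedup(3600.0, 1800.0, 600.0, 300.0)
behind = remaining_speedup(3600.0, 1800.0, 600.0, 400.0)
print(on_schedule, behind)
```

When playback is on schedule the factor stays at the original target; when earlier sections were played more slowly than planned, the factor for the remainder rises to compensate.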

The audio playback device according to claim 13, further comprising statistic calculating means for calculating a statistic for suppressing the variation, at every second predetermined time, of the speed ratio in each sub-section of the voice section determined by the speed ratio determining means, wherein
the speed ratio condition setting means calculates the voice average speed ratio and an end speed ratio of the voice section based on the statistic, the voice content rate at the second predetermined time, and the companding ratio at the second predetermined time, and sequentially sets the calculated voice average speed ratio and end speed ratio as speed ratio conditions at every second predetermined time.

The audio playback device according to claim 14, wherein the statistic calculating means calculates, as the statistic, a voice content rate indicating the proportion of voice sections in the period from the start time of the content to the time of the calculation by the speed ratio calculating means, based on the discrimination results for that period.

The audio playback device according to claim 13, further comprising voice section length calculating means for sequentially calculating, based on discrimination results obtained in the past by the discrimination means, a voice section length that is the time from the start time to the end time of a voice section and that is used in the calculation by the speed ratio calculating means, wherein
the speed ratio condition setting means further sequentially sets, at every second predetermined time, an end speed ratio at the end time of the voice section in accordance with the voice average speed ratio as a speed ratio condition, and
the speed ratio determining means determines, for each voice section discriminated by the discrimination means, the speed ratio in each sub-section of the voice section based on the voice average speed ratio, the voice section length, and the end speed ratio.

The audio playback device according to claim 16, wherein the voice section length used in the calculation by the speed ratio calculating means is calculated based only on those voice section lengths that are equal to or longer than a predetermined section length, among the voice section lengths calculated from the start and end times of voice sections discriminated in the past by the discrimination means.

The audio playback device according to claim 16, wherein the voice section length used in the calculation by the speed ratio calculating means is calculated based on the maximum value of the voice section lengths calculated from the start and end times of voice sections discriminated in the past by the discrimination means.
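
The two ways of deriving a working voice section length from past sections (keeping only sufficiently long sections, or taking the maximum) can be sketched as follows. The claims do not state how the retained lengths are combined, so the mean used here is an assumption, as are the function name and parameters.

```python
def estimate_voice_section_length(past_lengths, min_length=None, use_max=False):
    """Estimate the voice section length from previously observed lengths.

    past_lengths: lengths (seconds) of voice sections discriminated so far
    min_length:   if given, lengths shorter than this are ignored
    use_max:      if True, return the maximum instead of the mean
    """
    lengths = list(past_lengths)
    if min_length is not None:
        lengths = [length for length in lengths if length >= min_length]
    if not lengths:
        raise ValueError("no past voice section lengths qualify")
    if use_max:
        return max(lengths)
    return sum(lengths) / len(lengths)


history = [0.4, 3.0, 0.2, 5.0, 4.0]  # hypothetical past section lengths
print(estimate_voice_section_length(history, min_length=1.0))
print(estimate_voice_section_length(history, use_max=True))
```

Filtering out very short sections keeps brief interjections from pulling the estimate down, while the maximum gives a conservative estimate that assumes the next section may be as long as the longest one seen so far.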

The audio playback device further comprising storage means for storing in advance the audio signal for a predetermined time and the discrimination result of the discrimination means for the audio signal for the predetermined time, wherein
the voice content rate calculating means calculates, based on the discrimination result stored in advance in the storage means, a voice content rate indicating the proportion of voice sections in the audio signal for the predetermined time.

The audio playback device, wherein
the discrimination means discriminates, in the audio signal, among a specific event section containing a specific event sound, the voice section other than the specific event section, and the non-voice section,
the audio playback device further comprises specific event content rate calculating means for calculating, based on the discrimination result of the discrimination means, a specific event content rate indicating the proportion of specific event sections in the audio signal,
the speed ratio calculating means calculates, based on the specific event content rate, the speed ratio of the specific event section relative to the playback speed preset for the audio signal, and calculates, based on the voice content rate, the respective speed ratios of the voice section and the non-voice section other than the specific event section relative to the playback speed preset for the audio signal, such that the playback time of the audio signal becomes the predetermined playback time, and
the speed conversion means converts the playback speeds of the specific event section and of the voice section and the non-voice section other than the specific event section included in the audio signal, based on the respective speed ratios.
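
Handling a third kind of section can be sketched by extending the time budget to three terms: the specific event section receives a speed ratio looked up from its content rate, and the non-voice speed ratio is then solved so that the total still meets the target. Everything concrete in the Python sketch below is hypothetical: the table values, the function names, and the choice to fix the voice speed ratio in advance are assumptions for illustration only.

```python
# (upper bound of event content rate, event speed ratio); illustrative values only.
EVENT_SPEED_TABLE = [(0.2, 1.0), (0.5, 1.2), (1.0, 1.5)]


def event_speed_ratio(event_rate):
    """Look up the speed ratio for specific event sections from their content rate."""
    for upper_bound, speed in EVENT_SPEED_TABLE:
        if event_rate <= upper_bound:
            return speed
    raise ValueError("event_rate must be at most 1.0")


def split_speed_ratios(event_rate, voice_rate, overall_speedup, voice_speed):
    """Return (event_speed, non_voice_speed) that meet the target playback time.

    Solves e/se + v/sv + n/sn = 1/R for sn, where e, v, n are the content
    rates of event, voice, and non-voice sections and R is overall_speedup.
    """
    non_voice_rate = 1.0 - event_rate - voice_rate
    if non_voice_rate <= 0.0:
        raise ValueError("no non-voice time is left to absorb the speed-up")
    event_speed = event_speed_ratio(event_rate)
    budget = (1.0 / overall_speedup
              - event_rate / event_speed
              - voice_rate / voice_speed)
    if budget <= 0.0:
        raise ValueError("target time cannot be met with these speed ratios")
    return event_speed, non_voice_rate / budget


print(split_speed_ratios(event_rate=0.1, voice_rate=0.4,
                         overall_speedup=1.5, voice_speed=1.25))
```

Keeping the event section near normal speed (for example, so music is not distorted) shifts more of the required speed-up onto the non-voice sections, which is why the solver raises an error when the remaining budget is exhausted.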

An audio playback method for changing the playback speed of an audio signal of input content so that the audio signal is played back in a predetermined playback time, the method comprising:
a discrimination step of discriminating, in the audio signal, between a voice section that contains voice and a non-voice section that does not contain voice;
a voice content rate calculating step of calculating, based on the discrimination result of the discrimination step, a voice content rate indicating the proportion of voice sections in the audio signal;
a speed ratio calculating step of calculating, based on the voice content rate, the respective speed ratios of the voice section and the non-voice section relative to the playback speed preset for the audio signal, such that the playback time of the audio signal becomes the predetermined playback time; and
a speed conversion step of converting the playback speeds of the voice section and the non-voice section included in the audio signal based on the respective speed ratios.

A program for causing a computer of an audio playback device to change the playback speed of an audio signal of input content so that the audio signal is played back in a predetermined playback time, the program causing the computer to execute:
a discrimination step of discriminating, in the audio signal, between a voice section that contains voice and a non-voice section that does not contain voice;
a voice content rate calculating step of calculating, based on the discrimination result of the discrimination step, a voice content rate indicating the proportion of voice sections in the audio signal;
a speed ratio calculating step of calculating, based on the voice content rate, the respective speed ratios of the voice section and the non-voice section relative to the playback speed preset for the audio signal, such that the playback time of the audio signal becomes the predetermined playback time; and
a speed conversion step of receiving the audio signal as input and converting the playback speeds of the voice section and the non-voice section included in the audio signal based on the respective speed ratios.

A computer-readable recording medium on which the program according to claim 22 is recorded.

An integrated circuit that changes the playback speed of an audio signal of input content so that the audio signal is played back in a predetermined playback time, the integrated circuit comprising:
discrimination means for discriminating, in the audio signal, between a voice section that contains voice and a non-voice section that does not contain voice;
voice content rate calculating means for calculating, based on the discrimination result of the discrimination means, a voice content rate indicating the proportion of voice sections in the audio signal;
speed ratio calculating means for calculating, based on the voice content rate, the respective speed ratios of the voice section and the non-voice section relative to the playback speed preset for the audio signal, such that the playback time of the audio signal becomes the predetermined playback time; and
speed conversion means for receiving the audio signal as input and converting the playback speeds of the voice section and the non-voice section included in the audio signal based on the respective speed ratios.