JP2007025039A

JP2007025039A - Voice reproducing device, voice recording/reproducing device, methods therefor, recording medium, and integrated circuit

Info

Publication number: JP2007025039A
Application number: JP2005204211A
Authority: JP
Inventors: Masayuki Misaki; 正之三崎; Meiko Masaki; 芽衣子正木; Takeshi Kawamura; 岳河村
Original assignee: Matsushita Electric Industrial Co Ltd
Current assignee: Panasonic Holdings Corp
Priority date: 2005-07-13
Filing date: 2005-07-13
Publication date: 2007-02-01
Anticipated expiration: 2025-07-13
Also published as: JP4580297B2

Abstract

PROBLEM TO BE SOLVED: To realize reproduced voice that is easier to hear by performing optimum speed ratio control according to fluctuations in the voice content rate.

SOLUTION: A voice/non-voice determination section 11 determines voice periods and non-voice periods. A voice information calculation section 12 calculates, as voice information, the voice content rate within a calculation frame length, together with the average value and standard deviation of the voice content rate. A speed ratio calculation section 14 then uses the voice information to calculate the speed ratio of the voice periods for each frame, and uses that result to calculate the speed ratio of the non-voice periods so that the reproduction time equals a target reproduction time. Based on the calculated speed ratios of the voice and non-voice periods, a voice speed changing section 15 changes the reproduction speed of the input voice signal and outputs the voice signal.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to an audio reproduction device, an audio recording/reproducing device, methods therefor, a recording medium, and an integrated circuit. More specifically, it relates to an audio reproduction device, an audio recording/reproducing device, methods therefor, a recording medium, and an integrated circuit that reproduce audio at a converted reproduction speed.

Conventionally, for audio reproduction devices that reproduce prerecorded audio, a method of reproducing at higher speed without changing the pitch of the voice has been known (see, for example, Patent Document 1). In the audio reproduction device disclosed in Patent Document 1, when the entire audio signal is reproduced at a designated speed, the reproduction speed ratio is partially lowered for voice sections. As a result, the conventional audio reproduction device disclosed in Patent Document 1 can provide reproduced audio that is easy to listen to with little loss of information.
Patent Document 1: JP 2001-222300 A

Hereinafter, the conventional audio reproduction device 9 disclosed in Patent Document 1 will be described specifically with reference to FIG. 11. FIG. 11 is a block diagram showing the configuration of the conventional audio reproduction device 9. In FIG. 11, the conventional audio reproduction device 9 includes an acoustic analysis unit 91, a speech speed conversion unit 92, a non-voice section length control unit 93, and a synthesis unit 94.

The acoustic analysis unit 91 discriminates voice sections and non-voice sections in the input audio data based on a preset power threshold. The acoustic analysis unit 91 then obtains time information for the voice sections and the non-voice sections, respectively. In the conventional audio reproduction device 9 shown in FIG. 11, different reproduction processes are applied to the voice sections and the non-voice sections discriminated by the acoustic analysis unit 91. The audio data of the voice sections and the above time information are output to the speech speed conversion unit 92. The audio data of the non-voice sections is output to the non-voice section length control unit 93.
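The power-threshold discrimination described above can be illustrated with a minimal Python sketch. The function names `classify_frames` and `segments`, the fixed frame length, and the use of mean squared amplitude as "power" are assumptions made for illustration only; the source states only that a preset power threshold is used and that time information is obtained for each section.

```python
def classify_frames(samples, frame_len, power_threshold):
    """Label each full frame True (voice) or False (non-voice).

    Assumption: "power" is the mean squared amplitude of the frame.
    """
    labels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        power = sum(x * x for x in frame) / frame_len
        labels.append(power >= power_threshold)
    return labels


def segments(labels, frame_len, sample_rate):
    """Merge runs of equal labels into (is_voice, start_s, end_s) tuples."""
    out = []
    for i, lab in enumerate(labels):
        t0 = i * frame_len / sample_rate
        t1 = (i + 1) * frame_len / sample_rate
        if out and out[-1][0] == lab:
            # Same label as the previous frame: extend the current section.
            out[-1] = (lab, out[-1][1], t1)
        else:
            out.append((lab, t0, t1))
    return out
```

The tuples returned by `segments` play the role of the time information for the voice and non-voice sections; a practical discriminator would likely add smoothing so that single noisy frames do not split a section.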

The speech speed conversion unit 92 first identifies, based on the audio data of the voice sections and the time information, a voice section that lies between non-voice sections of at least a certain length. The speech speed conversion unit 92 then performs speed ratio control such that the speed ratio at the beginning of that voice section is slower than a predetermined speed ratio and gradually returns to the predetermined speed ratio toward the end. The audio data of the voice section whose speed ratio has been controlled is output to the synthesis unit 94. The speech speed conversion unit 92 also outputs, to the non-voice section length control unit 93, delay time information for the voice section caused by the waveform expansion process.
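As a rough illustration of this conventional control, the following Python sketch ramps the speed ratio from a slower starting value up to the predetermined ratio and computes the resulting delay. The linear ramp shape, the per-frame granularity, and the names `ramp_ratios` and `delay_seconds` are assumptions; the source says only that the ratio starts slower and gradually returns to the predetermined ratio.

```python
def ramp_ratios(n_frames, target_ratio, start_ratio):
    """Per-frame speed ratios rising linearly from start_ratio to target_ratio."""
    if n_frames <= 0:
        return []
    if n_frames == 1:
        return [start_ratio]
    step = (target_ratio - start_ratio) / (n_frames - 1)
    return [start_ratio + step * i for i in range(n_frames)]


def delay_seconds(ratios, frame_dur, target_ratio):
    """Extra playback time versus playing every frame at target_ratio.

    A frame of duration frame_dur played at ratio r lasts frame_dur / r.
    """
    played = sum(frame_dur / r for r in ratios)
    nominal = len(ratios) * frame_dur / target_ratio
    return played - nominal
```

The value returned by `delay_seconds` corresponds to the delay time information that the non-voice sections must absorb.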

Meanwhile, the non-voice section length control unit 93 deletes and compresses the audio data of the non-voice sections as appropriate, based on the delay time information output from the speech speed conversion unit 92. In other words, the non-voice section length control unit 93 performs processing that matches the target designated speed ratio and cancels the voice section delay introduced by the speech speed conversion unit 92. The audio data of the non-voice sections processed by the non-voice section length control unit 93 is output to the synthesis unit 94.

The synthesis unit 94 combines the audio data of the voice sections output from the speech speed conversion unit 92 with the audio data of the non-voice sections output from the non-voice section length control unit 93. The synthesis unit 94 then outputs the final reproduced audio, using as converted audio data the combined audio data of the speed-converted voice sections and the non-voice sections.

In the conventional audio reproduction device 9, when, for example, m-times speed (m being a positive number of 1 or more) is given as the designated speed, the beginning of a voice section is reproduced at a speed ratio slower than m-times speed. The device then gradually increases the reproduction speed ratio toward the end of the voice section. In general, the beginning of a voice section often contains important information. The conventional audio reproduction device 9 can therefore achieve easy-to-hear reproduction without losing the important information at the beginning of a voice section. In this way, the conventional audio reproduction device 9 processes voice sections so that they are easy to hear and processes non-voice sections so that the designated speed ratio is met.

During high-speed reproduction, the speech rate increases, and the load on the user of understanding the content grows. Moreover, when voice sections are concentrated unevenly within a program as a whole (that is, when voice is uttered continuously), understanding becomes even harder for the user. However, the conventional audio reproduction device 9 only contemplates changing the reproduction speed ratio within a single voice section. That is, the conventional audio reproduction device 9 applies the same speed ratio control process throughout an entire television program, for example. The conventional audio reproduction device 9 therefore has the fundamental problem that the audio content is relatively difficult to hear in portions where voice sections are concentrated.

Therefore, an object of the present invention is to provide an audio reproduction device, an audio recording/reproducing device, methods therefor, a recording medium, and an integrated circuit that realize reproduction that is easier to hear by performing optimum speed ratio control that takes an entire program, such as a television program, into account.

A first invention is an audio reproduction device that reproduces an input audio signal in a shortened reproduction time by converting the speed from the normal (1x) reproduction speed set for the audio signal. The device comprises: a discrimination unit that discriminates, in the audio signal, voice sections that contain voice from non-voice sections that do not; a voice information calculation unit that calculates, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation unit that takes as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, sets the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and sets it larger than the reference value when the voice content rate of that time length is relatively low.
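The relationship between the voice content rate and the voice-section speed ratio can be sketched in Python as follows. This is an illustrative reading only: the function names, the frame-label input, the comparison against a mean rate as the meaning of "relatively high" or "relatively low", and the fixed `step` are all assumptions not stated in this passage.

```python
def voice_content_rates(labels, frames_per_window):
    """Fraction of voice frames in each window of frames_per_window labels.

    labels is a list of booleans (True = voice frame); the last window
    may be shorter than the others.
    """
    rates = []
    for start in range(0, len(labels), frames_per_window):
        window = labels[start:start + frames_per_window]
        rates.append(sum(window) / len(window))
    return rates


def voice_speed_ratio(rate, mean_rate, reference, step=0.25):
    """Pick a voice-section speed ratio around the reference value.

    Assumption: "relatively high/low" is judged against mean_rate, and the
    adjustment is a fixed hypothetical step.
    """
    if rate > mean_rate:
        return reference - step  # dense speech: slow down
    if rate < mean_rate:
        return reference + step  # sparse speech: speed up
    return reference
```

For example, with a reference of 2.0, a window that is 75% voice against a mean of 50% would be played at 1.75x, and a window that is 25% voice at 2.25x.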

In a second invention based on the first invention, the speed ratio calculation unit sets a shortened reproduction time in accordance with a user operation and, based on the calculated speed ratio of the voice sections, calculates the speed ratio of the non-voice sections so that the reproduction time of the audio signal equals the set reproduction time.
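One way to read this calculation is as a time budget: whatever time remains after the voice sections are played at their own ratios must be filled by the non-voice sections. The Python sketch below assumes a single constant non-voice ratio and uses the hypothetical name `non_voice_ratio`; the exact formula used in the patent is not given in this passage.

```python
def non_voice_ratio(voice_durations, voice_ratios, non_voice_total, target_time):
    """Speed ratio for non-voice sections so total playback hits target_time.

    voice_durations[i] is the original length (seconds) of voice section i,
    played at voice_ratios[i]; non_voice_total is the summed original length
    of all non-voice sections.
    """
    voice_time = sum(d / r for d, r in zip(voice_durations, voice_ratios))
    remaining = target_time - voice_time
    if remaining <= 0:
        raise ValueError("target time is shorter than the converted voice sections")
    return non_voice_total / remaining
```

For a 100-second recording with two 30-second voice sections played at 1.5x and 2.0x, and a 50-second target, the voice sections take 35 seconds, so the 40 seconds of non-voice material must be played at 40 / 15, or about 2.67x.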

In a third invention based on the second invention, the speed ratio calculation unit calculates the speed ratio of the non-voice sections as a constant value within the set reproduction time.

In a fourth invention based on the first invention, the predetermined time length includes one or more unit time lengths, and the speed ratio calculation unit sets the speed ratio calculated for the predetermined time length as the speed ratio of the voice sections in any one of the unit time lengths included in that predetermined time length.

In a fifth invention based on the first invention, the audio reproduction device further comprises: a buffer that records the input audio signal while sequentially updating it so that it contains at least the predetermined time length of the audio signal; and a speed conversion unit that performs speed conversion processing on the audio signal recorded in the buffer and outputs the result. The discrimination unit discriminates voice sections and non-voice sections in the audio signal of the predetermined time length recorded in the buffer. The voice information calculation unit further calculates, as voice information, statistical values relating to the voice content rate and sequentially updates prestored statistical values every unit time. The speed ratio calculation unit calculates the speed ratio of the voice sections according to the statistical values updated every unit time and the voice content rate of the predetermined time length set at the time of the update. The speed conversion unit sequentially performs speed conversion processing on the audio signal sequentially updated in the buffer, using the speed ratio of the voice sections calculated every unit time.
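The per-unit-time update of statistical values could be implemented in many ways. The Python sketch below uses Welford's online algorithm to maintain a running mean and population standard deviation of the voice content rate; this technique and the class name `RunningStats` are substitutions chosen for illustration and are not stated in the source.

```python
import math


class RunningStats:
    """Running mean and population standard deviation (Welford's method)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        """Fold in one new voice content rate sample."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def std(self):
        return math.sqrt(self._m2 / self.n) if self.n else 0.0
```

Calling `update` once per unit time with the latest voice content rate keeps the statistics current without storing the full history, which suits reproduction that starts while the signal is still arriving.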

In a sixth invention based on the first invention, the voice information calculation unit further calculates, as voice information, statistical values relating to the voice content rate, and the speed ratio calculation unit calculates the speed ratio of the voice sections according to the statistical values and the voice content rate.

In a seventh invention based on the fifth or sixth invention, the statistical values are the average value and the standard deviation of the voice content rate for each predetermined time length.

In an eighth invention based on the seventh invention, the speed ratio calculation unit calculates the speed ratio of the voice sections by multiplying the reference value of the speed ratio by a coefficient that depends on the standard deviation and on the deviation of the voice content rate in the predetermined time length from the average value.
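The form of the coefficient is not specified in this passage. As a purely hypothetical sketch, the Python function below normalizes the deviation from the average by the standard deviation and maps it linearly to a coefficient that falls below 1 when the voice content rate is above average. The name `voice_ratio_from_stats`, the linear mapping, the `gain`, and the clamping bounds `lo` and `hi` are all assumptions.

```python
def voice_ratio_from_stats(rate, mean, std, reference, gain=0.1, lo=0.5, hi=1.5):
    """Reference speed ratio scaled by a coefficient from (rate - mean) / std.

    Hypothetical mapping: coeff = 1 - gain * z, clamped to [lo, hi].
    """
    if std == 0:
        coeff = 1.0  # no variation observed: keep the reference value
    else:
        z = (rate - mean) / std
        coeff = 1.0 - gain * z
    coeff = min(hi, max(lo, coeff))
    return reference * coeff
```

With a reference of 2.0, a mean of 0.5, and a standard deviation of 0.2, a window with a voice content rate of 0.8 yields a ratio of about 1.7, while a window with a rate of 0.2 yields about 2.3.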

In a ninth invention based on the eighth invention, the voice information calculation unit sets a plurality of predetermined time lengths of mutually different lengths and calculates a voice content rate for each. For a unit time length that is shorter than every one of the predetermined time lengths, the speed ratio calculation unit calculates the speed ratio of the voice sections included in that unit time length by multiplying the reference value of the speed ratio by the sum of the coefficients corresponding to the voice content rates of the respective predetermined time lengths that contain that unit time length in common.
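A hedged Python sketch of combining several window lengths follows. The passage says only that the coefficients for each predetermined time length are summed and multiplied by the reference value; the per-window weights (assumed here to sum to 1 so that the total stays near 1), the linear coefficient form, and the name `combined_ratio` are assumptions made for illustration.

```python
def combined_ratio(reference, window_stats, weights, gain=0.1):
    """Speed ratio for one unit time from several overlapping windows.

    window_stats holds one (rate, mean, std) tuple per window length that
    contains the unit time; weights holds the hypothetical weight of each
    window length.
    """
    total = 0.0
    for (rate, mean, std), w in zip(window_stats, weights):
        z = 0.0 if std == 0 else (rate - mean) / std
        total += w * (1.0 - gain * z)  # weighted coefficient for this window
    return reference * total
```

A short window lets the result track rapid changes in the voice content rate, while a long window contributes the broader trend; the weights control the balance between the two.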

A tenth invention is an audio reproduction method that reproduces an input audio signal in a shortened reproduction time by converting the speed from the normal (1x) reproduction speed set for the audio signal. The method includes: a discrimination step of discriminating, in the audio signal, voice sections that contain voice from non-voice sections that do not; a voice information calculation step of calculating, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation step of taking as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, setting the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and setting it larger than the reference value when the voice content rate of that time length is relatively low.

An eleventh invention is a computer-readable recording medium on which is recorded an audio reproduction program, executed by a computer, that reproduces an input audio signal in a shortened reproduction time by converting the speed from the normal (1x) reproduction speed set for the audio signal. The program causes the computer to execute: a discrimination step of discriminating, in the audio signal, voice sections that contain voice from non-voice sections that do not; a voice information calculation step of calculating, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation step of taking as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, setting the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and setting it larger than the reference value when the voice content rate of that time length is relatively low.

A twelfth invention is an integrated circuit that speeds up reproduction of an input audio signal by converting the speed from the normal (1x) reproduction speed set for the audio signal. The integrated circuit comprises: a discrimination unit that discriminates, in the audio signal, voice sections that contain voice from non-voice sections that do not; a voice information calculation unit that calculates, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation unit that takes as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, sets the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and sets it larger than the reference value when the voice content rate of that time length is relatively low.

A thirteenth invention is an audio recording/reproducing device that reproduces an input audio signal in a shortened reproduction time by converting the speed from the normal (1x) reproduction speed set for the audio signal. The device comprises: an information recording unit that records the input audio signal; a discrimination unit that discriminates, in the audio signal before it is recorded in the information recording unit, voice sections that contain voice from non-voice sections that do not; a voice information calculation unit that calculates, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation unit that takes as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, sets the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and sets it larger than the reference value when the voice content rate of that time length is relatively low.

In a fourteenth invention based on the thirteenth invention, the result of the discrimination performed by the discrimination unit is recorded in the information recording unit when the audio signal is recorded, and the voice information calculation unit calculates the voice information based on the result recorded in the information recording unit.

In a fifteenth invention based on the thirteenth invention, the result of the discrimination performed by the discrimination unit and the voice information are recorded in the information recording unit when the audio signal is recorded, and the speed ratio calculation unit calculates the speed ratio of the voice sections using the voice information recorded in the information recording unit.

A sixteenth invention is an audio recording/reproducing method that reproduces an input audio signal in a shortened reproduction time by converting the speed from the normal (1x) reproduction speed set for the audio signal. The method includes: an information recording step of recording the input audio signal; a discrimination step of discriminating, in the audio signal before it is recorded in the information recording step, voice sections that contain voice from non-voice sections that do not; a voice information calculation step of calculating, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation step of taking as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, setting the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and setting it larger than the reference value when the voice content rate of that time length is relatively low.

A seventeenth invention is a computer-readable recording medium on which is recorded an audio recording/reproducing program, executed by a computer, that reproduces an input audio signal in a shortened reproduction time by converting the speed from the normal (1x) reproduction speed set for the audio signal. The program causes the computer to execute: an information recording step of recording the input audio signal in a recording unit; a discrimination step of discriminating, in the audio signal before it is recorded in the recording unit, voice sections that contain voice from non-voice sections that do not; a voice information calculation step of calculating, as voice information on the voice sections and the non-voice sections, at least a voice content rate indicating the proportion of a predetermined time length occupied by voice sections; and a speed ratio calculation step of taking as a reference value a speed ratio of 1 or more relative to the normal reproduction speed, setting the speed ratio of the voice sections in a predetermined time length smaller than the reference value when the voice content rate of that time length is relatively high, and setting it larger than the reference value when the voice content rate of that time length is relatively low.

According to the first invention, calculating the speed ratio of the voice sections according to fluctuations in the voice content rate makes the reproduced sound of the input audio signal after speed conversion highly intelligible in a way that tracks those fluctuations.

According to the second invention, the speed ratio of the non-voice sections, which contain no important voice information, is calculated separately from the speed ratio of the voice sections so that the set reproduction time is achieved. This allows the speed ratio of the voice sections to be adjusted to a value within the range the user can follow by ear.

According to the third invention, setting the speed ratio of the non-voice sections, which contain no important voice information, to a constant value enables efficient speed-converted reproduction.

According to the fourth invention, when the predetermined time length is long and contains many unit time lengths, for example, the speed ratio set for the voice sections reflects the overall trend of the voice content rate and is more accurate. Conversely, when the predetermined time length is short and contains few unit time lengths, the speed ratio set for the voice sections is sensitive to fluctuations in the voice content rate and follows them more closely. In other words, the balance between accuracy and responsiveness to fluctuations in the voice content rate can be chosen freely for the speed ratio set for the voice sections.

According to the fifth invention, updating the statistical values every unit time allows the audio signal to be speed-converted and reproduced immediately as it is input.

According to the sixth invention, using the statistical values to calculate the speed ratio of the voice sections yields a speed ratio that better matches the actual fluctuations in the voice content rate. As a result, the reproduced sound after speed conversion becomes more intelligible and natural.

According to the seventh invention, the speed ratio of the voice sections can be calculated taking into account how unevenly the voice sections are distributed.

According to the eighth invention, the speed ratio of the voice sections can be calculated in accordance with how unevenly the voice sections are distributed.

According to the ninth invention, the speed ratio of the voice sections included in a unit time length is calculated by multiplying the reference value of the speed ratio by the sum of the coefficients corresponding to the voice content rates of the respective predetermined time lengths that contain that unit time length in common. This yields an optimum speed ratio for the voice sections that responds to both rapid and overall fluctuations in the voice content rate.

According to the thirteenth invention, calculating the speed ratio of the voice sections according to fluctuations in the voice content rate makes the reproduced sound of the recorded audio signal after speed conversion highly intelligible in a way that tracks those fluctuations.

According to the fourteenth invention, the processing time between recording the audio signal and starting speed-converted reproduction can be shortened by the processing time of the discrimination unit.

According to the fifteenth invention, the processing time between recording the audio signal and starting speed-converted reproduction can be shortened by the processing time of the discrimination unit and the voice information calculation unit, so that speed-converted reproduction can begin immediately after the audio signal is recorded.

(First embodiment)
An audio reproduction device according to a first embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram showing the configuration of the audio reproduction device 1 according to the first embodiment of the present invention. In FIG. 1, the audio reproduction device 1 includes a voice/non-voice discrimination unit 11, a voice information calculation unit 12, a voice information recording unit 13, a speed ratio calculation unit 14, and a voice speed conversion unit 15. The audio reproduction device 1 according to this embodiment assumes that the entire audio signal recorded on a recording medium or the like can be read out once before the audio signal is speed-converted and reproduced. Examples of recorded material include television and radio programs. The recording medium may also be a medium such as a DVD on which a movie or the like has been recorded in advance. In the following description, as an example, the audio reproduction device 1 according to the first embodiment performs speed conversion processing on the audio signal of a recorded television program.

An audio signal recorded on a recording medium or the like is read out and input to the voice/non-voice discrimination unit 11. The voice/non-voice discrimination unit 11 analyzes properties of the input audio signal such as its power envelope and periodicity, and discriminates voice sections from non-voice sections on the time axis. The resulting information on the voice sections and non-voice sections (hereinafter, discrimination information) is output to the voice information calculation unit 12 before speed-converted playback is performed.
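The discrimination step can be illustrated with a minimal sketch that labels fixed-length windows by mean power alone. The function names, the window length, and the threshold of -30 dB are hypothetical, and the periodicity analysis mentioned above is omitted, so this is not the discrimination method of the embodiment itself.

```python
import math


def window_power_db(samples):
    """Mean power of one analysis window in dB (the floor avoids log of zero)."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_sq, 1e-12))


def classify_windows(signal, window_len, threshold_db=-30.0):
    """Label each non-overlapping window True (voice) or False (non-voice).

    Hypothetical rule: a window counts as voice when its mean power exceeds
    threshold_db. Periodicity is not examined in this sketch.
    """
    labels = []
    for start in range(0, len(signal) - window_len + 1, window_len):
        window = signal[start:start + window_len]
        labels.append(window_power_db(window) > threshold_db)
    return labels


# Synthetic input: four windows of a 200 Hz tone, then four windows of near silence.
rate = 8000
win = 400
tone = [0.5 * math.sin(2 * math.pi * 200 * t / rate) for t in range(4 * win)]
quiet = [0.001 * math.sin(2 * math.pi * 200 * t / rate) for t in range(4 * win)]
labels = classify_windows(tone + quiet, win)
```

A real discriminator would typically smooth the labels over time so that brief pauses inside speech are not split off as separate non-voice sections.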

Based on the discrimination information, the voice information calculation unit 12 calculates the voice information needed to calculate the speed ratios of the voice sections and the non-voice sections. The voice information includes the voice content rate, the average value of the voice content rate, and its standard deviation. Specifically, the voice information calculation unit 12 calculates the voice content rate over the entire recorded program and then calculates the average value and the standard deviation of the voice content rate. The voice content rate, its average value, and its standard deviation calculated by the voice information calculation unit 12 are each recorded in the voice information recording unit 13. The voice content rate, its average value, and its standard deviation are described below.

The voice content rate indicates the proportion of time occupied by voice sections within a predetermined number (one or more) of frames. The voice content rate is calculated for each frame. Here, a frame is a section obtained by dividing the input audio signal into unit times, and the time length of a frame is called the frame length. A frame contains voice sections and/or non-voice sections. The one or more frames used to calculate the voice content rate are called the calculation frame, and their time length is called the calculation frame length. In the following description, as an example, the time length of one frame (one frame length) is 1 minute, and the calculation frame length used to calculate the voice content rate is n minutes (n is a positive number). Because one frame length is 1 minute, the calculation frame consists of n frames. The entire recorded program is assumed to contain N frames (N is a positive number). With the frame number denoted k (k = 1 to N), the frame with frame number k is called the "k-th frame". The voice content rate Ris_n(k) of the k-th frame is then expressed by Equation (1).

That is, the voice content rate Ris_n(k) of the k-th frame calculated by Equation (1) indicates the proportion of the calculation frame length that is occupied by voice sections.
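The calculation can be sketched as follows, under the assumption that Ris_n(k) is the mean of the one-minute voice content rates of frames k to k + n - 1. This matches the description of Ris_5(1) and Ris_10(1) given for FIGS. 3 and 4, but the window position for general k and the handling of the final frames are assumptions. The function name and the sample values are illustrative only.

```python
def content_rate(per_frame_rate, n):
    """Ris_n(k) for k = 1..N as a forward mean over n one-minute frames.

    Assumption: near the end of the program the window is truncated to the
    frames that remain; the text does not state its boundary handling.
    """
    total = len(per_frame_rate)
    rates = []
    for k in range(total):
        window = per_frame_rate[k:min(k + n, total)]
        rates.append(sum(window) / len(window))
    return rates


ris_1 = [0.2, 0.6, 0.8, 0.4, 0.5, 0.7]  # made-up one-minute voice content rates
ris_3 = content_rate(ris_1, 3)          # three-minute calculation frame length
```

With n = 1 the function returns the per-frame rates unchanged, which corresponds to the case where the calculation frame length equals one frame length.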

Calculation examples of the voice content rate Ris_n(k) are given with reference to FIGS. 2 to 4. In FIGS. 2 to 4, as an example, the voice content rate of a 30-minute television documentary program is calculated with three calculation frame lengths: 1 minute, 5 minutes, and 10 minutes. FIG. 2 shows a calculation example of the voice content rate Ris_1(k) for a calculation frame length of 1 minute. FIG. 3 shows a calculation example of the voice content rate Ris_5(k) for a calculation frame length of 5 minutes. FIG. 4 shows a calculation example of the voice content rate Ris_10(k) for a calculation frame length of 10 minutes. In FIGS. 2 to 4, the horizontal axis represents the frame number (k) and the vertical axis represents the voice content rate (%). One frame length is 1 minute, and the number N of frames in the entire program is 30.

In FIG. 2, because the calculation frame length is 1 minute, the voice content rate Ris_1(1) of the first frame (k = 1) given by Equation (1) is simply the voice content rate of the first frame itself. In FIG. 3, the voice content rate Ris_5(1) of the first frame given by Equation (1) is the average of the voice content rates of the first to fifth frames in FIG. 2. In FIG. 4, the voice content rate Ris_10(1) of the first frame given by Equation (1) is the average of the voice content rates of the first to tenth frames in FIG. 2.

As FIGS. 2 to 4 show, the voice content rate fluctuates differently for each calculation frame length. Specifically, when the calculation frame length is short (FIG. 2), the frame-to-frame variation of the voice content rate is relatively large. In other words, a short calculation frame length sensitively reflects the actual fluctuation of the voice content rate. In contrast, as FIGS. 3 and 4 show, the frame-to-frame variation becomes relatively small as the calculation frame length increases. This is because, as described above, the voice content rates of the individual frames are averaged over a longer span as the calculation frame length increases. In other words, with a long calculation frame length, small variations are absorbed by the averaging, and the fluctuation of the voice content rate is reflected at a global level. The variance and standard deviation for each calculation frame length also take different values because of these differences in variation.

Next, the average value and the standard deviation of the voice content rate are described. The average value of the voice content rate is the value obtained by averaging the voice content rate Ris_n(k) over the entire program. In FIG. 2 above, it is the average of the voice content rates Ris_1(1) to Ris_1(30). Expressed in terms of the calculation frame length n (n is a positive number), the average value is the average of the voice content rates Ris_n(1) to Ris_n(N). The standard deviation is calculated from the voice content rate Ris_n(k) and the average value of the voice content rate. Calculating the average value and the standard deviation for each calculation frame length from the values of Ris_n(k) shown in FIGS. 2 to 4 gives the values shown in FIG. 5. FIG. 5 shows the calculated average value and standard deviation of the voice content rate for each calculation frame length. In FIG. 5, the average value A1 for a calculation frame length of 1 minute is 0.506, the average value A5 for a calculation frame length of 5 minutes is 0.498, and the average value A10 for a calculation frame length of 10 minutes is 0.488. Likewise, the standard deviation S1 corresponding to the average value A1 is 0.161, the standard deviation S5 corresponding to the average value A5 is 0.073, and the standard deviation S10 corresponding to the average value A10 is 0.028.
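The two statistics can be sketched with the Python standard library. The use of the population standard deviation (division by N) is an assumption, since the text does not say whether N or N - 1 is used, and the input values are made up rather than taken from FIGS. 2 to 5.

```python
import statistics


def voice_statistics(rates):
    """Return (average value, standard deviation) of the voice content rates.

    Assumption: population standard deviation (divide by N).
    """
    return statistics.fmean(rates), statistics.pstdev(rates)


fine = [0.2, 0.8, 0.3, 0.7, 0.4, 0.6]                      # made-up short-window rates
smoothed = [(a + b) / 2 for a, b in zip(fine, fine[1:])]   # two-frame averages of the same data

mean_fine, sd_fine = voice_statistics(fine)
mean_smooth, sd_smooth = voice_statistics(smoothed)
```

The averaged series has the smaller standard deviation, which is the same tendency as S1 > S5 > S10 in FIG. 5.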

As FIG. 5 shows, when the calculation frame length is short, the variation (spread) is large and the standard deviation is therefore large. When the calculation frame length is long, the variation is small and the standard deviation is therefore small. The standard deviation is thus strongly affected by the calculation frame length, and it can generally be regarded as a value indicating how unevenly voice sections are distributed over the entire program.

Next, at the stage where the input audio signal is speed-converted and reproduced, the speed ratio calculation unit 14 uses the voice information recorded in the voice information recording unit 13 (the voice content rate, its average value, and its standard deviation) to calculate, for each frame, a voice-section speed ratio that reflects the uneven distribution of voice sections. The speed ratio calculation unit 14 then calculates the speed ratio of the non-voice sections based on the voice-section speed ratios and the desired playback time entered by the user or the like. The speed ratio calculation unit 14 then sets the speed ratio of each frame on the discrimination information produced by the voice/non-voice discrimination unit 11 and outputs it to the voice speed conversion unit 15. Here, the voice-section speed ratio calculated for each frame is applied uniformly to the voice sections within that frame. The non-voice-section speed ratio is applied to the non-voice sections within the frame as, for example, a constant speed ratio, as described later.

Before the method of calculating the speed ratio is explained, the optimality of the voice-section speed ratio is discussed. Suppose that, in order to listen to the audio signal in less time than the recording time, a target playback time ratio Rt (0 < Rt < 1) is given, which is the set value of the playback time length relative to the recording time. For example, if the user wants to listen in half the recording time, Rt = 0.5. This target playback time ratio Rt is expressed by Equation (2), in which A0 is the average value of the voice content rate, SRs0 is the speed ratio of the voice sections when the voice content rate is constant, and SRns0 is the speed ratio of the non-voice sections when the voice content rate is constant.

Equation (2) shows that, once the target playback time ratio Rt and the average value of the voice content rate are given, fixing either the voice-section speed ratio SRs0 or the non-voice-section speed ratio SRns0 determines the other.
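The dependence between the two speed ratios can be sketched numerically. The relation assumed here is Rt = A0/SRs0 + (1 - A0)/SRns0, which follows from the definitions of Rt, A0, SRs0, and SRns0: the voice share of the recording is played in 1/SRs0 of its time and the remainder in 1/SRns0 of its time. The function name is hypothetical.

```python
def constant_non_voice_ratio(rt, a0, srs0):
    """Solve Rt = A0/SRs0 + (1 - A0)/SRns0 for SRns0 (assumed relation)."""
    remaining = rt - a0 / srs0  # playback-time share left for the non-voice sections
    if remaining <= 0:
        raise ValueError("target playback time is too short for this voice speed ratio")
    return (1.0 - a0) / remaining


# Half the recording is voice, played at 1.5x, and the target is half the recording time.
srns0 = constant_non_voice_ratio(rt=0.5, a0=0.5, srs0=1.5)
achieved_rt = 0.5 / 1.5 + (1.0 - 0.5) / srns0
```

The error branch reflects the fact that, for a given SRs0, a target shorter than the voice playback time alone cannot be met however fast the non-voice sections are played.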

The voice-section speed ratio SRs0 in Equation (2) is generally easier to listen to the closer it is to 1.0, the normal (1x) speed. As SRs0 increases, the amount of information per unit time increases, so listening becomes harder for the user. When SRs0 reaches about 2.0, the content is difficult to understand unless the user concentrates on listening. A large SRs0 therefore makes prolonged listening considerably difficult. Accordingly, SRs0 is best set within the range in which the user can listen, largely independent of the target playback time ratio Rt. On this basis, a range of about 1 to 1.8 is normally used for SRs0, and for a constant speed ratio, SRs0 is in practice often set to 1.3 to 1.5.

In the present embodiment, while taking the optimum setting range of the voice-section speed ratio SRs0 into account, the standard deviation is regarded, as described above, as indicating how unevenly voice sections are distributed over the entire program, and the voice-section speed ratio is varied from SRs0 using the standard deviation and the difference between the voice content rate and its average value. That is, with the speed ratio SRs0 as the reference value, the voice-section speed ratio is set lower than the reference value where voice sections are concentrated and the voice content rate is higher than its average value, and conversely is set higher than the reference value where the voice content rate is lower than its average value.

Here, let N be the number of frames in the entire program, Sn the standard deviation for a calculation frame length of n minutes, Ris_n(k) the voice content rate of the k-th frame for a calculation frame length of n minutes, SRs(k) the voice-section speed ratio of the k-th frame, An the average value of the voice content rate for a calculation frame length of n minutes, Cn a weighting coefficient that differs for each calculation frame length, SRns the non-voice-section speed ratio, and SRs0 the reference speed ratio obtained by assuming a constant voice content rate. The non-voice-section speed ratio SRns is here a constant value that does not depend on the voice content rate of the frame. The voice-section speed ratio SRs(k), which reflects the uneven distribution of the voice content rate, is then expressed, for example, by Equation (3).

Furthermore, to calculate the voice-section speed ratio SRs(k) as a value that reflects both global and short-term fluctuations of the voice content rate, voice information for several calculation frame lengths of different time lengths is used. That is, the voice-section speed ratio is calculated by using the voice information of multiple calculation frame lengths in combination. If the voice information of M calculation frame lengths is used, the voice-section speed ratio SRs(k) of the k-th frame is given by Equation (4).

In Equation (4), Cn is a weighting coefficient that differs for each calculation frame length and indicates how strongly the deviation of the voice content rate at each calculation frame length is reflected in the voice-section speed ratio SRs0.

When the voice information for calculation frame lengths of 1 minute, 5 minutes, and 10 minutes is used as the combined voice information, the voice-section speed ratio SRs(k) is given by Equation (5).

FIG. 6 shows an example of the speed ratios calculated with Equation (5) using the combined voice information. The calculation example in FIG. 6 uses SRs0 = 1.5, C1 = 1, C2 = 10, and C3 = 20 in Equation (5) and is intended to produce speed ratios that emphasize long-term fluctuation over short-term fluctuation. A1, A5, A10, S1, S5, S10, Ris_1(k), Ris_5(k), and Ris_10(k) take the values shown in FIGS. 2 to 5. For comparison, FIG. 6 also shows the speed ratios calculated with Equation (3) from the voice information of each single calculation frame length (1 minute, 5 minutes, and 10 minutes).
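The behaviour described for the combined calculation can be illustrated with a sketch whose functional form is an assumption, because the equations themselves are not reproduced in this text. The assumed form is SRs(k) = SRs0 * (1 - sum over n of Cn * Sn * (Ris_n(k) - An)), chosen only because it lowers the speed ratio where the voice content rate is above its average and raises it where the rate is below. The actual Equation (5) may differ. The constants are the example values quoted above; the per-frame rates are made up.

```python
SRS0 = 1.5

# (C_n, S_n, A_n) for the 1-, 5-, and 10-minute calculation frame lengths,
# using the example coefficients and the FIG. 5 statistics quoted in the text.
PARAMS = [(1, 0.161, 0.506), (10, 0.073, 0.498), (20, 0.028, 0.488)]


def voice_speed_ratio(srs0, params, ris_k):
    """Voice-section speed ratio of one frame under the ASSUMED form
    SRs(k) = SRs0 * (1 - sum(C_n * S_n * (Ris_n(k) - A_n))).

    ris_k holds Ris_1(k), Ris_5(k), Ris_10(k) in the same order as params.
    """
    adjustment = sum(c * s * (ris - a) for (c, s, a), ris in zip(params, ris_k))
    return srs0 * (1.0 - adjustment)


dense = voice_speed_ratio(SRS0, PARAMS, [0.80, 0.60, 0.52])   # made-up voice-heavy frame
sparse = voice_speed_ratio(SRS0, PARAMS, [0.20, 0.40, 0.45])  # made-up voice-light frame
```

Under this assumed form a frame whose rates equal the averages A1, A5, and A10 is played at exactly the reference value SRs0.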

In FIG. 6, the graph plotted with diamonds shows the voice-section speed ratio calculated using the combined voice information. The graph plotted with circles shows the voice-section speed ratio calculated using only the voice information for a calculation frame length of 1 minute. The graph plotted with squares shows the voice-section speed ratio calculated using only the voice information for a calculation frame length of 5 minutes. The graph plotted with triangles shows the voice-section speed ratio calculated using only the voice information for a calculation frame length of 10 minutes.

As FIG. 6 shows, compared with the speed ratios calculated from the voice information of a single calculation frame length, the voice-section speed ratio calculated using the combined voice information reflects both short-term and long-term fluctuations of the voice content rate. In other words, the voice-section speed ratio calculated using the combined voice information follows the uneven distribution of voice sections throughout the program and is the optimum speed ratio.

After calculating the voice-section speed ratio SRs by the method described above, the speed ratio calculation unit 14 calculates the non-voice-section speed ratio SRns so as to achieve the target playback time ratio Rt set from the entered playback time. As described above, SRns is, for example, a constant rather than a variable speed ratio. This is because most of the useful information is contained in the voice sections. The audio reproduction device according to the present embodiment can thereby achieve efficient playback. The method of calculating the non-voice-section speed ratio SRns is described below.

Using the voice-section speed ratio SRs(k) of each frame calculated from Equation (4), the target playback time ratio Rt is expressed by Equation (6). Here, Ris(k) is the voice content rate obtained with the shortest calculation frame length. In the example above, the shortest of the three calculation frame lengths is 1 minute.

Rearranging Equation (6) therefore gives Equation (7) for the non-voice-section speed ratio SRns.

As Equation (7) shows, the voice-section speed ratio SRs(k) is calculated for each frame, whereas the non-voice-section speed ratio SRns is calculated as a constant speed ratio that does not depend on the frame (that is, on k). A calculation example for SRns follows. Suppose the voice-section speed ratio is calculated using the combined voice information for 1 minute, 5 minutes, and 10 minutes, with SRs0 = 1.5 and weighting coefficients C1 = 1, C2 = 10, and C3 = 20 in Equation (4). As shown in FIG. 6, the voice-section speed ratio SRs(k) calculated using the combined voice information then ranges from 1.23 to 1.68. Let the target playback time ratio Rt be, for example, 0.5. Equation (7) then gives a non-voice-section speed ratio SRns of 3.177. That is, SRns is set to a higher speed ratio than the voice-section speed ratio (for example, the 1.23 to 1.68 shown in FIG. 6). In this way, the speed ratio calculation unit 14 uses the voice information recorded in the voice information recording unit 13 to calculate, for each frame, a voice-section speed ratio that follows the fluctuation of the voice content rate, and calculates the non-voice-section speed ratio as a constant irrespective of the frame. The calculated speed ratios of the voice sections and non-voice sections are output to the voice speed conversion unit 15.
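The calculation of the constant non-voice-section speed ratio can be sketched as follows. The sketch assumes frames of equal length and the relation Rt = (1/N) * sum over k of (Ris(k)/SRs(k) + (1 - Ris(k))/SRns), which follows from the definitions of the quantities involved, and solves it for SRns. The function name and the three-frame input are illustrative, not values from FIG. 6.

```python
def non_voice_speed_ratio(rt, ris, srs):
    """Solve Rt = (1/N) * sum(Ris(k)/SRs(k) + (1 - Ris(k))/SRns) for SRns.

    ris: voice content rate of each frame at the shortest calculation frame length.
    srs: voice-section speed ratio of each frame.
    Assumes all frames have the same length.
    """
    n = len(ris)
    voice_playback = sum(r / s for r, s in zip(ris, srs))  # voice time after conversion, in frames
    non_voice_recorded = sum(1.0 - r for r in ris)         # non-voice time before conversion, in frames
    budget = n * rt - voice_playback                       # playback time left for non-voice sections
    if budget <= 0:
        raise ValueError("target playback time cannot be met with these voice speed ratios")
    return non_voice_recorded / budget


ris = [0.8, 0.5, 0.2]     # made-up one-minute voice content rates
srs = [1.25, 1.5, 1.6]    # made-up per-frame voice-section speed ratios
srns = non_voice_speed_ratio(0.5, ris, srs)

# Check: playing each frame with (SRs(k), SRns) reproduces the target ratio.
achieved_rt = sum(r / s + (1.0 - r) / srns for r, s in zip(ris, srs)) / len(ris)
```

As in the example in the text, the resulting SRns is well above the voice-section speed ratios, because the non-voice sections absorb all of the remaining time reduction.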

Based on the speed ratios of the voice sections and non-voice sections calculated by the speed ratio calculation unit 14, the voice speed conversion unit 15 performs speed conversion processing on the input audio signal recorded on the recording medium or the like. One method of speed conversion is, for example, to compress or expand the input audio signal on the time axis. The method is not limited to this, however, and other known methods may be used. The audio signal speed-converted by the voice speed conversion unit 15 of the present embodiment is thus an audio signal converted at a speed ratio that varies dynamically according to the discrimination result of the voice/non-voice discrimination unit 11 and the voice content rate.
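How section-dependent speed ratios shorten a signal can be illustrated with a deliberately naive sketch. Picking every speed_ratio-th sample shortens a section to the intended length but also raises its pitch, so this only illustrates the resulting durations and is not the time-axis compression method of the embodiment. All names are hypothetical.

```python
def naive_time_compress(samples, speed_ratio):
    """Shorten a section to 1/speed_ratio of its length by sample picking.

    Illustration only: this also shifts pitch, which practical speed
    conversion methods are designed to avoid.
    """
    out_len = int(len(samples) / speed_ratio)
    last = len(samples) - 1
    return [samples[min(int(i * speed_ratio), last)] for i in range(out_len)]


def convert_sections(sections, srs, srns):
    """Apply srs to voice sections and srns to non-voice sections of one frame.

    sections: list of (is_voice, samples) pairs in playback order.
    """
    output = []
    for is_voice, samples in sections:
        ratio = srs if is_voice else srns
        output.extend(naive_time_compress(samples, ratio))
    return output


# One voice section and one non-voice section of 300 samples each.
sections = [(True, list(range(300))), (False, list(range(300)))]
converted = convert_sections(sections, srs=1.5, srns=3.0)
```

Here 600 input samples become 300 output samples, which corresponds to a playback time ratio of 0.5 for this pair of sections.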

Next, the processing flow of the audio reproduction device 1 according to the present embodiment is described with reference to FIG. 7. FIG. 7 is a flowchart showing the processing flow of the audio reproduction device 1 according to the present embodiment. In FIG. 7, the user first sets a target playback time relative to the recording time of the entire program recorded on, for example, a recording medium (step S1). The target playback time ratio Rt (0 < Rt < 1) is thereby set. Next, the entire program recorded on the recording medium or the like is read out, and the voice/non-voice discrimination unit 11 discriminates voice sections from non-voice sections throughout the program before playback (step S2). The voice information calculation unit 12 then calculates the voice content rate for each of several calculation frame lengths based on the voice/non-voice section information obtained in step S2 (step S3). Next, the voice information calculation unit 12 calculates the average value and the standard deviation of the voice content rate for each calculation frame length using the voice content rates calculated in step S3 (step S4). The voice information calculated in steps S3 and S4 (the voice content rate, its average value, and its standard deviation) is recorded in the voice information recording unit 13 (step S5). The processing up to this point is performed before playback. After the voice information has been calculated for the entire program, speed-converted playback begins. At the playback stage, the speed ratio calculation unit 14 calculates, for each frame, a voice-section speed ratio that reflects the uneven distribution of voice sections, based on the voice information recorded in the voice information recording unit 13 (step S6). Next, the speed ratio calculation unit 14 calculates the non-voice-section speed ratio based on the voice-section speed ratios calculated in step S6 and the target playback time ratio Rt set in step S1 (step S7). The speed ratio of each frame is then set on the voice/non-voice discrimination information produced by the voice/non-voice discrimination unit 11 and output to the voice speed conversion unit 15. After step S7, speed conversion processing is performed on the input audio signal recorded on the recording medium or the like, based on the speed ratios of the voice sections and non-voice sections calculated in steps S6 and S7 (step S8). This concludes the description of the processing flow of the audio reproduction device 1 according to the present embodiment.

As described above, the audio reproduction device according to the present embodiment first calculates the voice content rate over the entire audio signal, then calculates the average value and the standard deviation of the voice content rate as statistical values, and thereby obtains in advance how unevenly the voice sections are distributed within the program. By calculating the speed ratio of the voice section from this voice information, the device obtains a voice-section speed ratio that varies dynamically with fluctuations in the voice content rate. That is, the device reduces the speed ratio where voice is concentrated and increases it where voice is sparse. This allows the intelligibility of the voice to be maintained throughout a television program, a movie, or the like. The speed ratio of the non-voice section is calculated as a constant speed ratio, based on the speed ratio of the voice section, so that the predetermined playback time is obtained. This enables playback at an efficient playback speed. In addition, by multiplexing the voice information of the individual calculation frame lengths to obtain statistical values such as the average value, smoother speed ratio control that tracks both long-term and short-term fluctuations in the voice content rate can be realized.
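The qualitative behaviour summarized above (a lower speed ratio where voice is concentrated, a higher one where it is sparse) can be sketched as follows. This is not the patent's own equation for SRs(k), which is not reproduced in this passage; the linear mapping on a standardized content rate, the gain, and the clamping limits are all illustrative assumptions, as is the function name.

```python
def voice_speed_ratio(content_rate, mean, std, reference=1.5,
                      gain=0.2, lo=1.0, hi=2.5):
    """Return a per-frame voice speed ratio (hypothetical mapping).

    Frames whose voice content rate is above the programme mean get a
    ratio below `reference`; frames below the mean get a ratio above it.
    The result is clamped to the range [lo, hi].
    """
    if std <= 0:
        return reference  # no spread: every frame uses the reference value
    z = (content_rate - mean) / std  # how unusual this frame is
    return min(hi, max(lo, reference - gain * z))
```

Standardizing by the standard deviation is one way to make the same gain usable for programmes whose voice content rate varies a little or a lot; other mappings would satisfy the same description.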

The speed ratio calculation unit 14 described above calculates the speed ratio SRs(k) of the voice section by multiplexing the voice information of the individual calculation frame lengths, but the calculation is not limited to this. For example, SRs(k) may be calculated using only a single calculation frame length. When a long calculation frame length is used, the calculated speed ratio of the voice section reflects the overall trend of the changing voice content rate and is therefore a more accurate value. When a short calculation frame length is used, the calculated speed ratio of the voice section tracks fluctuations in the voice content rate more closely.

The speed ratio calculation unit 14 described above uses the voice content rate Ris_n(k), the average value An of the voice content rate, and the standard deviation Sn as the voice information for calculating the speed ratio of the voice section, but the voice information is not limited to these. For example, a statistical value equivalent to the standard deviation, such as the variance or the average deviation, may be used instead of the standard deviation. In other words, the voice information for calculating the speed ratio of the voice section includes, in addition to the voice content rate Ris_n(k), the average value An of the voice content rate and a statistical value equivalent to the standard deviation.
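The quantities named above can be sketched in a few lines. The sketch assumes that the discrimination result is available as one boolean per analysis frame and that the content rate at position k is taken over a trailing window; the window placement, the function names, and the reading of "average deviation" as mean absolute deviation are assumptions made for illustration.

```python
import statistics

def voice_content_rates(is_voice, window):
    """Voice content rate for each position k.

    `is_voice` holds one boolean per analysis frame (True = voice section).
    The rate at k is the share of voice frames among the last `window`
    frames ending at k (a shorter span is used near the start).
    """
    rates = []
    for k in range(len(is_voice)):
        span = is_voice[max(0, k - window + 1): k + 1]
        rates.append(sum(span) / len(span))
    return rates

def summarize(rates):
    """Mean plus three spread measures that could stand in for each other."""
    mean = statistics.fmean(rates)
    std = statistics.pstdev(rates)          # population standard deviation
    variance = statistics.pvariance(rates)  # population variance
    mean_abs_dev = statistics.fmean([abs(r - mean) for r in rates])
    return mean, std, variance, mean_abs_dev
```

Calling `voice_content_rates` with several window sizes (for example the equivalents of 1, 5 and 10 minutes) gives one series per calculation frame length, each of which can be passed to `summarize`.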

The speed ratio calculation unit 14 described above calculates the speed ratio of the voice section for each frame. However, each voice section within a frame may be further divided into parts such as the beginning, middle, and end of a sentence, and the speed ratio may be varied for each part. For example, at the beginning of a sentence in a given voice section, the speed ratio is set slightly lower than the voice-section speed ratio calculated by the speed ratio calculation unit 14, and it is then increased toward the end of the sentence. This makes the beginning of the sentence, which contains much of the important information, easier for the user to hear. In this way, the speed ratio calculation unit 14 may vary the speed ratio for each part within a single voice section.
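One simple way to realize such a within-section variation is a linear ramp around the frame's voice-section speed ratio. The sketch below is only an illustration; the number of parts, the size of the spread, and the function name are assumptions not taken from the description.

```python
def ramped_speed_ratios(base_ratio, n_parts=3, spread=0.1):
    """Split one voice section into `n_parts` and ramp the speed ratio.

    The first part (sentence head) is played slightly slower than
    `base_ratio` and the last part (sentence end) slightly faster.
    The values are linearly spaced and their arithmetic mean equals
    `base_ratio`.
    """
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    if n_parts == 1:
        return [base_ratio]
    step = 2 * spread / (n_parts - 1)
    return [base_ratio - spread + i * step for i in range(n_parts)]
```

Keeping the mean of the ratios at the base value does not by itself keep the playback time of the section unchanged, so an implementation along these lines would still need to check the total against the target playback time.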

The voice/non-voice discrimination unit 11, the voice information calculation unit 12, the speed ratio calculation unit 14, and the voice speed conversion unit 15 described in the first embodiment can be realized by an information processing apparatus, such as a general computer system, that takes an audio signal as input and outputs the audio signal speed-converted by the voice speed conversion unit 15. In this case, the present invention can be realized by storing a program that causes a computer to execute the operations described above on a predetermined information recording medium, and having the computer read and execute the program stored on that medium. The user enters the desired playback time using an input unit, such as a keyboard, connected to the information processing apparatus. The voice information calculated by the voice information calculation unit 12 is recorded on, for example, a hard disk in the information processing apparatus, but it may instead be recorded in a memory in the information processing apparatus or on another recording medium outside the apparatus. The information recording medium that stores the program is, for example, a nonvolatile semiconductor memory such as a ROM or flash memory, or a CD-ROM, DVD, or similar optical disc recording medium. The program may also be supplied to the information processing apparatus via another medium or a communication line.

(Second Embodiment)
An audio reproduction device according to the second embodiment of the present invention will be described with reference to FIG. 8. FIG. 8 is a block diagram showing the configuration of the audio reproduction device 2 according to the second embodiment of the present invention. In FIG. 8, the audio reproduction device 2 includes an input buffer 21, a voice/non-voice discrimination unit 11, a voice information sequential update unit 22, a speed ratio calculation unit 14, and a voice speed conversion unit 15.

The audio reproduction device 2 according to the present embodiment assumes that the entire audio signal of, for example, a television program or a movie has already been recorded on a recording medium or the like. While temporarily storing a part (a predetermined length of time) of the recorded audio signal, the device calculates the voice information sequentially and performs speed-converted playback immediately in response to the input of the audio signal. The audio reproduction device 2 therefore differs from the audio reproduction device 1 according to the first embodiment mainly in that it additionally includes the input buffer 21 and in that the voice information sequential update unit 22 updates the voice information sequentially. The following description focuses on these differences. The voice/non-voice discrimination unit 11, the speed ratio calculation unit 14, and the voice speed conversion unit 15 are the same as in the first embodiment; they are given the same reference numerals, and their detailed description is omitted.

An audio signal recorded on a recording medium or the like is input to the input buffer 21, which buffers the input audio signal as appropriate. That is, the input buffer 21 temporarily holds the audio signal data for the predetermined length of time that the voice information sequential update unit 22 needs in order to update the voice information sequentially. The temporarily stored audio signal for the predetermined time is output to both the voice/non-voice discrimination unit 11 and the voice speed conversion unit 15. The voice/non-voice discrimination unit 11 discriminates the voice sections and the non-voice sections in the input audio signal for the predetermined time. The voice/non-voice section information determined by the voice/non-voice discrimination unit 11 is output to both the voice information sequential update unit 22 and the speed ratio calculation unit 14.
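A buffer that always holds only the most recent predetermined amount of signal can be sketched with a bounded double-ended queue. The class below is a minimal illustration, not the structure of the input buffer 21 itself; the class name, the method names, and the choice to measure capacity in frames are assumptions.

```python
from collections import deque

class InputBuffer:
    """Keeps only the most recent `capacity` frames of audio."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)

    def push(self, frame):
        # When the buffer is full, the oldest frame is dropped automatically.
        self._frames.append(frame)

    def is_full(self):
        return len(self._frames) == self._frames.maxlen

    def snapshot(self):
        """Return the buffered frames, oldest first."""
        return list(self._frames)
```

Each call to `push` corresponds to recording one new frame while the total held stays at the configured length, and `snapshot` gives the span that would be handed to the discrimination and speed conversion stages.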

The voice information sequential update unit 22 sequentially updates the voice information based on the voice/non-voice section discrimination information. In the first embodiment, the voice content rate Ris_n(k) in Equations (3) and (4) was first calculated for the entire audio signal, and the average value An and the standard deviation Sn of the voice content rate, which are statistical values, were calculated afterward. In the present embodiment, by contrast, in order to perform speed-converted playback immediately in response to the input of the audio signal, initial values of the average value An and the standard deviation Sn are set and recorded in advance in a recording unit (not shown) or the like, and these statistical values are updated while being sequentially recorded in the recording unit or the like. The method of updating the average value and the standard deviation of the voice content rate is described below.

An initial value is set for the average value An of the voice content rate before updating begins. The average value An is then updated sequentially, starting from the initial value, each time an audio signal is input. The initial value depends on, for example, the genre of the program to be played back and is set appropriately for that genre. For example, for a television news program in which an announcer speaks frequently, the average voice content rate is about 85%. For a documentary program that makes heavy use of varied video scenes with few opportunities for a speaker to talk, the average voice content rate is about 50%.

Here, the predetermined length of the audio signal held in the input buffer is assumed to be, for example, the calculation frame length (n minutes) described above. The input buffer is assumed to record and update the audio signal one frame at a time, for example, while always retaining an amount of audio signal equal to the calculation frame length (n minutes). The voice information sequential update unit 22 is assumed to update the average value An each time, for example, the voice/non-voice discrimination unit 11 has discriminated the voice/non-voice sections for one frame. In this case, the average value An of the voice content rate is updated for each frame; the value of the sequentially updated average at the k-th frame (hereinafter, the updated average value of the voice content rate) is denoted An(k). The updated average value An(k) is expressed by Equation (8).

In Equation (8), α1 and β1 are parameters that determine the update speed of the updated average value An(k). The larger the value of α1, the greater the weight of the updated average value An(k−1) of the immediately preceding frame, and the more slowly An(k) is updated. The larger the value of β1, the greater the weight of the voice content rate Ris_n(k) of the k-th frame, and the faster An(k) is updated. As a numerical example, α1 = 0.98 and β1 = 0.02 may be used.
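Equation (8) itself is not reproduced in this text. A form consistent with the description of α1 and β1 is the weighted recursive update An(k) = α1·An(k−1) + β1·Ris_n(k); the sketch below assumes that form, and the function name is hypothetical.

```python
def update_mean(prev_mean, content_rate, alpha1=0.98, beta1=0.02):
    """One recursive update of the average voice content rate.

    Assumed form: An(k) = alpha1 * An(k-1) + beta1 * Ris_n(k).
    A larger alpha1 keeps more of the previous average (slower update);
    a larger beta1 gives more weight to the current frame (faster update).
    """
    return alpha1 * prev_mean + beta1 * content_rate
```

Starting from a genre-dependent initial value such as 0.85 and feeding in a programme whose content rate stays near 0.5, repeated calls move the average gradually toward 0.5, at a pace set by the two parameters.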

Like the average value of the voice content rate, the standard deviation Sn is given an initial value before updating begins, and it is then updated sequentially, frame by frame, starting from that initial value. As with the average value An, the initial value depends on, for example, the genre of the program to be played back and is set appropriately for that genre. Specifically, the standard deviation Sn is updated using the initial value, the updated average value An(k), and the voice content rate Ris_n(k) of the k-th frame. With the updated value of the standard deviation at the k-th frame denoted Sn(k), the updated value Sn(k) is expressed by Equation (9).

In Equation (9), α2 and β2 are parameters that determine the update speed of the updated value Sn(k) of the standard deviation. As a numerical example, α2 = 0.98 and β2 = 0.02 may be used.
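Equation (9) is likewise not reproduced in this text, and the description only states which quantities it uses. The sketch below therefore picks one plausible form, a recursively weighted variance whose square root is taken, purely as an assumption; the actual equation may differ, and the function name is hypothetical.

```python
import math

def update_std(prev_std, content_rate, updated_mean, alpha2=0.98, beta2=0.02):
    """One recursive update of the standard deviation (assumed form).

    Sn(k) = sqrt(alpha2 * Sn(k-1)**2 + beta2 * (Ris_n(k) - An(k))**2)
    where An(k) is the already updated average for frame k.
    """
    deviation = content_rate - updated_mean
    return math.sqrt(alpha2 * prev_std ** 2 + beta2 * deviation ** 2)
```

With this form, a frame whose content rate equals the current average shrinks the estimate slightly, and a frame far from the average enlarges it, which matches the role the standard deviation plays in the voice information.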

Next, as in the first embodiment, the speed ratio calculation unit 14 calculates the speed ratio SRs(k) of the voice section according to Equations (3) to (5), based on the voice content rate Ris_n(k) and on the updated average value An(k) and the updated standard deviation Sn(k), both of which are updated for each frame. The speed ratio calculation unit 14 also calculates the speed ratio SRns of the non-voice section based on the calculated speed ratio SRs(k) of the voice section and the target playback time ratio Rt. The speed ratio calculation unit 14 then assigns the speed ratio for each frame to the voice/non-voice section discrimination information input from the voice/non-voice discrimination unit 11 and outputs the result to the voice speed conversion unit 15. Based on the speed ratios of the voice section and the non-voice section calculated by the speed ratio calculation unit 14, the voice speed conversion unit 15 performs speed conversion processing, frame by frame, on the audio signal input from the input buffer 21.

As described above, the audio reproduction device 2 according to the present embodiment sequentially updates the average value and the standard deviation of the voice content rate, which are statistical values. The device can therefore perform speed conversion processing immediately in response to the input of the audio signal, without calculating the voice information for the entire program in advance.

In the audio reproduction device 2 described in the second embodiment, the voice/non-voice discrimination unit 11, the voice information sequential update unit 22, the speed ratio calculation unit 14, and the voice speed conversion unit 15 can be realized by an information processing apparatus, such as a general computer system, that takes an audio signal as input and outputs the audio signal speed-converted by the voice speed conversion unit 15. In this case, the present invention can be realized by storing a program that causes a computer to execute the operations described above on a predetermined information recording medium, and having the computer read and execute the program stored on that medium. The user enters the desired playback time and the initial values described above using an input unit, such as a keyboard, connected to the information processing apparatus. The input buffer 21 is configured, for example, on a hard disk in the information processing apparatus, but it may instead be configured in a memory in the information processing apparatus or on another recording medium outside the apparatus. The information recording medium that stores the program is, for example, a nonvolatile semiconductor memory such as a ROM or flash memory, or a CD-ROM, DVD, or similar optical disc recording medium. The program may also be supplied to the information processing apparatus via another medium or a communication line.

(Third Embodiment)
An audio recording/reproducing apparatus according to the third embodiment of the present invention will be described with reference to FIG. 9. FIG. 9 is a block diagram showing the configuration of the audio recording/reproducing apparatus 3 according to the third embodiment of the present invention. In FIG. 9, the audio recording/reproducing apparatus 3 includes a voice/non-voice discrimination unit 11, an information recording unit 31, a voice information calculation unit 12, a voice information recording unit 13, a speed ratio calculation unit 14, and a voice speed conversion unit 15.

The audio recording/reproducing apparatus 3 according to the present embodiment records audio in the information recording unit 31 and reproduces it. It is characterized in that, at the same time as it records the input audio signal in the information recording unit 31, it also records there the information on the voice sections and non-voice sections determined by the voice/non-voice discrimination unit 11. The following description focuses on this feature. The voice/non-voice discrimination unit 11, the voice information calculation unit 12, the voice information recording unit 13, the speed ratio calculation unit 14, and the voice speed conversion unit 15 are the same as in the first embodiment; they are given the same reference numerals, and their detailed description is omitted.

The audio signal to be recorded is input to both the voice/non-voice discrimination unit 11 and the information recording unit 31. The voice/non-voice discrimination unit 11 discriminates the voice sections and the non-voice sections in the input audio signal and outputs the resulting discrimination information to the information recording unit 31. The information recording unit 31 records both the input audio signal to be recorded and the voice/non-voice section discrimination information.

The voice information calculation unit 12 reads the voice/non-voice section information for the entire audio signal recorded in the information recording unit 31 and calculates the voice information. Specifically, the voice information calculation unit 12 calculates the voice content rate over the entire recorded audio signal and then calculates the average value and the standard deviation of the voice content rate. The voice content rate, its average value, and its standard deviation calculated by the voice information calculation unit 12 are recorded in the voice information recording unit 13.

At the reproduction stage, the speed ratio calculation unit 14 uses the voice information recorded in the voice information recording unit 13 to calculate, for each frame, the speed ratio of the voice section according to fluctuations in the voice content rate. The speed ratio calculation unit 14 also calculates the speed ratio of the non-voice section based on the speed ratio of the voice section and the target playback time ratio Rt. It then assigns the speed ratio for each frame to the recorded voice/non-voice section discrimination information and outputs the result to the voice speed conversion unit 15. Based on the speed ratios of the voice section and the non-voice section calculated by the speed ratio calculation unit 14, the voice speed conversion unit 15 performs speed conversion processing on the audio signal recorded in the information recording unit 31.

As described above, the audio recording/reproducing apparatus 3 according to the present embodiment records the input audio signal in the information recording unit 31 together with the information on the voice sections and non-voice sections determined by the voice/non-voice discrimination unit 11. Because the discrimination of voice and non-voice sections for the entire audio signal is already complete when the entire audio signal has been recorded, the time needed to calculate the voice information before reproduction can be shortened.

In the information recording unit 31 described above, the voice information calculated by the voice information calculation unit 12 may also be recorded, in addition to the discrimination information on the voice sections and non-voice sections determined by the voice/non-voice discrimination unit 11. In this case, the voice information recording unit 13 is omitted, as shown in FIG. 10. FIG. 10 is a block diagram showing the configuration of an audio recording/reproducing apparatus 4 that records both the voice/non-voice section information and the voice information in the information recording unit 31. In FIG. 10, the audio recording/reproducing apparatus 4 includes a voice/non-voice discrimination unit 11, an information recording unit 31, a voice information calculation unit 12, a speed ratio calculation unit 14, and a voice speed conversion unit 15.

In FIG. 10, the information recording unit 31 records the input audio signal to be recorded, the voice/non-voice section information determined by the voice/non-voice discrimination unit 11, and the voice information calculated by the voice information calculation unit 12. That is, in the audio recording/reproducing apparatus 4, the voice/non-voice section discrimination information and the voice information are recorded in the information recording unit 31 along with the audio signal. Consequently, once a playback time is entered after recording, the audio recording/reproducing apparatus 4 can calculate the speed ratios immediately and can therefore output speed-converted playback audio within a short time.

The voice/non-voice discrimination unit 11, the voice information calculation unit 12, the voice information recording unit 13, the speed ratio calculation unit 14, and the voice speed conversion unit 15 described in the third embodiment can be realized by an information processing apparatus, such as a general computer system, that takes an audio signal as input and outputs the audio signal speed-converted by the voice speed conversion unit 15. In this case, the present invention can be realized by storing a program that causes a computer to execute the operations described above on a predetermined information recording medium, and having the computer read and execute the program stored on that medium. The user enters the desired playback time using an input unit, such as a keyboard, connected to the information processing apparatus. The information recording unit 31 and the voice information recording unit 13 are configured, for example, on a hard disk in the information processing apparatus, but they may instead be configured in a memory in the information processing apparatus or on another recording medium outside the apparatus. The information recording medium that stores the program is, for example, a nonvolatile semiconductor memory such as a ROM or flash memory, or a CD-ROM, DVD, or similar optical disc recording medium. The program may also be supplied to the information processing apparatus via another medium or a communication line.

The voice/non-voice discrimination unit 11, the voice information calculation unit 12, the voice information recording unit 13, the speed ratio calculation unit 14, the voice information sequential update unit 22, and the voice speed conversion unit 15 described in the first to third embodiments can also be realized by an integrated circuit that takes, for example, an audio signal, playback time information, and the initial values described above as inputs and outputs the audio signal speed-converted by the voice speed conversion unit 15. In this case, the voice information recording unit 13 of the first embodiment, the input buffer 21 of the second embodiment, and the voice information recording unit 13 and the information recording unit 31 of the third embodiment are configured, for example, as memory within the integrated circuit. The present invention can then be realized by integrating the electric circuits that perform the functions described above into a single small package to form, for example, an audio signal processing circuit such as a DSP (Digital Signal Processor) that processes audio signals. The voice information recording unit 13 of the first embodiment, the input buffer 21 of the second embodiment, and the voice information recording unit 13 and the information recording unit 31 of the third embodiment may instead be configured on a recording medium separate from the integrated circuit.

The audio reproduction device, the audio recording/reproducing apparatus, their methods, the recording medium, and the integrated circuit according to the present invention perform optimum speed ratio control according to fluctuations in the voice content rate, and are useful for DVD players, HDD players, CD players, and the like that realize playback that is easier to listen to.

FIG. 1 is a block diagram showing the configuration of the audio reproduction device 1 according to the first embodiment of the present invention.
FIG. 2 is a diagram showing an example of calculating the voice content rate Ris_1(k) when the calculation frame length is 1 minute.
FIG. 3 is a diagram showing an example of calculating the voice content rate Ris_5(k) when the calculation frame length is 5 minutes.
FIG. 4 is a diagram showing an example of calculating the voice content rate Ris_10(k) when the calculation frame length is 10 minutes.
FIG. 5 is a diagram showing the calculated average value and standard deviation of the voice content rate for each calculation frame length.
FIG. 6 is a diagram showing an example of a speed ratio calculated using multiplexed voice information.
FIG. 7 is a flowchart showing the processing flow of the audio reproduction device 1 according to the present embodiment.
FIG. 8 is a block diagram showing the configuration of the audio reproduction device 2 according to the second embodiment of the present invention.
FIG. 9 is a block diagram showing the configuration of the audio recording/reproducing apparatus 3 according to the third embodiment of the present invention.
FIG. 10 is a block diagram showing the configuration of the audio recording/reproducing apparatus 4, which records the voice/non-voice section information and the voice information in the information recording unit 31.
FIG. 11 is a block diagram showing the configuration of a conventional audio reproduction device 9.

Explanation of symbols

1, 2 Audio reproduction device
3, 4 Audio recording/reproduction device
11 Voice/non-voice determination unit
12 Voice information calculation unit
13 Voice information recording unit
14 Speed ratio calculation unit
15 Voice speed conversion unit
21 Input buffer
22 Voice information sequential update unit
31 Information recording unit

Claims

An audio reproduction device that converts the reproduction speed set for an input audio signal from normal (1x) speed and reproduces the audio signal in a shortened reproduction time, comprising:
a determination unit that determines, for the audio signal, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation unit that calculates, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation unit that, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, sets the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and sets the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.
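
For illustration only, the relationship recited in this claim might be sketched in Python as follows. This is not the claimed implementation: the function names, the per-frame boolean flags, the `typical_rate` parameter used to decide what counts as "relatively high", and the linear rule with `gain` are all assumptions introduced here. The claim itself only requires that the voice-section speed ratio fall below the reference value for voice-dense periods and rise above it for voice-sparse periods.

```python
from typing import Sequence


def voice_content_rate(is_voice: Sequence[bool]) -> float:
    """Fraction of frames in the window that were judged to contain voice."""
    if not is_voice:
        raise ValueError("window must contain at least one frame")
    return sum(is_voice) / len(is_voice)


def voice_speed_ratio(rate: float, typical_rate: float,
                      reference: float = 1.5, gain: float = 1.0) -> float:
    """Hypothetical linear rule (an assumption, not taken from the claim).

    Returns a value below `reference` when `rate` exceeds `typical_rate`
    and a value above `reference` when `rate` is below `typical_rate`.
    """
    if reference < 1.0:
        raise ValueError("reference speed ratio must be 1 or more")
    return reference * (1.0 + gain * (typical_rate - rate))


# A window in which three of four frames contain voice.
dense = voice_content_rate([True, True, True, False])
print(voice_speed_ratio(dense, typical_rate=0.5))  # slower than the reference
print(voice_speed_ratio(0.25, typical_rate=0.5))   # faster than the reference
```

With this rule a window whose voice content rate equals `typical_rate` is reproduced at exactly the reference speed ratio.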

The audio reproduction device according to claim 1, wherein the speed ratio calculation unit sets the shortened reproduction time according to a user operation and, based on the calculated speed ratio of the voice section, calculates the speed ratio of the non-voice section so that the reproduction time of the audio signal equals the set reproduction time.
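
One way to picture this calculation, offered as a sketch under stated assumptions: if each voice section of duration d is reproduced at speed ratio r, it takes d / r seconds, and whatever remains of the target time must hold all non-voice material. The function name, the (duration, speed ratio) tuple layout, and the use of a single constant non-voice ratio are assumptions made for this example.

```python
from typing import Sequence, Tuple


def non_voice_speed_ratio(voice_segments: Sequence[Tuple[float, float]],
                          non_voice_seconds: float,
                          target_seconds: float) -> float:
    """Constant non-voice speed ratio that makes total playback hit the target.

    voice_segments holds (duration_seconds, speed_ratio) pairs for the voice
    sections, whose speed ratios have already been decided.
    """
    # Time the voice sections occupy once they are sped up.
    voice_playback = sum(duration / ratio for duration, ratio in voice_segments)
    remaining = target_seconds - voice_playback
    if remaining <= 0:
        raise ValueError("target time is too short for the chosen voice speed ratios")
    return non_voice_seconds / remaining


# 90 s of voice at ratio 1.5 plays in 60 s; 30 s of non-voice must fit in 10 s.
ratio = non_voice_speed_ratio([(60.0, 1.5), (30.0, 1.5)],
                              non_voice_seconds=30.0, target_seconds=70.0)
print(ratio)
```

The error branch reflects a practical limit: if the voice sections alone already exceed the requested time, no non-voice speed ratio can satisfy it.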

The audio reproduction device according to claim 2, wherein the speed ratio calculation unit calculates the speed ratio of the non-voice section as a constant value within the set reproduction time.

The audio reproduction device according to claim 1, wherein the predetermined time length includes one or more unit time lengths, and
the speed ratio calculation unit sets the speed ratio calculated for the predetermined time length as the speed ratio of the voice section in any one of the unit time lengths included in the predetermined time length.

The audio reproduction device according to claim 1, further comprising:
a buffer that records the audio signal while sequentially updating it so as to hold at least the predetermined time length of the input audio signal; and
a speed conversion unit that performs speed conversion processing on the audio signal recorded in the buffer and outputs the result, wherein
the determination unit determines the voice section and the non-voice section for the audio signal of the predetermined time length recorded in the buffer,
the voice information calculation unit further calculates a statistical value relating to the voice content rate as the voice information, and sequentially updates a previously stored statistical value every unit time,
the speed ratio calculation unit calculates the speed ratio of the voice section according to the statistical value updated every unit time and the voice content rate of the predetermined time length at the time of the update, and
the speed conversion unit sequentially performs speed conversion processing on the audio signal sequentially updated in the buffer, using the speed ratio of the voice section calculated every unit time.
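
The sequential update described in this claim could be sketched roughly as below. The class name `StreamingSpeedRatio`, its parameters, the use of a fixed-length `deque` as a stand-in for the buffer, the choice of mean and standard deviation as the statistical value, and the linear coefficient with `gain` are assumptions for illustration. Welford's online algorithm is used here as one possible way to update the stored statistics without revisiting past data; the claim does not name a particular update method.

```python
import math
from collections import deque


class StreamingSpeedRatio:
    """Per-unit-time speed ratio from a sliding window and running statistics."""

    def __init__(self, window_units: int, reference: float = 1.5, gain: float = 0.1):
        self.window = deque(maxlen=window_units)  # stands in for the buffer
        self.reference = reference
        self.gain = gain
        self.count = 0    # number of updates so far
        self.mean = 0.0   # running mean of the windowed voice content rate
        self.m2 = 0.0     # running sum of squared deviations (Welford)

    def update(self, unit_voice_fraction: float) -> float:
        """Push one unit time of data and return the voice-section speed ratio."""
        self.window.append(unit_voice_fraction)
        rate = sum(self.window) / len(self.window)

        # Welford update of the stored mean and variance.
        self.count += 1
        delta = rate - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (rate - self.mean)
        std = math.sqrt(self.m2 / self.count)

        if std == 0.0:
            return self.reference  # no variation observed yet
        z = (rate - self.mean) / std
        return self.reference * (1.0 - self.gain * z)


tracker = StreamingSpeedRatio(window_units=2)
for fraction in (0.5, 0.5, 1.0, 0.0):
    print(tracker.update(fraction))
```

Because the statistics only cover what has been seen so far, early outputs sit at the reference value and adapt as more of the signal arrives.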

The audio reproduction device according to claim 1, wherein the voice information calculation unit further calculates a statistical value relating to the voice content rate as the voice information, and
the speed ratio calculation unit calculates the speed ratio of the voice section according to the statistical value and the voice content rate.

The audio reproduction device according to claim 5 or 6, wherein the statistical value comprises an average value and a standard deviation of the voice content rate for each predetermined time length.

The audio reproduction device according to claim 7, wherein the speed ratio calculation unit calculates the speed ratio of the voice section by multiplying the reference value of the speed ratio by a coefficient corresponding to the deviation of the voice content rate in the predetermined time length with respect to the average value and the standard deviation.
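
As a hedged illustration of such a coefficient, the sketch below standardizes each frame's voice content rate against the average and standard deviation over the whole recording and turns the result into a multiplier on the reference value. The function name, the form `1 - gain * z`, and the default values are assumptions made here; the claim does not specify the shape of the coefficient.

```python
import statistics
from typing import List, Sequence


def voice_speed_ratios(rates: Sequence[float], reference: float = 1.5,
                       gain: float = 0.1) -> List[float]:
    """Voice-section speed ratio for each frame of a recording.

    rates: voice content rate of each frame (values between 0 and 1).
    """
    mean = statistics.fmean(rates)
    std = statistics.pstdev(rates)
    ratios = []
    for rate in rates:
        # Standardized deviation of this frame from the average.
        z = 0.0 if std == 0.0 else (rate - mean) / std
        coefficient = 1.0 - gain * z  # assumed form of the coefficient
        ratios.append(reference * coefficient)
    return ratios


# Voice-sparse, average, and voice-dense frames.
print(voice_speed_ratios([0.2, 0.5, 0.8]))
```

Frames with an average voice content rate receive a coefficient of 1, so they are reproduced at exactly the reference speed ratio.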

The audio reproduction device according to claim 8, wherein the voice information calculation unit sets a plurality of the predetermined time lengths, each having a different length, and calculates the voice content rate for each of them, and
the speed ratio calculation unit, for a unit time length shorter than every one of the predetermined time lengths, calculates the speed ratio of the voice section included in that unit time length by multiplying the reference value of the speed ratio by the sum of the coefficients corresponding to the voice content rates of the respective predetermined time lengths that commonly include that unit time length.
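
A rough sketch of combining several frame lengths follows. It assumes per-unit voice fractions as input, non-overlapping frames, equal weights chosen so that the coefficients sum to about 1, and the same `1 - gain * z` coefficient form used for illustration elsewhere. None of these choices come from the claim, which only states that the coefficients of the frames containing a given unit time are summed and multiplied by the reference value.

```python
import statistics
from typing import List, Sequence


def frame_z_scores(unit_fractions: Sequence[float], frame_units: int) -> List[float]:
    """Standardized voice content rate of each non-overlapping frame."""
    frames = [unit_fractions[i:i + frame_units]
              for i in range(0, len(unit_fractions), frame_units)]
    rates = [statistics.fmean(frame) for frame in frames]
    mean = statistics.fmean(rates)
    std = statistics.pstdev(rates)
    return [0.0 if std == 0.0 else (rate - mean) / std for rate in rates]


def multi_scale_speed_ratios(unit_fractions: Sequence[float],
                             frame_lengths: Sequence[int] = (1, 5, 10),
                             reference: float = 1.5,
                             gain: float = 0.1) -> List[float]:
    """Speed ratio per unit time from the summed coefficients of all frame lengths."""
    if not unit_fractions:
        return []
    weight = 1.0 / len(frame_lengths)  # assumed equal weighting
    totals = [0.0] * len(unit_fractions)
    for length in frame_lengths:
        z = frame_z_scores(unit_fractions, length)
        for i in range(len(unit_fractions)):
            # Unit i belongs to frame i // length at this frame length.
            totals[i] += weight * (1.0 - gain * z[i // length])
    return [reference * total for total in totals]


# Five voice-dense units followed by five voice-sparse units.
fractions = [0.9] * 5 + [0.1] * 5
print(multi_scale_speed_ratios(fractions, frame_lengths=(1, 5)))
```

Summing across frame lengths lets both short-term and long-term fluctuations in voice content influence the speed ratio of each unit time.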

An audio reproduction method for converting the reproduction speed set for an input audio signal from normal (1x) speed and reproducing the audio signal in a shortened reproduction time, comprising:
a determination step of determining, for the audio signal, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation step of calculating, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation step of, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, setting the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and setting the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.

A computer-readable recording medium on which is recorded an audio reproduction program to be executed by a computer that converts the reproduction speed set for an input audio signal from normal (1x) speed and reproduces the audio signal in a shortened reproduction time,
the program causing the computer to execute:
a determination step of determining, for the audio signal, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation step of calculating, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation step of, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, setting the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and setting the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.

An integrated circuit that converts the reproduction speed set for an input audio signal from normal (1x) speed to a higher speed, comprising:
a determination unit that determines, for the audio signal, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation unit that calculates, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation unit that, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, sets the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and sets the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.

An audio recording/reproduction device that converts the reproduction speed set for an input audio signal from normal (1x) speed and reproduces the audio signal in a shortened reproduction time, comprising:
an information recording unit that records the input audio signal;
a determination unit that determines, for the audio signal before it is recorded in the information recording unit, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation unit that calculates, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation unit that, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, sets the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and sets the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.

The audio recording/reproduction device according to claim 13, wherein the information recording unit records the determination result of the determination unit when the audio signal is recorded, and
the voice information calculation unit calculates the voice information based on the result recorded in the information recording unit.

The audio recording/reproduction device according to claim 13, wherein, when the audio signal is recorded in the information recording unit, the determination result of the determination unit and the voice information are also recorded, and
the speed ratio calculation unit calculates the speed ratio of the voice section using the voice information recorded in the information recording unit.

An audio recording/reproduction method for converting the reproduction speed set for an input audio signal from normal (1x) speed and reproducing the audio signal in a shortened reproduction time, comprising:
an information recording step of recording the input audio signal;
a determination step of determining, for the audio signal before it is recorded in the information recording step, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation step of calculating, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation step of, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, setting the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and setting the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.

A computer-readable recording medium on which is recorded an audio recording/reproduction program to be executed by a computer that converts the reproduction speed set for an input audio signal from normal (1x) speed and reproduces the audio signal in a shortened reproduction time,
the program causing the computer to execute:
an information recording step of recording the input audio signal in a recording unit;
a determination step of determining, for the audio signal before it is recorded in the recording unit, a voice section that contains voice and a non-voice section that does not contain voice;
a voice information calculation step of calculating, as voice information relating to the voice section and the non-voice section, at least a voice content rate indicating the proportion of a predetermined time length occupied by the voice section; and
a speed ratio calculation step of, taking as a reference value a speed ratio of 1 or more, the speed ratio being the ratio of speed conversion from the 1x reproduction speed, setting the speed ratio of the voice section in the predetermined time length smaller than the reference value when the voice content rate of the predetermined time length is relatively high, and setting the speed ratio of the voice section in the predetermined time length larger than the reference value when the voice content rate of the predetermined time length is relatively low.