WO2006077626A1

WO2006077626A1 - Speech speed changing method, and speech speed changing device

Info

Publication number: WO2006077626A1
Application number: PCT/JP2005/000549
Authority: WO
Inventors: Hitoshi Sasaki; Hiroshi Katayama; Rika Nishiike
Original assignee: Fujitsu Limited
Priority date: 2005-01-18
Filing date: 2005-01-18
Publication date: 2006-07-27
Also published as: US7912710B2; EP1840877A1; JPWO2006077626A1; EP1840877A4; US20070265839A1; JP4630876B2

Abstract

A speech speed changing method is provided that stores an input voice signal in a buffer and changes the speech speed by leaving unchanged or extending the voice signal read from the buffer in a voiced section, where the power of the input voice signal exceeds a threshold value, and by leaving unchanged, compressing, or deleting the voice signal read from the buffer in a silent section. In this method, the speech head protection section set prior to a voiced section is set to the accumulated amount of the buffer, limited by a predetermined limit value. If a voiced section is present within the speech head protection section, compression or deletion of the voice signal is inhibited, or the compression ratio is adjusted, to protect the speech head. This minimizes the delay while reducing the occurrence of speech head cut-off.

Description

Specification

Speech speed conversion method and speech speed converter

Technical field

[0001] The present invention relates to a speech speed conversion method and a speech speed conversion apparatus, and more particularly to a speech speed conversion method and a speech speed conversion apparatus that convert the voice reproduction speed without changing the pitch of the sound.

Background art

[0002] Conventionally, techniques have been proposed that make the content of a conversation easier to hear by slowing down the voice reproduction speed, that is, the speech speed, without changing the pitch of the other party's voice. However, if the speech speed is simply slowed down, a delay corresponding to the slowdown occurs.

[0003] To solve this problem, techniques have been proposed that eliminate the delay by shortening silent intervals (intervals without sound such as a human voice) that occur in the middle of a conversation, or by increasing the speech speed in silent intervals.

[0004] FIG. 1 shows a block diagram of an example of a conventional speech speed conversion device. In the figure, a digital audio signal is input to a terminal 10 in frames of 20 ms each and is supplied to a sound/silence determination unit 11 and a speech speed conversion unit 12.

[0005] The sound/silence determination unit 11 learns the noise level during, for example, the initial silence before the start of an utterance, sets the learned silence level plus, for example, 4 dB as the sound threshold, and compares the input audio signal with the sound threshold. A section in which the audio signal is equal to or higher than the sound threshold is determined to be a sound determination section, and the determination result is supplied to the speech speed determination unit 13.
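The threshold comparison described above can be illustrated with a minimal Python sketch. The text gives only the 4 dB margin and the 20 ms frame length; the function names, the floating-point sample format, the averaging of dB values used to learn the noise level, and the 160-sample frame in the usage note (20 ms at an assumed 8 kHz sampling rate) are all assumptions made for illustration.

```python
import math

MARGIN_DB = 4.0  # example margin above the learned noise level, from the text


def frame_power_db(samples):
    """Mean power of one frame in dB; samples are floats in [-1.0, 1.0]."""
    mean_square = sum(s * s for s in samples) / len(samples)
    # The floor avoids log10(0) for an all-zero frame (assumed, not from the text).
    return 10.0 * math.log10(max(mean_square, 1e-12))


def learn_noise_level_db(silent_frames):
    """Average the dB power of frames assumed to contain only noise."""
    levels = [frame_power_db(frame) for frame in silent_frames]
    return sum(levels) / len(levels)


def is_sound(samples, noise_level_db, margin_db=MARGIN_DB):
    """True when the frame power reaches the sound threshold."""
    return frame_power_db(samples) >= noise_level_db + margin_db
```

For instance, after learning from a few frames of constant amplitude 0.01 (about -40 dB), a frame of amplitude 0.1 is classified as sound, while a frame of amplitude 0.012 stays below the threshold and is classified as silence.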

[0006] The speech speed determination unit 13 is supplied with the accumulation amount (number of accumulated frames) from the input accumulation amount calculation unit 14 and is configured with a speech head protection section (a fixed number of frames). It determines the speech speed according to the accumulation amount and the speech head protection section, and supplies this speech speed to the speech speed conversion unit 12 and the input accumulation amount calculation unit 14.

[0007] The speech speed conversion unit 12 writes the input audio signal into a buffer, reads the audio signal from the buffer according to the speech speed from the speech speed determination unit 13, and outputs it from the terminal 15. Based on the speech speed from the speech speed determination unit 13, the input accumulation amount calculation unit 14 calculates the accumulation amount stored in the buffer of the speech speed conversion unit 12 and supplies it to the speech speed determination unit 13.

[0008] FIG. 2 shows the speech speed determination table of the speech speed determination unit 13. In a voiced section, the speech speed is 0.5x (2x expansion); however, if the processing delay time is 1 second (= 50 frames) or more, the speech speed is 1x. In the speech head protection section, that is, if there is a sound determination section within the next 3 frames, the speech speed is 1x. In the speech ending protection section, that is, if there is a sound determination section within the past 10 frames, the speech speed is 1x. In the pause holding section, that is, within 10 frames after the end of the speech ending protection section, the speech speed is 1x. In the silence deletion section, that is, outside the above sections, the audio signal is deleted to close up the silence; however, if there is no processing delay time, the speech speed is 1x.

[0009] Note that Patent Document 1 describes converting the speech speed so that the beginning of a speech section sandwiched between non-speech sections of at least a certain length is played slower than a predetermined playback speed and the speed gradually returns to the predetermined playback speed toward the end of the section.

Patent Document 1: Japanese Patent Laid-Open No. 2001-222300

Disclosure of the invention

Problems to be solved by the invention

[0010] However, when performing the process of shortening silent sections or increasing the speech speed in silent sections, the accuracy of the sound/silence determination must be considered. For example, in a noisy environment, the sound/silence determination may be erroneous. In a noise-free environment, sound and silence are determined relatively accurately even at the beginning and end of an utterance. In a noisy environment, however, the noise level may be close to or exceed the power at the beginning or end of an utterance, in which case the beginning or end of the utterance is buried in noise.

[0011] For this reason, it becomes difficult to determine sound and silence accurately in a noisy environment.

For example, in a noisy environment, parts with low voice power, such as the beginning and end of an utterance and unvoiced consonants, are more likely to be misjudged as silence even though they belong to a voiced section.

[0012] When a process of closing a silent section or a process of increasing the speech speed is executed based on such a misjudgment, problems such as occurrence of sound interruption and excessive reduction in the duration of silence occur.

[0013] Fig. 3 (A) shows the approximate time variation of the input audio signal power (volume) with a solid line. Noise of steady power is superimposed on the audio signal, and the noise level plus 4 dB is set as the sound threshold. The determination results for each section are shown in the lower part of Fig. 3 (A). However, for the speech head protection section only the speech head portion is shown, and for the speech ending protection section only the speech ending portion is shown. The first, second, fifth, and sixth voices from the left are determined to be voiced sections. The third and fourth voices are determined to be silent sections because they are buried in noise.

[0014] The third voice is preserved by the speech ending protection. For the fourth voice, however, the fixed speech head protection section is short, so the speech head is cut off. Fig. 3 (B) shows the audio signal power after speech speed conversion.

[0015] Section (1) in Fig. 3 (B): Assume that there are already 10 frames of processing delay (input accumulation) in the speech speed conversion at the start.

[0016] Section (2), Section (3): Since the first and second voices are determined to be sound, they are expanded by a factor of 2 (1/2 speed). Between sections (2) and (3), the output is at 1x speed because of the speech head protection and the speech ending protection.

[0017] Section (4): The third voice is determined to be silent, but it is output at 1x speed because it falls within the speech ending protection section and the pause holding section. The subsequent silence is output at 1x speed while in the pause holding section and is deleted thereafter.

[0018] Section (5): The fourth voice is determined to be silent, and only part of its head is protected. Since there is sufficient speech speed conversion delay (input accumulation amount) at this point, only the protected section is output at 1x speed and the rest is deleted, so the speech head is cut off.

[0019] Section (6): Since the fifth voice is determined to be sound, it is expanded by a factor of 2.

[0020] Conventionally, a fixed-length speech head protection section is set for speech head protection, so a delay corresponding to the speech head protection section must be inserted (added). For stored audio, such as recorded messages on a telephone, a sufficiently long protection section can be set. In real-time calls, however, the delay must be minimized, so a sufficiently long speech head protection section cannot be set, and the speech head may be cut off.

[0021] The present invention has been made in view of the above points, and its general object is to provide a speech speed conversion method and a speech speed conversion device that can minimize the delay and reduce the occurrence of speech head cut-off.

Means for solving the problem

[0022] In order to achieve this object, the present invention provides a speech speed conversion method that stores an input audio signal in a buffer and converts the speech speed by leaving unchanged or expanding the audio signal read from the buffer in a voiced section, where the power of the input audio signal exceeds a threshold, and by leaving unchanged, compressing, or deleting the audio signal read from the buffer in a silent section. In this method, the speech head protection section set prior to the voiced section is set to the accumulated amount of the buffer, limited by a predetermined limit value, and speech head protection is performed by prohibiting compression or deletion of the audio signal, or by adjusting the compression rate, if the voiced section is present within the speech head protection section.

The invention's effect

[0023] According to such a speech speed conversion method, it is possible to minimize the delay and reduce the occurrence of speech head cut-off.

Brief Description of Drawings

FIG. 1 is a block diagram of an example of a conventional speech speed conversion device.

FIG. 2 is a diagram showing a speech speed determination table of a speech speed determination unit of a conventional speech speed conversion device.

FIG. 3 is a diagram showing conventional input voice signal power and voice signal power after speech speed conversion.

FIG. 4 is a block diagram of the first embodiment of the speech speed converting apparatus of the present invention.

FIG. 5 is a diagram showing a speech speed determination table of a speech speed determination unit in the first embodiment.

FIG. 6 is a diagram showing the input voice signal power and the voice signal power after speech speed conversion according to the present invention.

FIG. 7 is a diagram showing a voice / silence determination table of a voice / silence determination unit in the second embodiment.

FIG. 8 is a diagram showing a speech speed determination table of a speech speed determination unit in the second embodiment.

FIG. 9 is a block diagram of a third embodiment of the speech speed converting apparatus of the present invention.

FIG. 10 is a diagram showing a speech speed determination table of a speech speed determination unit in the fourth embodiment.

Explanation of symbols

[0025] 20, 26 Terminals

21 Sound/silence determination unit

22 Speech speed conversion unit

23 Speech speed determination unit

24 Input accumulation amount calculation unit

25, 31 Speech head protection section determination unit

30 Estimated SNR determination unit

BEST MODE FOR CARRYING OUT THE INVENTION

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 4 shows a block diagram of the first embodiment of the speech speed conversion device of the present invention. In the figure, a digital audio signal is input to the terminal 20 in frames of 20 ms each and is supplied to the sound/silence determination unit 21 and the speech speed conversion unit 22.

[0027] The sound/silence determination unit 21 learns the noise level during, for example, the initial silence before the start of an utterance, sets the learned silence level plus, for example, 4 dB as the sound threshold, determines a section in which the input audio signal exceeds the sound threshold to be a sound determination section, and supplies the determination result to the speech speed determination unit 23. For simplicity, the sound determination here uses only power (volume), but it may also use a feature quantity such as frequency characteristics. In addition, a fixed value may be used as the sound threshold.

[0028] The speech speed determination unit 23 is supplied with the accumulation amount (number of accumulated frames) from the input accumulation amount calculation unit 24 and with the speech head protection section (a variable number of frames) from the speech head protection section determination unit 25. It determines the speech speed according to the sound determination result, the accumulation amount, and the speech head protection section, and supplies this speech speed to the speech speed conversion unit 22 and the input accumulation amount calculation unit 24.

[0029] The speech speed conversion unit 22 writes the input audio signal into the buffer, reads the audio signal from the buffer according to the speech speed from the speech speed determination unit 23, and outputs it from the terminal 26. In a deletion section, the data is simply discarded. To slow down the speech speed, each frame is divided into, for example, about 4 subframes, and each subframe is played repeatedly according to the expansion ratio. For 2x expansion, each subframe is played twice. For 1.5x expansion, odd subframes are played once and even subframes are played twice. As described in Japanese Patent No. 3147562, a commonly used method shifts the joining position based on information such as correlation so that the subframes connect smoothly.
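The subframe repetition described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the table of play counts, and the requirement that the frame length be divisible by the number of subframes are hypothetical, and the correlation-based smoothing of the joins mentioned in the text is deliberately omitted.

```python
# (odd, even) play counts per subframe, keyed by the expansion ratio.
# Subframes are numbered from 1, as in the text.
PLAY_COUNTS = {
    2.0: (2, 2),  # every subframe played twice
    1.5: (1, 2),  # odd subframes once, even subframes twice
    1.0: (1, 1),  # unchanged
}


def expand_frame(frame, ratio, n_sub=4):
    """Expand one frame (a list of samples) by repeating subframes."""
    if ratio not in PLAY_COUNTS:
        raise ValueError("unsupported expansion ratio")
    if len(frame) % n_sub != 0:
        raise ValueError("frame length must be divisible by n_sub")
    size = len(frame) // n_sub
    odd_count, even_count = PLAY_COUNTS[ratio]
    out = []
    for index in range(n_sub):
        sub = frame[index * size:(index + 1) * size]
        count = odd_count if (index + 1) % 2 == 1 else even_count
        out.extend(sub * count)  # plain repetition, no smoothing of the joins
    return out
```

With an 8-sample frame and 4 subframes, a ratio of 1.5 yields 12 samples and a ratio of 2.0 yields 16, which matches the stated expansion ratios.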

[0030] Note that the speech speed conversion unit 22 may compress the audio signal by increasing the speech speed instead of deleting it. To compress to 2x speed, for example, odd subframes are played once and even subframes are deleted.

[0031] The input accumulation amount calculation unit 24 calculates the accumulation amount stored in the buffer of the speech speed conversion unit 22 based on the speech speed from the speech speed determination unit 23, and supplies it to the speech speed determination unit 23 and the speech head protection section determination unit 25. Specifically, when frames are deleted, the accumulation amount and the delay decrease by the number of deleted frames, and when the speech speed is set to 0.5x, the accumulation amount increases by 20 ms per frame. The updated accumulation amount is used to determine the speech speed of the next frame.

[0032] The speech head protection section determination unit 25 determines the speech head protection section (a variable number of frames) according to the accumulation amount. For example, if the accumulation amount (corresponding to the delay in speech speed conversion) is 10 frames or less, the accumulation amount (number of accumulated frames) is used as the speech head protection section. If the accumulation amount is 10 frames or more, the speech head protection section is set to 10 frames.
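The bookkeeping of the accumulation amount and the resulting speech head protection length can be sketched as below. The sketch counts in whole frames (one frame being 20 ms) and represents deletion by a speed of None; both conventions, and the function names, are assumptions made for illustration.

```python
HEAD_PROTECT_LIMIT = 10  # limit value in frames, from the example in the text


def update_accumulation(accum_frames, speed):
    """Return the accumulation amount after processing one frame.

    speed is 0.5 (2x expansion), 1.0 (unchanged) or None (frame deleted).
    """
    if speed is None:
        return max(0, accum_frames - 1)  # one deleted frame removes one frame of delay
    if speed == 0.5:
        return accum_frames + 1          # one expanded frame adds 20 ms of delay
    return accum_frames


def head_protection_frames(accum_frames, limit=HEAD_PROTECT_LIMIT):
    """Speech head protection length: the accumulation amount, capped at the limit."""
    return min(accum_frames, limit)
```

Because the protection length never exceeds the delay that has already built up, protecting the speech head in this way does not require adding any further delay.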

[0033] FIG. 5 shows the speech speed determination table of the speech speed determination unit 23 in the first embodiment. In a voiced section, the speech speed is 0.5x (2x expansion). However, if the processing delay time is 1 second (= 50 frames) or more, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0034] In the speech head protection section, that is, if there is a sound determination section within the number of frames determined by the speech head protection section determination unit 25, deletion of the audio signal is prohibited and the speech speed is set to 1x. Note that the compression rate may be adjusted instead of prohibiting deletion.

[0035] In the speech ending protection section, that is, if there is a sound determination section within the past 10 frames, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0036] In the pause holding section, that is, for N frames after the end of the speech ending protection section, deletion of the audio signal is prohibited and the speech speed is set to 1x. Here, N = 13 - (speech head protection section), where the upper limit of N is 10 frames and the lower limit is 5 frames.

[0037] The silence deletion section is any section other than the above; the audio signal is deleted when there is a processing delay time. When there is no processing delay time, the speech speed is set to 1x.
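The determination table of the first embodiment can be approximated by the following Python sketch. It is not the table of FIG. 5 itself: the function names, the argument conventions, the use of None to mean deletion, and the order in which overlapping sections are checked are assumptions made for illustration. The caller is assumed to supply the distance to the next voiced frame (found by looking ahead in the buffered frames) and the distance from the last voiced frame.

```python
MAX_DELAY_FRAMES = 50       # 1 second of processing delay, from the text
ENDING_PROTECT_FRAMES = 10  # speech ending protection length, from the text


def pause_holding_frames(head_protect):
    """N = 13 - head protection length, clamped to the range 5..10 frames."""
    return max(5, min(10, 13 - head_protect))


def decide_speed(voiced, next_voiced_in, since_voiced, accum, head_protect):
    """Return (section, speed); a speed of None means the frame is deleted.

    voiced: the current frame is a sound determination frame
    next_voiced_in: frames until the next voiced frame, or None if none is known
    since_voiced: frames since the last voiced frame, or None if none yet
    accum: accumulated (delayed) frames in the buffer
    head_protect: speech head protection length in frames
    """
    if voiced:
        return ("voiced", 1.0 if accum >= MAX_DELAY_FRAMES else 0.5)
    if next_voiced_in is not None and next_voiced_in <= head_protect:
        return ("head_protection", 1.0)
    if since_voiced is not None:
        if since_voiced <= ENDING_PROTECT_FRAMES:
            return ("ending_protection", 1.0)
        hold = pause_holding_frames(head_protect)
        if since_voiced <= ENDING_PROTECT_FRAMES + hold:
            return ("pause_holding", 1.0)
    return ("silence_deletion", None if accum > 0 else 1.0)
```

With a head protection length of 10 frames the pause holding section shrinks to 5 frames, so the total number of frames kept at 1x speed around an utterance stays roughly constant as the head protection grows.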

[0038] Fig. 6 (A) shows the approximate time variation of the input audio signal power (volume) with a solid line. Noise of steady power is superimposed on the audio signal, and the noise level plus 4 dB is set as the sound threshold. The determination results for each section are shown in the lower part of Fig. 6 (A). However, for the speech head protection section only the speech head portion is shown, and for the speech ending protection section only the speech ending portion is shown. The first, second, fifth, and sixth voices from the left are determined to be voiced sections. The third and fourth voices are buried in noise and are determined to be silent sections.

FIG. 6 (B) shows the audio signal power after the speech speed conversion.

[0040] Section (1) in Fig. 6 (B): Assume that there are already 10 frames of processing delay (input accumulation) in the speech speed conversion at the start.

[0041] Section (2), Section (3): Since the first and second voices are determined to be voiced sections, they are expanded by a factor of 2 (1/2 speed). Between sections (2) and (3), the output is at 1x speed because of the speech head protection and the speech ending protection.

[0042] Section (4): In the silent section following the third voice, deletion starts earlier than in the conventional device by the amount that the pause holding section (1x speed) is shortened.

[0043] Section (5): For the fourth voice, the speech head cut-off is eliminated because the speech head protection section is longer.

[0044] Section (6): Since the fifth voice is determined to be sound, it is expanded by a factor of 2.

[0045] Silent sections need to be shortened only when a delay has occurred, that is, when unprocessed audio signal data has accumulated. Therefore, by setting the speech head protection section according to the buffer accumulation amount of the speech speed conversion unit 22 and limiting it to a predetermined value, speech head protection can be implemented without increasing the delay. In addition, by varying the pause holding section according to the speech head protection section, more accurate speech head protection than before can be achieved without increasing the delay when the buffer accumulation amount is large.

In the second embodiment, the operations of the sound/silence determination unit 21 and the speech speed determination unit 23 shown in the block diagram of FIG. 4 differ from those of the first embodiment, so the operations of the sound/silence determination unit 21 and the speech speed determination unit 23 are described below.

FIG. 7 shows the sound/silence determination table of the sound/silence determination unit 21 in the second embodiment. The sound/silence determination unit 21 learns the noise level during, for example, the initial silence before the start of an utterance, sets the learned silence level plus, for example, 4 dB as the sound threshold, and sets the learned silence level plus 1 dB as the silence certainty determination value.

[0047] The sound/silence determination unit 21 determines a section in which the input audio signal is equal to or greater than the sound threshold to be a sound determination section. If the input audio signal is equal to or less than the sound threshold and equal to or greater than the silence certainty determination value, the section is determined to be a silent section with low certainty. If the input audio signal is equal to or less than the silence certainty determination value, the section is determined to be a silent section with high certainty. The determination result is supplied to the speech speed determination unit 23.
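The three-way determination of the second embodiment can be sketched as follows. The function name and the string labels are hypothetical. The text leaves the exact boundary cases open (a power exactly equal to a threshold satisfies two conditions at once), so the sketch assumes that the higher category wins at equality.

```python
SOUND_MARGIN_DB = 4.0      # sound threshold = noise level + 4 dB (from the text)
CERTAINTY_MARGIN_DB = 1.0  # silence certainty value = noise level + 1 dB (from the text)


def classify_frame(power_db, noise_db):
    """Classify one frame by its power relative to the learned noise level."""
    if power_db >= noise_db + SOUND_MARGIN_DB:
        return "sound"
    if power_db >= noise_db + CERTAINTY_MARGIN_DB:
        # Above the certainty value but below the sound threshold:
        # possibly speech buried in noise.
        return "silence_low_certainty"
    return "silence_high_certainty"
```

With a learned noise level of -40 dB, a frame at -38 dB is classified as silence with low certainty, while a frame at -39.5 dB is classified as silence with high certainty.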

[0048] FIG. 8 shows the speech speed determination table of the speech speed determination unit 23 in the second embodiment. In a voiced section, the speech speed is 0.5x (2x expansion). However, if the processing delay time is 1 second (= 50 frames) or more, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0049] In the speech head protection section, that is, when there is a sound determination section within the number of frames determined by the speech head protection section determination unit 25, or when the number of frames determined by the speech head protection section determination unit 25 is less than 10 frames and there is a silent section with low certainty, deletion of the audio signal is prohibited and the speech speed is set to 1x. Note that the compression rate may be adjusted instead of prohibiting deletion.

[0050] In the speech ending protection section, that is, when there is a sound determination section within the past 10 frames, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0051] In the pause holding section, that is, for 10 frames after the end of the speech ending protection section, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0052] The silence deletion section is any section other than the above; the audio signal is deleted when there is a processing delay time. When there is no processing delay time, the speech speed is set to 1x.

[0053] In this way, when the speech head protection section is less than 10 frames, the current frame is made subject to deletion, rather than 1x-speed output, only when its silence certainty is high. This reduces the problem that the speech head is easily cut off when the speech head protection section is relatively short.
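The additional protection condition of the second embodiment can be sketched as below. The function name, the argument conventions, and the string label are hypothetical, and the sketch assumes that the low-certainty condition refers to the current frame.

```python
FULL_HEAD_PROTECT = 10  # full protection length in frames, from the text


def head_protected(next_voiced_in, head_protect, frame_class):
    """True if the current silent frame must be kept at 1x speed.

    next_voiced_in: frames until the next voiced frame, or None if none is known
    head_protect: speech head protection length in frames
    frame_class: "silence_low_certainty" or "silence_high_certainty"
    """
    if next_voiced_in is not None and next_voiced_in <= head_protect:
        return True
    # Short protection section: also keep frames that may be speech buried in noise.
    return head_protect < FULL_HEAD_PROTECT and frame_class == "silence_low_certainty"
```

Under this reading, a frame with low silence certainty is only ever deleted once the protection section has reached its full length of 10 frames.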

FIG. 9 shows a block diagram of a third embodiment of the speech speed conversion device of the present invention. In the figure, the same parts as those in FIG. 4 are denoted by the same reference numerals.

In FIG. 9, a digital audio signal is input to the terminal 20 in frames of 20 ms each and is supplied to the sound/silence determination unit 21, the speech speed conversion unit 22, and the estimated SNR determination unit 30.

[0055] The sound/silence determination unit 21 learns the noise level during, for example, the initial silence before the start of an utterance, sets the learned silence level plus, for example, 4 dB as the sound threshold, determines a section in which the input audio signal exceeds the sound threshold to be a sound determination section, and supplies the determination result to the speech speed determination unit 23. For simplicity, the sound determination here uses only power (volume), but it may also use a feature quantity such as frequency characteristics, and a fixed value may be used as the sound threshold.

[0056] The estimated SNR determination unit 30 estimates the SNR (signal-to-noise ratio) and determines whether the estimated SNR is high or low. As an example of the estimation and determination method, the difference between the maximum power (volume) and the minimum power over the past 30 seconds is obtained; if the difference exceeds a threshold (for example, 15 dB), the estimated SNR is regarded as high, and if the difference is equal to or less than the threshold, the estimated SNR is regarded as low.

[0057] The speech speed determination unit 23 is supplied with the accumulation amount (number of accumulated frames) from the input accumulation amount calculation unit 24 and with the speech head protection section (a variable number of frames) from the speech head protection section determination unit 31. It determines the speech speed according to the sound determination result, the accumulation amount, and the speech head protection section, and supplies this speech speed to the speech speed conversion unit 22 and the input accumulation amount calculation unit 24.

[0058] The speech speed conversion unit 22 writes the input audio signal into the buffer, reads the audio signal from the buffer according to the speech speed from the speech speed determination unit 23, and outputs it from the terminal 26. In a deletion section, the data is simply discarded. To slow down the speech speed, each frame is divided into, for example, about 4 subframes, and each subframe is played repeatedly according to the expansion ratio. For 2x expansion, each subframe is played twice. For 1.5x expansion, odd subframes are played once and even subframes are played twice.

[0059] The input accumulation amount calculation unit 24 calculates the accumulation amount stored in the buffer of the speech speed conversion unit 22 based on the speech speed from the speech speed determination unit 23, and supplies it to the speech speed determination unit 23 and the speech head protection section determination unit 31. Specifically, when frames are deleted, the accumulation amount and the delay decrease by the number of deleted frames, and when the speech speed is set to 0.5x, the accumulation amount increases by 20 ms per frame. The updated accumulation amount is used to determine the speech speed of the next frame.

[0060] The speech head protection section determination unit 31 determines the speech head protection section (a variable number of frames) according to the accumulation amount and the estimated SNR. For example, when the estimated SNR is low, if the accumulation amount (corresponding to the delay in speech speed conversion) is 10 frames or less, the accumulation amount (number of accumulated frames) is used as the speech head protection section. If the accumulation amount is 10 frames or more, the speech head protection section is set to 10 frames.

[0061] When the estimated SNR is high, if the accumulation amount is 3 frames or less, the accumulation amount (number of accumulated frames) is used as the speech head protection section. If the accumulation amount is 3 frames or more, the speech head protection section is set to 3 frames.

[0062] In the present embodiment, when the estimated SNR is high, there is less risk of erroneously determining the speech head to be silent, so setting an excessively long protection section can be avoided.
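The SNR-dependent limit of the third embodiment can be sketched as follows. The class and function names are hypothetical, and the estimator is assumed to receive one power value in dB per 20 ms frame, so that a 30-second window corresponds to 1500 frames.

```python
from collections import deque

WINDOW_FRAMES = 1500       # 30 s of 20 ms frames
SNR_THRESHOLD_DB = 15.0    # example threshold from the text
LIMIT_HIGH_SNR = 3         # protection limit in frames when the SNR is high
LIMIT_LOW_SNR = 10         # protection limit in frames when the SNR is low


class SnrEstimator:
    """Judges the SNR as high or low from the recent spread of frame power."""

    def __init__(self, window_frames=WINDOW_FRAMES, threshold_db=SNR_THRESHOLD_DB):
        self.history = deque(maxlen=window_frames)  # oldest values drop out
        self.threshold_db = threshold_db

    def update(self, power_db):
        """Add one frame's power and return True if the estimated SNR is high."""
        self.history.append(power_db)
        spread = max(self.history) - min(self.history)
        return spread > self.threshold_db


def head_protection_frames(accum_frames, snr_high):
    """Protection length: the accumulation amount, capped by the SNR-dependent limit."""
    limit = LIMIT_HIGH_SNR if snr_high else LIMIT_LOW_SNR
    return min(accum_frames, limit)
```

For example, with 8 accumulated frames the protection length is 3 frames when the SNR is judged high and 8 frames when it is judged low. Scanning the whole window for the maximum and minimum on every frame is simple but not the cheapest approach; it is used here only to keep the sketch short.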

In the fourth embodiment, the operations of the sound/silence determination unit 21 and the speech speed determination unit 23 shown in the block diagram of FIG. 4 differ from those of the third embodiment, so the operations of the sound/silence determination unit 21 and the speech speed determination unit 23 are described below.

The sound/silence determination table of the sound/silence determination unit 21 in the fourth embodiment is as shown in FIG. 7. The sound/silence determination unit 21 learns the noise level during, for example, the initial silence before the start of an utterance, sets the learned silence level plus, for example, 4 dB as the sound threshold, and sets the learned silence level plus 1 dB as the silence certainty determination value.

[0064] The sound/silence determination unit 21 determines a section in which the input audio signal is equal to or greater than the sound threshold to be a sound determination section. If the input audio signal is equal to or less than the sound threshold and equal to or greater than the silence certainty determination value, the section is determined to be a silent section with low certainty. If the input audio signal is equal to or less than the silence certainty determination value, the section is determined to be a silent section with high certainty. The determination result is supplied to the speech speed determination unit 23.

[0065] FIG. 10 shows the speech speed determination table of the speech speed determination unit 23 in the fourth embodiment. In a voiced section, the speech speed is 0.5x (2x expansion). However, if the processing delay time is 1 second (= 50 frames) or more, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0066] In the speech head protection section, that is, when there is a sound determination section within the number of frames determined by the speech head protection section determination unit 25, deletion of the audio signal is prohibited and the speech speed is set to 1x. However, if the current frame and the following three frames are all silent sections with high certainty, speech head protection is not performed.

[0067] In the speech ending protection section, that is, if there is a sound determination section within the past 10 frames, deletion of the audio signal is prohibited and the speech speed is set to 1x. Note that the compression rate may be adjusted instead of prohibiting deletion.

[0068] In the pause holding section, that is, for 10 frames after the end of the speech ending protection section, deletion of the audio signal is prohibited and the speech speed is set to 1x.

[0069] The silence deletion section is any section other than the above; the audio signal is deleted when there is a processing delay time. When there is no processing delay time, the speech speed is set to 1x.

[0070] In the present embodiment, when the silence certainty of the current frame and the following three frames is high, there is little possibility that the speech head has been erroneously determined to be silent, so setting an excessively long protection section can be avoided.
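The exception of the fourth embodiment can be sketched in Python as below. The function name, the argument conventions, and the string label are hypothetical. The sketch also assumes that protection is kept whenever fewer than four frame classifications are available.

```python
HIGH = "silence_high_certainty"


def head_protection_applies(frame_classes, next_voiced_in, head_protect):
    """Decide whether speech head protection applies to the current frame.

    frame_classes: classification of the current frame followed by later frames
    next_voiced_in: frames until the next voiced frame, or None if none is known
    head_protect: speech head protection length in frames
    """
    if next_voiced_in is None or next_voiced_in > head_protect:
        return False  # not inside the speech head protection section
    window = frame_classes[:4]  # the current frame and the following three
    confident_silence = len(window) == 4 and all(c == HIGH for c in window)
    return not confident_silence
```

When all four frames are silent with high certainty, the frame is left to the remaining rules of the table, so clearly silent stretches can still be deleted even inside the protection section.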

[0071] Note that the speech head protection section determination units 25 and 31 correspond to the speech head protection section determination means described in the claims, the speech speed determination unit 23 corresponds to the speech head protection means and the pause holding section setting means, the sound/silence determination unit 21 corresponds to the silence certainty determination means, and the estimated SNR determination unit 30 corresponds to the signal-to-noise ratio estimation means.

Claims

The scope of the claims

[1] A speech speed conversion method that accumulates an input audio signal in a buffer and converts the speech speed by leaving unchanged or expanding the audio signal read from the buffer in a voiced section, in which the power of the input audio signal exceeds a threshold, and by leaving unchanged, compressing, or deleting the audio signal read from the buffer in a silent section, wherein

a speech head protection section set prior to the voiced section is set to the accumulated amount of the buffer limited by a predetermined limit value, and

speech head protection is performed by prohibiting compression or deletion of the audio signal, or by adjusting the compression rate, if the voiced section is present within the speech head protection section.

[2] The speech speed conversion method according to claim 1, wherein

the length of a pause holding section, which is set after the end of a speech ending protection section of predetermined length following the voiced section, is set according to the length of the speech head protection section.

[3] The speech speed conversion method according to claim 1 or 2, wherein

a silence certainty is determined in a silent section in which the power of the input audio signal is less than the threshold, and speech head protection is performed by prohibiting compression or deletion of the audio signal, or by adjusting the compression rate, if the silence certainty of the silent section within the speech head protection section is small.

[4] The speech speed conversion method according to any one of claims 1 to 3, wherein

the signal-to-noise ratio of the input audio signal is estimated, and

the limit value for the speech head protection section when the estimated signal-to-noise ratio is higher than a certain value is set smaller than the limit value for the speech head protection section when the estimated signal-to-noise ratio is lower than the certain value.

[5] A speech speed conversion device that accumulates an input audio signal in a buffer and converts the speech speed by leaving unchanged or expanding the audio signal read from the buffer in a voiced section, in which the power of the input audio signal exceeds a threshold, and by leaving unchanged, compressing, or deleting the audio signal read from the buffer in a silent section, the device comprising:

speech head protection section determination means for setting a speech head protection section, set prior to the voiced section, to the accumulated amount of the buffer limited by a predetermined limit value; and

speech head protection means for protecting the speech head by prohibiting compression or deletion of the audio signal, or by adjusting the compression rate, if the voiced section is present within the speech head protection section.

[6] The speech speed conversion device according to claim 5, further comprising:

pause holding section setting means for setting the length of a pause holding section, which is set after the end of a speech ending protection section of predetermined length following the voiced section, according to the length of the speech head protection section.

[7] The speech speed conversion device according to claim 5 or 6, further comprising:

silence certainty determination means for determining a silence certainty in a silent section in which the power of the input audio signal is less than the threshold, wherein

the speech head protection means performs speech head protection by prohibiting compression or deletion of the audio signal, or by adjusting the compression rate, when the silence certainty of the silent section within the speech head protection section is small.

[8] The speech speed conversion device according to any one of claims 5 to 7, further comprising:

signal-to-noise ratio estimation means for estimating the signal-to-noise ratio of the input audio signal, wherein the speech head protection section determination means sets the limit value for the speech head protection section when the estimated signal-to-noise ratio is higher than a certain value to be smaller than the limit value for the speech head protection section when the estimated signal-to-noise ratio is lower than the certain value.

[9] A speech speed conversion device that accumulates an input audio signal in a buffer and, when compressing or expanding the audio signal read from the buffer, converts the speech speed so that a voiced section, in which the power of the input audio signal exceeds a threshold, is slower than a silent section, in which the power falls below the threshold, the device comprising:

speech head protection section determination means for setting a speech head protection section, set prior to the voiced section, to the accumulated amount of the buffer limited by a predetermined limit value; and

speech head protection means for protecting the speech head by prohibiting compression or deletion of the audio signal, or by adjusting the compression rate, if the voiced section is present within the speech head protection section.