WO2021166158A1

WO2021166158A1 - Speaking speed conversion device, speaking speed conversion method, program, and storage medium

Info

Publication number: WO2021166158A1
Application number: PCT/JP2020/006780
Authority: WO
Inventors: 茂明鈴木; 木村　勝
Original assignee: 三菱電機株式会社
Priority date: 2020-02-20
Filing date: 2020-02-20
Publication date: 2021-08-26
Also published as: JP7019117B2; JPWO2021166158A1; TW202133149A

Abstract

This speaking speed conversion device for converting speaking speed in a voice communication device decrypts voice encoded data, outputs a voice signal, generates frequency information from information obtained in the course of decryption, obtains the amount of variation over time of the frequency information as information change volume, determines, on the basis of the voice signal, whether a reception voice represented by the voice encoded data has a sound or no sound, determines that a syllable has transitioned if the information change volume during the time of determining that the reception voice has a sound satisfies a predetermined condition, calculates the speaking speed on the basis of the result of determination about the transition of the syllable, determines the conversion rate on the basis of the speaking speed, and converts the speaking speed with the determined conversion rate. The invention makes it possible to reduce the amount of calculation and carry out appropriate speaking speed conversion in accordance with the speaking speed.

Description

Speaking speed converter, speaking speed conversion method, program and recording medium

This disclosure relates to a speech speed conversion device, a speech speed conversion method, a program, and a recording medium.

In voice communication that sends and receives highly efficient encoded voice data, in order to improve the ease of hearing the voice, a speech speed conversion technology that slows down or speeds up the playback speed without changing the voice quality is used. It is being developed.
In voice communication, when converting the speaking speed, the speaking speed is lowered to convert the voice in the sounded section to make it easier to hear, and part or all of the silent section is deleted or the speaking speed is increased. It is often done to prevent an increase in delay.

In speech speed conversion, lowering the speaking speed of a fast-spoken voice can improve the ease of hearing the voice, but lowering the speaking speed of a slowly spoken voice makes it difficult to understand the rhythm of the speech, and rather listens. Ease may be compromised.
Therefore, a mechanism for measuring the speech speed of the voice before the speech speed conversion is required.

Conventionally, a technique for measuring a speech speed by obtaining a spectral feature of a spoken voice has been disclosed (Patent Document 1). In this technique, the speech voice is subjected to spectral analysis by linear prediction method (LPC) or fast Fourier transform (FFT) based on a full pole model every 10 ms, and a spectral feature amount vector is obtained based on the spectral analysis result.
Then, the transition of the syllable is detected by observing the change of the spectral feature vector, and the speaking speed is measured.

Japanese Unexamined Patent Publication No. 2005-331589

When such a speech speed conversion device is used to convert the speech speed of the received voice in a voice communication device that transmits and receives high-efficiency encoded voice code data, the decoding process and the speech speed conversion process are performed at the same time. Since it is necessary to perform it, there is a problem that the amount of calculation is large.

Furthermore, when measuring the speaking speed based on the audio signal obtained by decoding the highly efficient coded audio code data, distortion is superimposed on the audio signal obtained by decoding, which is why. There is a problem that the measurement accuracy of speaking speed is low.

This disclosure is made to solve the above-mentioned problems, makes it possible to reduce the amount of calculation, and accurately measure the speaking speed of the voice signal obtained by decoding the voice code data. , The purpose is to make it possible to perform an appropriate speech speed conversion according to the speech speed.

The speech speed converter of the present disclosure is
In a speech speed converter that converts speech speed in a voice communication device,
A voice decoding unit that decodes high-efficiency encoded voice code data and outputs a voice signal,
A frequency information generation unit that generates frequency information from information obtained in the process of decoding the voice code data in the voice decoding unit, and a frequency information generation unit.
An information change amount calculation unit that obtains the time change amount of the generated frequency information at regular time intervals as the information change amount, and
Based on the voice signal, a sound detection unit that determines whether the received voice represented by the voice code data is sound or no sound, and a sound detection unit.
A syllable transition determination unit that determines that a syllable of the received voice has transitioned when the amount of change in information while the received voice is determined to be sound by the sound detection unit satisfies a predetermined condition. When,
A speech speed calculation unit that calculates the speech speed based on the determination result by the syllable transition determination unit, and
A conversion rate determination unit that determines the conversion rate based on the speech speed calculated by the speech speed calculation unit,
It has a speaking speed conversion unit that converts the speaking speed of the audio signal at the conversion rate determined by the conversion rate determining unit.

According to the present disclosure, it is possible to reduce the amount of calculation and perform appropriate speech speed conversion according to the speech speed.

It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 1. FIG. It is a block diagram which shows the structural example of the audio decoding part of FIG. It is a block diagram which shows the structural example of the sound detection part of FIG. (A) to (h) are time charts showing signals appearing in each part of the sound detection unit of FIG. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 2. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 3. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 4. FIG. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 5. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 6. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 7. It is a block diagram which shows the structure of the speech speed conversion apparatus which concerns on Embodiment 8. It is a block diagram which shows the hardware structure of the computer which realizes all the functions of a speech speed converter. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 1 is composed of the computer of FIG. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 5 is composed of the computer of FIG. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 6 is composed of the computer of FIG. It is a flowchart which shows the procedure of processing by a processor when the speech speed conversion apparatus of FIG. 7 is composed of the computer of FIG. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 8 is composed of the computer of FIG. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 9 is composed of the computer of FIG. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 10 is composed of the computer of FIG. It is a flowchart which shows the procedure of the processing by a processor when the speech speed conversion apparatus of FIG. 11 is composed of the computer of FIG.

Embodiment 1.
FIG. 1 shows the configuration of the speech speed conversion device according to the first embodiment.
The illustrated speech speed conversion device converts the speech speed of the received voice in the voice communication device, and has a voice decoding unit 1, a frequency information generation unit 2, an information change amount calculation unit 3, an extreme value detection unit 4, and a presence. It has a sound detection unit 5, a syllable transition determination unit 6, a speech speed calculation unit 7, a conversion rate determination unit 8, and a speech speed conversion unit 9.

First, the outline of the operation of each part will be described.
Highly efficient coded voice code data Da is input to the illustrated speech speed conversion device. The voice code data Da includes pitch cycle information of voice, information representing a fixed codebook vector, gain information, and information representing an LSP coefficient for each voice frame. Audio frames are simply referred to as frames below.

The voice decoding unit 1 decodes the voice code data Da and generates a voice signal (decoded voice signal) Db representing a linear PCM (Pulse Code Modulation) code.

The frequency information generation unit 2 extracts and outputs the frequency information Fa from the information generated in the decoding process in the voice decoding unit 1 at regular intervals. The frequency information Fa represents the vocal tract frequency characteristics when each phoneme is emitted.

The information change amount calculation unit 3 calculates the time change amount (information change amount) Vf of the frequency information Fa output from the frequency information generation unit 2 at regular time intervals.

The extreme value detection unit 4 detects the maximum value Mx and the minimum value Mn of the information change amount Vf calculated by the information change amount calculation unit 3.

The sound detection unit 5 determines whether the voice (received voice) represented by the voice code data Da is sound or no sound based on the voice signal Db output from the voice decoding unit 1, and determines whether the voice (received voice) is sound or no sound. Information indicating that, that is, sound / silence information Lm is output.

The syllable transition determination unit 6 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected by the extreme value detection unit 4 and the sound / silence information Lm output from the sound detection unit 5. Then, the determination result Sy is output.

The speaking speed calculation unit 7 calculates the speaking speed Ss based on the determination result Sy of the syllable transition determination unit 6. The speaking speed Ss is represented by the number of syllables per unit time.

The conversion rate determination unit 8 determines the speech speed conversion rate Rc of the received voice based on the speech speed Ss calculated by the speech speed calculation unit 7.

The speech speed conversion unit 9 performs a speech speed conversion process on the audio signal Db based on the speech speed conversion rate Rc determined by the conversion rate determination unit 8, and outputs the converted audio signal Dc.

The operation of each part will be described in more detail below.
The audio decoding unit 1 receives the highly efficient coded audio code data Da, decodes it into a linear PCM code, and outputs an audio signal (decoded audio signal) Db representing the linear PCM code.

FIG. 2 shows a configuration example of the audio decoding unit 1 of FIG. The voice decoding unit 1 shown in FIG. 2 is described in an ITU-T (International Telecommunication Union Telecommunication Standardization Sector) recommendation G.I. It conforms to the CS-ACELP (Conjugate Structure Algebraic Code Excited Liner Edition) coding method specified in 729.

The audio decoding unit 1 shown in FIG. 2 includes an adaptive codebook vector decoding unit 101, a gain decoding unit 102, a fixed codebook vector decoding unit 103, an adaptive prefilter unit 104, a predicted gain calculation unit 105, and an excitation signal generation unit 106. It has an LSP coefficient decoding unit 107, an interpolation unit 108, an LPC coefficient conversion unit 109, a composite filter unit 110, and a post filter unit 111.

The adaptive codebook vector decoding unit 101 decodes the pitch period information of the voice from the voice code data Da of each received frame and generates the adaptive codebook vector. The adaptive codebook vector represents an excitation signal generated in the past. Considering that the audio signal has a strong periodicity, it can be said that the excitation signal generated in the past is stored and reused based on the pitch period information.

The fixed code book vector decoding unit 103 decodes the fixed code book vector from the voice code data Da of each received frame.

The adaptive prefilter unit 104 emphasizes the pitch component of the decoded fixed codebook vector.

The gain decoding unit 102 decodes the gain information from the received voice code data Da of each frame, and outputs the gain of the adaptive codebook vector and the gain of the fixed codebook vector.

The predicted gain calculation unit 105 is based on the gain of the fixed codebook vector of each frame output from the gain decoding unit 102 and the past fixed codebook vector output from the adaptive prefilter unit 104. Find the predicted gain of the vector.

The excitation signal generation unit 106 is output from the adaptive codebook vector of each frame output from the adaptive codebook vector decoding unit 101, the fixed codebook vector of each frame output from the adaptive prefilter unit 104, and the gain decoding unit 102. The excitation signal Se is generated using the gain of the adaptive codebook vector of each frame and the predicted gain of the fixed codebook vector output from the predicted gain calculation unit 105.

The LSP coefficient decoding unit 107 decodes the LSP coefficient from the voice code data Da of each received frame.
In the CS-ACELP coding method, the frame length is 10 milliseconds, and the 10th-order LSP coefficient is decoded every 10 milliseconds.

The interpolation unit 108 uses the LSP coefficient of the current frame and the LSP coefficient of the previous frame to generate an LSP coefficient at an intermediate timing between them, that is, 5 milliseconds before the current frame by interpolation.

The LPC coefficient conversion unit 109 converts the LSP coefficient of the current frame and the LSP coefficient generated by interpolation into an LPC (Linear Predictive Coding) coefficient.

The synthetic filter unit 110 is a full-pole filter having an LPC coefficient output from the LPC coefficient conversion unit 109 as a filter coefficient, and generates a synthetic voice signal Sf by inputting an excitation signal Se generated by the excitation signal generation unit 106. ..

The post filter unit 111 emphasizes the pitch component of the synthetic audio signal Sf generated by the synthetic filter unit 110 to improve the audible quality.

The post filter unit 111 is a series of a plurality of filters. The long-term post filter among the plurality of filters is a filter that emphasizes the pitch component, and in this long-term post filter, a gain coefficient that controls the degree of emphasis of the pitch component is used.

The gain coefficient is generated by the processing by the above long-term post filter. Specifically, the delay in which the autocorrelation of the composite signal output by the composite filter unit 110 becomes large is searched, and if the autocorrelation in the delay is small, the gain coefficient is set to 0, and if not, the delay component (pitch) is set. A coefficient (greater than 0 and less than 1) is set to emphasize.

The output of the post filter unit 111 is output as the output of the audio decoding unit 1, that is, the decoded audio signal Db.

The frequency information generation unit 2 extracts and outputs the frequency information Fa from the information of each frame generated in the decoding process in the voice decoding unit 1.
The frequency information generation unit 2 shown in FIG. 1 includes an LSP coefficient extraction unit 21.
The LSP coefficient extraction unit 21 extracts the LSP (Line Spectral Pair) coefficient for each frame from the information generated by the decoding operation of the voice decoding unit 1, and outputs it as the frequency information Fa.

As described above, in the audio decoding process by the CS-ACELP coding method, the 10th-order LSP coefficient is decoded for each frame, and the LSP coefficient extraction unit 21 in FIG. 1 has this information, that is, the LSP coefficient decoding unit. The 10th-order LSP coefficient output from 107 is extracted for each frame and output as frequency information Fa. The 10th-order LSP coefficient of each frame can be seen as constructing one 10-dimensional LSP coefficient vector.

The information change amount calculation unit 3 calculates the distance (inter-vector distance) between the LSP coefficient vector of the current frame and the LSP coefficient vector one frame before as the information change amount Vf.

For example, assuming that n (n is an integer) represents the current time (encoded frame number) and n-i (i is an integer) represents the time i frames before the current time n, the LSP coefficient vector at time n Is f1 (n), f2 (n), ..., F10 (n), and the information change amount calculation unit 3 obtains the inter-vector distance d (n) by the calculation according to the following equation (1).

d (n)
= {F1 (n) -f1 (n-1)} x {f1 (n) -f1 (n-1)}
+ {F2 (n) -f2 (n-1)} x {f2 (n) -f2 (n-1)}
:
:
+ {F10 (n) -f10 (n-1)} x {f10 (n) -f10 (n-1)}
… (1)

In the following description, n will be described as indicating the current time.

The extreme value detection unit 4 detects the maximum value Mx and the minimum value Mn of the information change amount Vf within a certain period in the latest past. The most recent past fixed period referred to here is the period of the most recent past Na frame, that is, the period from the current time n to the time n-Na + 1 before the (Na-1) frame (Na is an integer of 4 or more). ..

Hereinafter, a procedure for obtaining the maximum value Mx and the minimum value Mn when the information change amount Vf is the inter-vector distance d (n) will be described.
First, the extremum detection unit 4 determines whether or not the inter-vector distance d (n-1) at time n-1 one frame before the current time n is maximum.
Then, if the inter-vector distance d (n-1) is maximum, the minimum in the latest past Na frame is detected.

For example, when d (n) is smaller than d (n-1) and d (n-1) is larger than d (n-2), d (n-1) is determined to be maximum.
If this condition is not satisfied, it is determined that d (n-1) is not maximum.

When d (n-1) is the maximum, the minimum is continuously specified.
For example, the latest time m in which d (m) is larger than d (m-1), d (m-1) is smaller than d (m-2), and n—Na + 2 ≦ m ≦ n-1 is satisfied. Search for (largest value). When m satisfying these conditions exists, it is determined that d (m-1) is the minimum.
When m satisfying this condition does not exist, d (n—Na + 1) is set to the minimum for convenience. The minimum for this convenience corresponds to the minimum value among d (n-Na + 1), d (n-Na + 2), ..., D (n-2).

The extreme value detection unit 4 acquires the maximum value (maximum value) Mx and the minimum value (minimum value) Mn detected as described above.

The Na value is set so that Na times the frame length is equal to or greater than the maximum value of the syllable length expected in normal utterance. Normally, the syllable is several tens of milliseconds when it is short, and 200 milliseconds or less when it is long, so it is better to set Na to a value equivalent to 200 milliseconds. Specifically, since the frame length of the CS-ACELP coding method is 10 ms, it is appropriate to set Na = 200/10 = 20.

The sound detection unit 5 determines the sound / silence of the voice signal output from the voice decoding unit 1, and outputs information indicating the determination result, that is, the sound / silence information Lm.
This determination is made every few milliseconds to several tens of milliseconds, for example, every frame period or an integral multiple thereof. Hereinafter, it will be described that this determination is performed for each frame period.
The sound detection unit 5 determines whether or not there is sound based on the amplitude of the voice signal Db output from the voice decoding unit 1.

FIG. 3 shows a configuration example of the sound detection unit 5.
The illustrated sound detection unit 5 includes a low level detection unit 51, a high level detection unit 52, a disjunction calculation unit 53, a hangover addition unit 54, a noise level calculation unit 55, and a threshold value setting unit 56.

The low level detection unit 51 compares the audio signal Db with the adaptation threshold value D56, and outputs the signal D51 based on the comparison result. The adaptive threshold value D56 is supplied from the threshold value setting unit 56. The low level detection unit 51 includes a comparison unit 511 and a determination unit 513.

The comparison unit 511 compares the audio signal Db with the adaptation threshold value D56, and outputs a signal D511 indicating the comparison result. The signal D511 is High if the audio signal Db is greater than the threshold D56 (that is, if the absolute value of the sample value of the audio signal Db is greater than the threshold D56), otherwise it is Low. The comparison is made for each sample cycle.

The determination unit 513 outputs the signal D51 based on the signal D511. The signal D51 becomes High when the signal D511 continues in the High state for a certain period of time or longer, and becomes Low immediately when the signal D511 becomes Low.

The high level detection unit 52 compares the audio signal Db with a predetermined threshold value D50, and outputs signals D52 and D521 based on the comparison result.
The threshold D50 is set to a value higher than the maximum background noise level normally expected. The high level detection unit 52 includes a comparison unit 521 and a determination unit 523.

The comparison unit 521 compares the audio signal Db with the threshold value D50, and outputs a signal D521 indicating the comparison result. The signal D521 is High if the audio signal Db is larger than the threshold value D50, and Low otherwise. The comparison is made for each sample cycle.

The determination unit 523 outputs the signal D52 based on the signal D521. The signal D52 becomes High when the signal D521 continues to be in the High state for a certain period of time or longer, and becomes Low immediately when the signal D521 becomes Low.

The OR unit 53 performs an operation to obtain the OR of the signal D51 and the signal D52. The output signal D53 of the OR unit 53 is High if at least one of the signals D51 or D52 is High, and Low otherwise.

The hangover addition unit 54 performs a hangover addition process on the output signal D53 of the OR unit 53, and outputs the signal obtained as a result as sound / silence information Lm.
The hangover process immediately changes from Low to High when the input signal (D53) changes from Low to High, and changes from High to Low after a certain delay time when the input signal (D53) changes from High to Low. This is a process for outputting a signal (Lm).

The noise level calculation unit 55 calculates the added average value D55a of the absolute value of the sample value of the audio signal Db at regular intervals, and obtains the noise level value D55 based on the calculated average value D55a. For example, the calculated average value D55a is updated with the moving average for each relatively long period as the noise level value D55. However, the average value D55a during the period when the signal D521 is High is not used for calculating the moving average, and the moving average calculated before that is maintained during that period.

The threshold value setting unit 56 adjusts the adaptation threshold value D56 according to the background noise level value D55 output from the noise level calculation unit 55. The adaptation threshold D56 is adjusted to a value slightly larger than the calculated noise level value D55. The adaptive threshold value D56 is changed following a change in the calculated noise level value D55.

The operation of the sound detection unit 5 will be described below with reference to FIGS. 4A to 4H.
In the illustrated example, it is assumed that the threshold value D50 is set to the value shown in FIG. 4A, and the audio signal Db changes as shown in FIG. 4A.

During the period when the audio signal Db is larger than the threshold value D50, the output signal D521 of the comparison unit 521 becomes High as shown in FIG. 4C, and the output signal D52 of the determination unit 523 becomes as shown in FIG. 4D. , It becomes High with a slight delay.

The average value D55a and the noise level value D55 calculated by the noise level calculation unit 55 change as shown in FIG. 4B, and the adaptation threshold value D56 calculated by the threshold value setting unit 56 is shown in FIG. 4A. It changes like.
The noise level value D55 shown in FIG. 4B and the adaptive threshold value D56 shown in FIG. 4A change according to the average value D55a, but the period during which the audio signal Db is larger than the threshold value D50 (signal D521 is High). (For a certain period of time) does not change and is maintained at the value immediately before it.

When the audio signal Db becomes larger than the threshold value D56 (time ta), the output signal D511 of the comparison unit 511 becomes High as shown in FIG. 4 (e), and the output signal D51 of the determination unit 513 becomes High as shown in FIG. 4 (f). As shown, it becomes High with a slight delay.

When the audio signal Db becomes the threshold value D56 or less (time tb), the output signal D511 of the comparison unit 511 becomes Low as shown in FIG. 4 (e), and the output signal D51 of the determination unit 513 is also shown in FIG. 4 (f). It becomes Low like.

After that, when the audio signal Db becomes larger than the threshold value D56 (time tc), the output signal D511 of the comparison unit 511 becomes High as shown in FIG. 4 (e), and the output signal D51 of the determination unit 513 becomes FIG. 4 (f). ), It becomes High with a slight delay.

As shown in FIG. 4 (g), the output signal D53 of the OR unit 53 rises with the rising edge of the signal D51 and falls with the falling edge of the signal D51.

As shown in FIG. 4H, the output signal Lm of the hangover addition unit 54 rises with the rise of the signal D53 and falls with a slight delay from the fall of the signal D53.

The signal Lm indicates that there is sound when it is High, and that it is silent when it is Low.

As described above, the sound detection unit 5 generates an adaptive threshold value D56 that changes according to the noise level value, and if the audio signal Db is larger than the adaptive threshold value D56 or the audio signal Db is larger than the threshold value D50. By determining that there is sound and determining that there is no sound if none of them is present, it is possible to appropriately determine whether there is sound or no sound.

The syllable transition determination unit 6 determines the presence or absence of a syllable transition only when the maximum value Mx and the minimum value Mn are detected by the extreme value detection unit 4 and sound is detected by the sound detection unit 5. The judgment result Sy is output.
In the syllable transition determination unit 6, the maximum value Mx is larger than the predetermined threshold value (first threshold value) Th1, and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold value (second threshold value) Th2. In some cases, it is determined that there is a syllable transition, and if not, it is determined that there is no syllable transition.

The speech speed calculation unit 7 calculates the speech speed Ss at regular calculation cycles based on the determination result Sy of the syllable transition determination unit 6 and the sound / silence information Lm from the sound detection unit 5.
The constant calculation period is, for example, 1 second.
To calculate the speaking speed, the number of syllables per unit time is calculated from the number of syllables during a certain period that was the most recent sounded sound at each time point and the length of the certain period, and output as the speaking speed Ss. do. Here, the "constant period" is an integral multiple of the frame period, and is, for example, about 3 seconds to 10 seconds.

"A certain period of time that was sounded in the latest past" means that only the time determined to be sounded from the current time is traced back to a certain period of time, and therefore, there is no sound in the latest past. Period is excluded.
For example, if a certain period is 3 seconds, if the sounded time and the silent time are half and half, it goes back to the past 6 seconds, and if the number of syllables in the 6 seconds is 30, the speaking speed is 30/3 = 10 syllables / It will be seconds.

In order to perform averaging only at the time determined to be sound in this way, if the time when the speaker is silent (not speaking) is included in the averaging time, a speech speed different from the actual one is required. This is because it will be lost.

After calculating the speaking speed once, the calculated speaking speed may be used as it is as long as the silence continues.

The conversion rate determination unit 8 determines the speech speed conversion rate Rc based on the speech speed Ss calculated by the speech speed calculation unit 7.
The speaking speed conversion rate Rc is also determined in the same cycle as the calculation of the speaking speed Ss. In other words, the conversion rate is calculated each time the speaking speed is calculated.
For example, when the target speech speed after the speech speed conversion is St syllable / sec and the speech speed calculated by the speech speed calculation unit 7 is Sr syllable / sec, St / Sr is the speech speed conversion rate Rc of the voice in the sound state. And.
However, since it is necessary to reduce the speaking speed in order to make the voice easier to hear, the speaking speed conversion rate Rc is set to 1 when St / Sr is larger than 1.

The speech speed conversion unit 9 converts the speech speed according to the speech speed conversion rate Rc from the conversion rate determination unit 8, and outputs the converted audio signal Dc.
In the conversion, the conversion rate determined by the conversion rate determination unit 8 for each calculation cycle (for example, every second) is applied until the next calculation cycle or until a new conversion rate is determined next.

The speech speed conversion can be realized by using, for example, a well-known algorithm such as PICOLA (Pointer Interval Control Overlap and Add) or TDHS (Time Domain Harmonic Scaling).

Here, as described above, the speech speed conversion rate Rc input from the conversion rate determination unit 8 is 1 or less, but if the speech speed is always slowed down, the processing delay of the speech speed conversion continues to increase, and the real time of voice communication is realized. I can't maintain my sex.
Therefore, the sound / silence information Lm by the sound detection unit 5 is input to the speech speed conversion unit 9, and the speaking speed is increased for the portion determined to be silent, or the portion determined to be silent is deleted. , Make sure that the audio signal is not delayed beyond a certain level.
Further, when the delay exceeds a certain value, the voice may be output without converting the speech speed even if there is sound.

As described above, in the speech speed conversion device of FIG. 1, the frequency information generation unit 2 extracts the LSP coefficient obtained in the voice decoding process in the voice decoding unit 1, and the information change amount calculation unit 3 extracts the LSP coefficient at regular time intervals. The amount of change in the voice signal is obtained, the syllable transition of the voice signal is detected based on this amount of change, and the speech speed is converted based on the detection result.

The above-mentioned speech speed conversion device can be modified in various ways.
For example, in the above example, the information change amount calculation unit 3 regards the 10th-order LSP coefficient as one 10-dimensional vector, and obtains the inter-vector distance d (n). However, it is not always necessary to calculate the amount of change using all of the 10th-order LSP coefficients.
For example, among the LSP coefficients f1 (n), f2 (n), ..., F10 (n) at time n, a vector consisting of only the LSP coefficients f1 (n), f2 (n), f3 (n) is used. , The inter-vector distance d (n) may be obtained by the following equation (2).
d (n)
= {F1 (n) -f1 (n-1)} x {f1 (n) -f1 (n-1)}
+ {F2 (n) -f2 (n-1)} x {f2 (n) -f2 (n-1)}
+ {F3 (n) -f3 (n-1)} x {f3 (n) -f3 (n-1)}
… (2)

Of the LSP coefficients, the low-order coefficient corresponds to the low-frequency component and the high-order coefficient corresponds to the high-frequency component. Therefore, by doing the above, only the change in the low-frequency component was focused on. The amount of change can be calculated.
Since the change in vocal tract frequency characteristics during utterance is larger on the low frequency side, this method is also advantageous in that it is possible to detect syllable transitions and the amount of calculation is smaller.
Further, there is an advantage that it is easy to exclude the influence of background noise superimposed on the audio signal.

Further, in the above example, the information change amount calculation unit 3 obtains the distance between the latest LSP coefficient vector and the LSP coefficient vector one frame before it, but obtains the distance from the LSP coefficient vector one frame before. You may do so.
For example, when the distance from the LSP coefficient vector two frames before is obtained, the inter-vector distance d (n) is obtained by the calculation of the following equation (3).
d (n)
= {F1 (n) -f1 (n-2)} x {f1 (n) -f1 (n-2)}
+ {F2 (n) -f2 (n-2)} x {f2 (n) -f2 (n-2)}
:
:
+ {F10 (n) -f10 (n-2)} x {f10 (n) -f10 (n-2)}
… (3)

When detecting the time change amount of the vocal tract frequency characteristic of voice by the inter-vector distance d (n), if the time difference between the two vectors for calculating the distance is too long, it becomes difficult to detect the time change amount.
However, as described above, the frame length of the voice decoding process by the CS-ACELP coding method is 10 milliseconds, which is sufficiently short with respect to the time change of the vocal tract frequency characteristics of the voice, and even if the distance from the vector before a plurality of frames is used. No problem arises. Specifically, the time difference between the two vectors for which the distance is calculated may be set to be shorter than the period of the syllable transition in the utterance.

As described above, according to the first embodiment, the frequency information generation unit 2 extracts the LSP coefficient obtained in the voice decoding process in the voice decoding unit 1, and the information change amount calculation unit 3 extracts the LSP coefficient at regular time intervals. The amount of change was obtained, and the syllable transition of the voice signal was detected based on this amount of change. Therefore, it is not necessary to perform spectral analysis of the decoded voice signal by linear prediction method (LPC) or fast Fourier transform (FFT) based on the omnipolar model, and further obtain a spectral feature amount vector. Therefore, the speech speed conversion can be performed with a small amount of calculation.

Further, the LSP coefficient used for detecting the syllable transition is calculated in the process of encoding the voice, and since this is the information transmitted to the voice decoding unit 1, the spectrum analysis is performed using the voice signal after the voice decoding. Compared with the frequency characteristics obtained by doing this, it is closer to the voice path frequency characteristics of the voice represented by the voice signal before being encoded. Therefore, the speaking speed can be measured with high accuracy even for the voice signal obtained by decoding the voice code data.

Embodiment 2.
In the first embodiment, the LSP coefficient was extracted as the frequency information Fa, and the syllable transition was detected based on the time change amount. It is also possible to extract the LPC coefficient instead of the LSP coefficient, calculate the LPC mel cepstrum or LPC cepstrum from the extracted LPC coefficient, and detect the syllable transition using the calculated LPC mel kepstram or LPC cepstrum.

FIG. 5 shows the configuration of the speech speed conversion device according to the second embodiment. The speech speed conversion device shown in FIG. 5 is generally the same as the speech speed conversion device of the first embodiment described with reference to FIG. 1, but differs in the following points. That is, instead of the frequency information generation unit 2 and the information change amount calculation unit 3, the frequency information generation unit 2b and the information change amount calculation unit 3b are provided. The frequency information generation unit 2b includes an LPC coefficient extraction unit 22 and a mer cepstrum calculation unit 23.

The LPC coefficient extraction unit 22 extracts the LPC coefficient for each frame from the information generated by the decoding operation of the voice decoding unit 1.
For example, when the voice decoding unit 1 performs voice decoding by the CS-ACELP coding method, a part of the output of the LPC coefficient conversion unit 109 in FIG. 2 is extracted.

Specifically, the LPC coefficient conversion unit 109 of the audio decoding unit 1 in FIG. 2 converts the LSP coefficient of each frame and the LSP coefficient generated by the interpolation by the interpolation unit 108 into an LPC coefficient and outputs the LPC coefficient. The LPC coefficient extraction unit 22 extracts the LPC coefficient generated by converting the LSP coefficient of each frame. For example, the 10th-order LPC coefficient is extracted.

The mel cepstrum calculation unit 23 converts the LPC coefficient extracted by the LPC coefficient extraction unit 22 into an LPC mel cepstrum.
In the voice analysis synthesis process, a 10th to 25th order LPC mel cepstrum is generally used, but it is meaningless to make it much larger than the 10th order which is the order of the original LPC coefficient. Therefore, the order of the LPC cepstrum generated by the mel cepstrum calculation unit 23 is appropriately about 10 to 15. Hereinafter, the order of the LPC mel cepstrum will be described as 10.

It can be seen that the 10th-order LPC mel cepstrum of each frame generated by the mel cepstrum calculation unit 23 constitutes a 10-dimensional LPC mel cepstrum vector.

The information change amount calculation unit 3b of FIG. 5 is the same as the information change amount calculation unit 3 of FIG. 1, but is based on the output of the frequency information generation unit 2b instead of the output of the frequency information generation unit 2 of FIG. The amount of information change Vf is calculated.

That is, the information change amount calculation unit 3b uses the distance between the LPC mel cepstrum vector of the current frame output from the mel cepstrum calculation unit 23 of the frequency information generation unit 2b and the LPC mel cepstrum vector one frame before as the information change amount Vf. calculate.

The operation of the information change amount calculation unit 3b is basically the same as that of the information change amount calculation unit 3 of the first embodiment, except that the input is not the LSP coefficient vector but the LPC mel cepstrum vector.

Except for the above, the operation of the speech speed conversion device of the second embodiment is the same as the operation of the speech speed conversion device of the first embodiment.

In the above description, the information change amount calculation unit 3b considers the 10th-order LPC mel cepstrum as one 10-dimensional vector and obtains the inter-vector distance d (n). It is not always necessary to calculate the amount of change using all of the 10th-order LPC mel cepstrums.
Also, instead of finding the distance between the latest LPC mel cepstrum vector and the LPC mel cepstrum vector one frame before it, the distance between the latest LPC mel cepstrum vector and the LPC mel cepstrum vector several frames before it is calculated. Is also good.

Further, in the above description, the LPC cepstrum is calculated, the amount of change in the LPC cepstrum at regular time intervals is obtained, and the syllable transition is detected based on the amount of change, but the LPC cepstrum is calculated instead of the LPC cepstrum. Then, the amount of change in the LPC cepstrum at regular time intervals may be obtained, and the syllable transition may be detected based on the amount of change. As the amount of change of the LPC cepstrum at regular intervals, for example, the distance between the LPC cepstrum vectors composed of the LPC cepstrum can be used.

As described above, according to the second embodiment, the frequency information generation unit 2b extracts the LPC coefficient obtained in the voice decoding process in the voice decoding unit 1, calculates the LPC mel cepstrum or the LPC cepstrum from the LPC coefficient, and then calculates the LPC mel cepstrum or the LPC cepstrum. The information change amount calculation unit 3b obtains the change amount of the LPC mel cepstrum or the LPC cepstrum at regular time intervals, and detects the syllable transition of the voice signal based on this change amount. Therefore, it is not necessary to perform a spectral analysis of the decoded audio signal by a linear prediction method (LPC) based on a omnipolar model. Therefore, the speech speed conversion can be performed with a small amount of calculation.

Further, the LPC mel cepstrum or LPC cepstrum used for detecting the syllable transition is calculated based on the LSP coefficient calculated in the process of coding the voice, and this LSP coefficient has been transmitted to the voice decoding unit 1. Because it is information, it is closer to the vocal tract frequency characteristics of the voice represented by the voice signal before encoding than the frequency characteristics obtained by performing spectral analysis using the voice signal after voice decoding. ..
Therefore, the speaking speed can be measured with high accuracy even for the voice signal obtained by decoding the voice code data generated by the high efficiency coding.

Further, LPC cepstrum or LPC cepstrum is generally used in speech recognition, and determines syllable transition with higher accuracy than determining syllable transition using the LSP coefficient shown in the first embodiment. Can be done.

Embodiment 3.
In the first embodiment, the LSP coefficient was extracted as the frequency information of the voice, and the syllable transition was detected based on the time change amount. After thinning out the extracted LSP coefficient, the syllable transition may be detected based on the amount of time change.

FIG. 6 shows the configuration of the speech speed conversion device according to the third embodiment. The speech speed conversion device shown in FIG. 6 is generally the same as the speech speed conversion device of the first embodiment described with reference to FIG. 1, but differs in the following points. That is, instead of the frequency information generation unit 2, the information change amount calculation unit 3, and the extreme value detection unit 4, the frequency information generation unit 2c, the information change amount calculation unit 3c, and the extreme value detection unit 4c are provided. The frequency information generation unit 2c includes an LSP coefficient extraction unit 21 and a thinning unit 24.

The LSP coefficient extraction unit 21 is the same as the LSP coefficient extraction unit 21 of the first embodiment, and operates in the same manner.

The thinning unit 24 thins out the LSP coefficient (frequency information) extracted by the LSP coefficient extracting unit 21 at a thinning rate M. In this thinning, the LSP coefficient output from the LSP coefficient extraction unit 21 for each frame is extracted only once in the M frame (M is an integer of 2 or more).

For example, as in the first embodiment, when one frame length is 10 milliseconds, the voice decoding unit 1 decodes the LSP coefficient every 10 milliseconds, and the LSP coefficient extraction unit 21 decodes the LSP coefficient every 10 milliseconds. Extract to.

The thinning unit 24 extracts the LSP coefficient extracted by the LSP coefficient extraction unit 21 every M frame, and therefore every 10 × M milliseconds.

The operation of the information change amount calculation unit 3c is basically the same as that of the information change amount calculation unit 3 of the first embodiment, but the processing is performed not every frame but every M frame.
The information change amount calculation unit 3c also calculates the distance between the LSP coefficient vectors output one after the other from the thinning unit 24. As a result, the information change amount calculation unit 3c obtains the distance between the latest (current frame) LSP coefficient vector and the LSP coefficient vector M frame before that.

As described above, the frame length of the voice decoding process by the CS-ACELP coding method is 10 milliseconds, which is sufficiently short with respect to the time change of the vocal tract frequency characteristics of the voice, and no problem occurs unless M becomes excessively large. Specifically, the value of M may be set so that 10 × M milliseconds is shorter than the period of syllable transition in utterance.

The operation of the extreme value detecting unit 4c is basically the same as that of the extreme value detecting unit 4 of the first embodiment.
However, similarly to the information change amount calculation unit 3c, the processing is performed not for each frame but for each M frame.

The extreme value detection unit 4c detects the maximum value Mx and the minimum value Mn of the information change amount Vf of the latest past Nb frame input to the extreme value detection unit 4c. Nb is smaller than Na in the description of the extremum detection unit 4 of the first embodiment. This is because the information change amount Vf is input to the extreme value detection unit 4c of the third embodiment not every frame but every M frame. For example, Nb is set to a value equal to Na / M. When Nb is a value equivalent to 200 milliseconds, Nb = Na / M = 20 / M.

Further, in performing the maximum detection, the extreme value detection unit 4c determines whether or not the information change amount Vf at the time (n-1) one frame before the current time n is the maximum, instead of determining whether or not the current frame is the maximum. It is determined whether or not the amount of information change Vf at the time (nm) before the M frame is maximum.

For example, when d (n) is smaller than d (n-M) and d (n-M) is larger than d (n-2M), d (n-M) is determined to be maximum.
If this condition is not satisfied, it is determined that d (nm) is not the maximum.

Further, in specifying the minimum, for example, d (m) is larger than d (m-M), d (m-M) is smaller than d (m-2M), and n−Nb + 2M ≦ m ≦ n−. Search for the latest time m (largest value) that satisfies M. When m satisfying these conditions exists, it is determined that d (m-M) is the minimum.
When m satisfying this condition does not exist, d (n−Nb + M) is set to the minimum for convenience. The minimum for this convenience corresponds to the minimum value among d (n-Nb + M), d (n-Nb + 2M), ..., D (n-2M).

Except for the above, the operation of the speech speed conversion device of the third embodiment is the same as the operation of the speech speed conversion device of the first embodiment.

The same effect as that of the first embodiment can be obtained in the third embodiment.
Further, since the thinning unit 24 extracts the LSP coefficient only once in the M frame, the operation frequency of the information change amount calculation unit 3c and the extreme value detection unit 4c is reduced, and the calculation amount is smaller than that of the first embodiment. It is possible to further reduce the number.

Embodiment 4.
The same thinning as in the third embodiment may be performed on the speech speed conversion device that detects the syllable transition by obtaining the amount of change in the LPC mel cepstrum shown in the second embodiment.

FIG. 7 shows the configuration of the speech speed conversion device according to the fourth embodiment. The speech speed conversion device shown in FIG. 7 is generally the same as the speech speed conversion device of the second embodiment described with reference to FIG. 5, but differs in the following points. That is, instead of the frequency information generation unit 2b, the information change amount calculation unit 3b, and the extreme value detection unit 4, the frequency information generation unit 2d, the information change amount calculation unit 3d, and the extreme value detection unit 4c are provided. The frequency information generation unit 2d includes an LPC coefficient extraction unit 22, a thinning unit 24d, and a mer cepstrum calculation unit 23d.

The LPC coefficient extraction unit 22 extracts the LPC coefficient from the information generated by the decoding operation of the voice decoding unit 1 in the same manner as the LPC coefficient extraction unit 22 of the second embodiment shown in FIG.

The thinning section 24d is the same as the thinning section 24 in FIG. However, the output of the LPC coefficient extraction unit 22 is thinned out instead of the output of the LSP coefficient extraction unit 21. In this thinning, the LPC coefficient output from the LPC coefficient extraction unit 22 for each frame is extracted only once in the M frame (M is an integer of 2 or more).

For example, as in the first embodiment, when one frame length is 10 milliseconds, the audio decoding unit 1 decodes the LSP coefficient every 10 milliseconds, and the LPC coefficient extraction unit 22 decodes the LPC coefficient every 10 milliseconds. Extract to.

The thinning unit 24d extracts the LPC coefficient extracted by the LPC coefficient extraction unit 22 every M frame, and therefore every 10 × M milliseconds.

The mel cepstrum calculation unit 23d converts the LPC coefficient output from the thinning unit 24d into an LPC mel cepstrum. The operation of the mel cepstrum calculation unit 23d is basically the same as the operation of the mel cepstrum calculation unit 23 of FIG. 6, but the processing is performed not every frame but every M frame.

The information change amount calculation unit 3d of FIG. 7 is the same as the information change amount calculation unit 3b of FIG. 5, but is based on the output of the mel cepstrum calculation unit 23d instead of the output of the mer cepstrum calculation unit 23 of FIG. The amount of information change Vf is calculated.

That is, the information change amount calculation unit 3d considers the 10th-order LPC mel cepstrum output from the mel cepstrum calculation unit 23d of the frequency information generation unit 2d as one 10-dimensional vector, and obtains the inter-vector distance d (n).

The operation of the information change amount calculation unit 3d is basically the same as that of the information change amount calculation unit 3c of FIG. 6, except that the input is not the LSP coefficient but the LPC mel cepstrum.
Further, the operation of the information change amount calculation unit 3d is basically the same as that of the information change amount calculation unit 3b in FIG. 5, except that the processing is performed not for each frame but for each M frame.

The extremum detection unit 4c of the fourth embodiment is the same as the extremum detection unit 4c of FIG. 6, and operates in the same manner.

Except for the above, the operation of the speech speed conversion device of the fourth embodiment is the same as the operation of the speech speed conversion device of the second embodiment.

The same effect as that of the second embodiment can be obtained in the fourth embodiment.
Further, since the thinning unit 24 extracts the LPC coefficient only once in the M frame, the operation frequency of the mer cepstrum calculation unit 23d, the information change amount calculation unit 3d, and the extreme value detection unit 4c is reduced, and the implementation is carried out. The amount of calculation can be reduced as compared with the second form.

Embodiment 5.
In the syllable transitions of the first to fourth embodiments, a function of detecting whether the voice is a voiced sound or an unvoiced sound is added, and the syllable transition can be detected by using the voiced / unvoiced information together.
The fifth embodiment is a modification of the first embodiment in which voiced / unvoiced information is used in combination.

FIG. 8 shows the configuration of the speech speed conversion device according to the fifth embodiment. The speech speed conversion device shown in FIG. 8 is generally the same as the speech speed conversion device of FIG. 1, but differs in the following points. That is, a voiced information extraction unit 10 is added, and a syllable transition determination unit 6e is provided instead of the syllable transition determination unit 6. The configuration of the frequency information generation unit 2 in FIG. 8 is the same as that in FIG. 1, and the illustration thereof is omitted.

The voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing voice decoding in the voice decoding unit 1.
As described above, when the voice decoding unit 1 performs voice decoding by the CS-ACELP coding method, the configuration is as shown in FIG. 2, for example, and the voiced / unvoiced information Vc from the post filter unit 111 in FIG. Is obtained.

As described above, the long-term post filter in the post filter unit 111 is a filter that emphasizes the pitch component, and in this long-term post filter, a gain coefficient that controls the degree of emphasis of the pitch component is used.

If the voice is a voiced sound, the sound quality can be improved by emphasizing the pitch component, but if the voice is not a voiced sound, the sound quality deteriorates. Therefore, the gain coefficient is set to zero and the pitch component is not emphasized. Therefore, the gain coefficient of this long-term post filter can be used as voiced / unvoiced information Vc. That is, it can be said that the decoded voice when this coefficient is not 0 is voiced, and the decoded voice when this coefficient is 0 is unvoiced.
The voiced information extraction unit 10 extracts and outputs the above gain coefficient as voiced / unvoiced information Vc.

Similar to the syllable transition determination unit 6 of FIG. 1, the syllable transition determination unit 6e detected the maximum value Mx and the minimum value Mn in the extreme value detection unit 4 and detected as sound in the sound detection unit 5. The presence or absence of syllable transition is determined only in this case.
The syllable transition determination unit 6e determines that there is a syllable transition when the maximum value Mx is larger than the predetermined threshold value Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold value Th2.
The syllable transition determination unit 6e also determines that there is a syllable transition when the voiced / unvoiced information Vc input from the voiced information extraction unit 10 changes from one indicating "unvoiced" to one indicating "voiced".
If none of these applies, the syllable transition determination unit 6e determines that there is no syllable transition.

When the voiced / unvoiced information Vc input from the voiced information extraction unit 10 changes from the one indicating "unvoiced" to the one indicating "voiced", the syllable transition determination unit 6e determines that there is a syllable transition. , When the voiced / unvoiced information Vc input from the voiced information extraction unit 10 changes from the one indicating "voiced" to the one indicating "unvoiced", it may be determined that there is a syllable transition.
In short, the syllable transition determination unit 6e determines that there is a voice transition based on the maximum value Mx and the minimum value Mn, and also detects that the state has changed from the unvoiced state to the voiced state based on the voiced / unvoiced information Vc. It may be determined that there is a syllable transition when it is performed, or it may be determined that there is a syllable transition when it is detected that the state has changed from a voiced state to an unvoiced state.

The determination of the presence or absence of the syllable transition based on the voiced / unvoiced information Vc in the syllable transition determination unit 6e may be performed independently of the detection of the maximum value Mx and the minimum value Mn in the extreme value detection unit 4.
That is, even if the maximum value Mx and the minimum value Mn are not detected by the extreme value detection unit 4, it may be determined that the syllable has transitioned based on the voiced / unvoiced information Vc.

The detection of the syllable transition by the maximum value Mx and the minimum value Mn in the extreme value detection unit 4, that is, the detection of the syllable transition due to the time change of the voice tract frequency characteristic may miss the syllable transition when the pronunciation is not clear.
This oversight can be compensated for by the combined use of changes in the voiced / unvoiced information Vc output from the voiced information extraction unit 10.

The same effect as that of the first embodiment can be obtained in the fifth embodiment.
Further, the voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing the voice decoding in the voice decoding unit 1, and the syllable transition determination unit 6e determines the presence / absence of the syllable transition based on the voiced / unvoiced information Vc. Since the determination is made, the measurement accuracy of the speaking speed can be further improved as compared with the first embodiment.

Embodiment 6.
The detection of the syllable transition using the LPC mel cepstrum or the LPC cepstrum shown in the second embodiment and the detection of the syllable transition using the voiced / unvoiced information Vc shown in the fifth embodiment can also be used in combination.

The speech speed conversion device shown in FIG. 9 is generally the same as the speech speed conversion device of FIG. 5, but a voice information extraction unit 10 is added, and a syllable transition determination unit 6e is used instead of the syllable transition determination unit 6. It is different because it is provided. The configuration of the frequency information generation unit 2b in FIG. 9 is the same as that in FIG. 5, and the illustration thereof is omitted.

The voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 with respect to the fifth embodiment, and operate in the same manner.

The same effect as that of the second embodiment can be obtained in the sixth embodiment.
Further, the additional effect described with respect to the fifth embodiment can be obtained. That is, the voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing the voice decoding in the voice decoding unit 1, and the syllable transition determination unit 6e determines the presence / absence of the syllable transition based on the voiced / unvoiced information Vc. Since the determination is made, the measurement accuracy of the speaking speed can be further improved as compared with the second embodiment.

Embodiment 7.
The syllable transition detection using the voiced / unvoiced information shown in the fifth embodiment can also be used in combination with the speech speed conversion device of the third embodiment described with reference to FIG.

The speech speed conversion device of FIG. 10 is generally the same as the speech speed conversion device of FIG. 6, but a voice information extraction unit 10 is added, and a syllable transition determination unit 6e is provided instead of the syllable transition determination unit 6. It is different. The configuration of the frequency information generation unit 2c in FIG. 10 is the same as that in FIG. 6, and the illustration thereof is omitted.

The same effect as that of the third embodiment can be obtained in the seventh embodiment.
Further, the same additional effect as in the fifth embodiment can be obtained. That is, the voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing the voice decoding in the voice decoding unit 1, and the syllable transition determination unit 6e determines the presence / absence of the syllable transition based on the voiced / unvoiced information Vc. Since the determination is made, the measurement accuracy of the speaking speed can be further improved as compared with the third embodiment.

Embodiment 8.
The syllable transition detection using the voiced / unvoiced information shown in the fifth embodiment can also be used in combination with the speech speed conversion device of the fourth embodiment described with reference to FIG. 7.

The speech speed conversion device shown in FIG. 11 is generally the same as the speech speed conversion device of FIG. 7, but a voice information extraction unit 10 is added, and a syllable transition determination unit 6e is used instead of the syllable transition determination unit 6. It is different because it is provided. The configuration of the frequency information generation unit 2d in FIG. 11 is the same as that in FIG. 7, and the illustration thereof is omitted.

The same effect as that of the fourth embodiment can be obtained in the eighth embodiment.
Further, the same additional effect as in the fifth embodiment can be obtained. That is, the voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing the voice decoding in the voice decoding unit 1, and the syllable transition determination unit 6e determines the presence / absence of the syllable transition based on the voiced / unvoiced information Vc. Since the determination is made, the measurement accuracy of the speaking speed can be further improved as compared with the fourth embodiment.

Each of the speech speed conversion devices of the first to eighth embodiments may be composed of a part or all of the processing circuit.
For example, the functions of each part of the speech speed converter may be realized by separate processing circuits, or the functions of a plurality of parts may be collectively realized by one processing circuit.
The processing circuit may be composed of hardware or software, that is, a programmed computer.
Of the functions of each part of the speech speed converter, a part may be realized by hardware and the other part may be realized by software.

FIG. 12 shows the hardware configuration of the computer 90 that realizes all the functions of the speech speed converter.
In the illustrated example, the computer 90 has a processor 91 and a memory 92.
The memory 92 stores a program for realizing the functions of each part of the speech speed conversion device.

The processor 91 uses, for example, a CPU (Central Processing Unit), a microprocessor, a microcontroller, a DSP (Digital Signal Processor), or the like.

The memory 92 includes, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Lead Only Memory), an EEPROM (Electrically Memory Memory, etc.) Alternatively, a photomagnetic disk or the like is used.

The processor 91 and the memory 92 may be realized by an LSI (Large Scale Integration) integrated with each other.

The processor 91 realizes the function of the speech speed converter by executing the program stored in the memory 92.
The program may be provided over a network or may be recorded and provided on a recording medium, such as a non-temporary recording medium. That is, the program may be provided, for example, as a program product.

The computer of FIG. 12 includes a single processor, but may include two or more processors.

The processing procedure by the processor 91 when the speech speed conversion device of FIG. 1 is configured by the computer of FIG. 12 will be described with reference to FIG.

The process of FIG. 13 is started every time one frame of voice code data is received.
Therefore, when the speech speed conversion device processes the voice code data encoded with high efficiency by the CS-ACELP coding method, the process shown in FIG. 13 is started every 10 milliseconds.

In step ST1, the processor 91 performs voice decoding of the received voice code data and outputs the decoded voice signal. The process of step ST1 has the same contents as the process in the voice decoding unit 1 of FIG. For example, the same processing as the operation of the voice decoding unit described with reference to FIG. 2 is performed.

In step ST2, the processor 91 determines whether the decoded audio signal is audible or silent. The process of step ST2 has the same content as the process of the sound detection unit 5 of FIG.
The processing in the sound detection unit 5 is performed every period shorter than the frame period, for example, comparison with the threshold value D56 or D50, calculation of the average value D55a, update of the noise level value D55, and update of the threshold value D56. However, these shall be performed separately.

In step ST3, the processor 91 generates frequency information Fa at regular time intervals based on the information generated in the voice decoding process in step ST1. Specifically, the LSP coefficient generated by the voice decoding process is extracted. For example, the 10th-order LSP coefficient is extracted for each frame. The processing in step ST3 has the same contents as the processing in the frequency information generation unit 2 of FIG.

In step ST4, the processor 91 calculates the amount of time change of the frequency information Fa generated in step ST3 at regular time intervals. Specifically, the 10th-order LSP coefficient is regarded as one vector, and the distance between the latest LSP coefficient vector and the LSP coefficient vector one frame before it is calculated as the amount of frequency change. The process of step ST4 has the same contents as the process in the information change amount calculation unit 3 of FIG.

In step ST5, the processor 91 refers to the determination result in step ST2, and if it is not determined to be sound (No in ST5), the processor 91 proceeds to step ST12 and determines that it is sound. If it is done (yes in ST5), the process proceeds to step ST6.

In step ST6, the processor 91 detects the maximum value Mx and the minimum value Mn for the information change amount Vf calculated in step ST4.
For that purpose, first, it is determined whether or not there is a maximum and a minimum during the most recent past Na frame period. Specifically, it is determined whether or not the amount of information change Vf for the time n-1 one frame before the current time n is the maximum, and if it is the maximum, the value (maximum value) Mx is acquired. , The minimum value in the information change amount Vf for the latest past Na frame is specified, and the value (minimum value) Mn is acquired.
The process in step ST6 has the same content as the process in the extreme value detection unit 4 of FIG.

In step ST7, the processor 91 determines whether or not the maximum and minimum are detected in step ST6.
If the maximum and minimum are not detected, the process proceeds to step ST12.
If the maximum and the minimum are detected, the process proceeds to step ST8.

In step ST8, the processor 91 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected in step ST6.
For example, if the maximum value Mx is larger than the predetermined threshold value Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold value Th2, it is determined that there is a syllable transition, and if not, there is no syllable transition. judge.
The process of step ST8 has the same content as the process of the syllable transition determination unit 6 of FIG.

In step ST9, the processor 91 determines whether or not it is the timing for calculating the speaking speed. For example, when the speaking speed is calculated every fixed calculation cycle, in step ST9, it is determined whether or not a time corresponding to a certain calculation cycle has elapsed from the previous calculation. If it is not the calculation timing, the process proceeds to step ST12. If it is the calculation timing, the process proceeds to step ST10.

Next, in step ST10, the processor 91 calculates the speaking speed Ss.
For example, the number of syllables per unit time is obtained from the number of syllables in a certain period that was the most recent sound at each time point and the length of the certain period, and this is defined as the speaking speed Ss.
Here, as described above, the time when the voice speaker is silent is not included in the averaging time. Therefore, averaging is performed only for the time determined to be sound in step ST2.
The process in step ST10 has the same content as the process in the speech speed calculation unit 7 of FIG.

Next, in step ST11, the processor 91 determines the speaking speed conversion rate Rc. The speaking speed conversion rate is determined by obtaining the ratio St / Sr of the target speaking speed St and the speaking speed Sr calculated in step ST10. The process in step ST11 has the same content as the process in the conversion rate determination unit 8 of FIG.

In step ST12, the processor 91 converts the audio signal decoded in step ST1 into speech speed using the speech speed conversion rate obtained in step ST11. The processing in step ST12 has the same contents as the processing in the speech speed conversion unit 9 of FIG.

As described above, if No in step ST5, step ST7 or step ST9, the process proceeds to step ST12 without going through step ST11 and the like. In these cases, no new speech speed conversion rate is calculated.
In this case, the speaking speed conversion is performed based on the latest speaking speed conversion rate calculated in the past in step ST12.
Further, as described above, by increasing the speaking speed of the audio signal determined to be silent, or deleting a part or all of the audio signal, it is possible to prevent the processing delay of the speaking speed conversion from continuing to increase.
If the delay exceeds a certain value, the voice is output without converting the speech speed even if there is sound.

Next, when the speech speed conversion device of FIG. 5 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure of FIG. 14 is generally the same as the processing procedure of FIG. 13, except that steps ST3 and ST4 are replaced by steps ST13 and ST4b, and the processing of step ST14 is performed after step ST13.

In step ST13, the processor 91 extracts the LPC coefficient generated by the voice decoding process in step ST1.
When voice decoding is performed by the CS-ACELP coding method in step ST1, there are an LPC coefficient converted from the LSP coefficient of each frame and an LPC coefficient converted from the LSP coefficient generated by interpolation. The converted LPC coefficient is extracted from the LSP coefficient.
The process of step ST13 has the same contents as the process in the LPC coefficient extraction unit 22 of FIG.

In step ST14, the processor 91 converts the LPC coefficient extracted in step ST13 into an LPC mel cepstrum. For example, a 10th order LPC mel cepstrum is generated by the conversion. The LPC mel cepstrum is used as frequency information Fa.
The process of step ST14 has the same contents as the process in the mer cepstrum calculation unit 23 of FIG.

The process of step ST4b in FIG. 14 is the same as the process of step ST4 of FIG.
However, the frequency information Fa used for calculating the information change amount Vf is different.
That is, in step ST4b, the processor 91 regards the 10th-order LPC mel cepstrum as one vector, and calculates the distance between the latest LPC mel cepstrum vector and the LPC mel cepstrum vector one frame before it as the amount of information change Vf. do.
The process of step ST4b has the same content as the process in the information change amount calculation unit 3b of FIG.

Even when the speech speed conversion device of FIG. 5 is configured by the computer of FIG. 12, the same effect as described in the second embodiment can be obtained.

Next, when the speech speed conversion device of FIG. 6 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure of FIG. 15 is generally the same as the processing procedure of FIG. 13, but the processing of step ST15 is performed after step ST3, the processing of step ST16 is performed before step ST4, and step ST6 is performed. It differs in that it has been replaced in step ST6c.

In step ST15, the processor 91 determines whether or not it is the extraction timing in the thinning process. For example, it is assumed that the thinning rate is M and the extraction is performed only once in the M frame. In that case, it is determined whether or not M frames have passed since the previous extraction. At the time of the first extraction after the start of the operation of the processor 91, it is determined that the extraction timing is set even if M frames have not passed since the previous extraction.

If it is not the extraction timing in step ST15 (if it is No in ST15), the process proceeds to step ST12. If it is the extraction timing (if it is Yes in ST15), the process proceeds to step ST16.
In step ST16, the processor 91 extracts the frequency information generated in step ST3. The processes of steps ST15 and ST16 have the same contents as the processes in the thinning section 24 of FIG.
After step ST16, the process proceeds to step ST4.

The processing of steps ST4 and ST5 of FIG. 15 is the same as the processing of steps ST4 and ST5 of FIG. However, these processes are performed only once in the M frame.

Further, in step ST6c of FIG. 15, the processor 91 detects the maximum value Mx and the minimum value Mn from the information change amount Vf calculated in step ST4 for the latest past Nb frame.
Nb is smaller than Na in the description of step ST6 in FIG. This is because in step ST4 of FIG. 15, the amount of information change Vf is calculated not for each frame but for each M frame. For example, Nb is set to a value equal to Na / M.
The process of step ST6c has the same content as the process of the extreme value detection unit 4c of FIG.

Even when the speech speed conversion device of FIG. 6 is configured by the computer of FIG. 12, the same effect as described in the third embodiment can be obtained.

Next, when the speech speed conversion device of FIG. 7 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure of FIG. 16 is generally the same as the processing procedure of FIG. 14, but the processing of step ST15 is performed after step ST3, the processing of step ST17 is performed before step ST14, and step ST6 is performed. It differs in that it has been replaced by step ST6c.

In step ST15, the processor 91 determines whether or not it is the timing of extraction in the thinning process.
If it is not the extraction timing (if ST15 is No), the process proceeds to step ST12. If it is the extraction timing (if it is Yes in ST15), the process proceeds to step ST17.
In step ST17, the processor 91 extracts the LPC coefficient extracted in step ST13. The processes of steps ST15 and ST17 have the same contents as the processes in the thinning section 24d of FIG.
After step ST17, the process proceeds to step ST14.

The processing of steps ST14, ST4 and ST5 of FIG. 16 is the same as the processing of steps ST14, ST4 and ST5 of FIG. However, these processes are performed only once in the M frame.

Further, in step ST6c of FIG. 16, the processor 91 detects the maximum value Mx and the minimum value Mn from the information change amount Vf calculated in step ST4 for the latest past Nb frame.
Nb is similar to Nb in the description for step ST6c in FIG. 15 and smaller than Na in the description for step ST6 in FIG.
The process of step ST6c has the same content as the process of the extreme value detection unit 4c of FIG.

Even when the speech speed conversion device of FIG. 7 is configured by the computer of FIG. 12, the same effect as described in the fourth embodiment can be obtained.

Next, when the speech speed conversion device of FIG. 8 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure shown in FIG. 17 is generally the same as the processing procedure shown in FIG. 13, but the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8. It differs in that.

In step ST18, the processor 91 determines whether the received voice is voiced or unvoiced from the information obtained in the process of voice decoding processing in step ST1.
When voice decoding is performed by the CS-ACELP coding method in step ST1, it is determined whether the voice is voiced or unvoiced based on the gain coefficient used for controlling the degree of emphasis of the pitch component in the long-term post-filter processing.
The process of step ST18 has the same content as the process in the voiced information extraction unit 10 of FIG.

In step ST8e, the processor 91 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected in step ST6, and the result of the determination of voiced or unvoiced in step ST18.
For example, when the maximum value Mx is larger than the predetermined threshold value Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold value Th2, or the result of the determination in step ST18 is from "unvoiced" to "voiced". If it changes to, it is determined that there is a syllable transition.
If none of these apply, it is determined that there is no syllable transition.
The process of step ST8e has the same content as the process of the syllable transition determination unit 6e of FIG.

When the result of the determination in step ST18 changes from "unvoiced" to "voiced", instead of determining that there is a syllable transition, the result of the determination in step ST18 changes from "voiced" to "unvoiced". , It may be determined that there is a syllable transition.

Further, in the processing procedure of FIG. 17, the process proceeds to step ST8e only when it is determined that the maximum value Mx and the minimum value Mn are detected in step ST7, but it is determined that the maximum value Mx and the minimum value Mn are detected in step ST7. Even if this is not the case, it may be determined whether or not there is a syllable transition based on the result of the determination of whether it is voiced or unvoiced in step ST18.

Even when the speech speed conversion device of FIG. 8 is configured by the computer of FIG. 12, the same effect as described in the fifth embodiment can be obtained.

Next, when the speech speed conversion device of FIG. 9 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure shown in FIG. 18 is generally the same as the processing procedure shown in FIG. 14, but the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8. It differs in that.

The processing of steps ST18 and ST8e is the same as the processing of steps ST18 and ST8e described with reference to FIG.

Even when the speech speed conversion device of FIG. 9 is configured by the computer of FIG. 12, the same effect as described in the sixth embodiment can be obtained.

Next, when the speech speed conversion device of FIG. 10 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure shown in FIG. 19 is generally the same as the processing procedure shown in FIG. 15, but the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8. It differs in that.
The processing of steps ST18 and ST8e is the same as the processing of steps ST18 and ST8e described with reference to FIG.

Even when the speech speed conversion device of FIG. 10 is configured by the computer of FIG. 12, the same effect as described in the seventh embodiment can be obtained.

Next, when the speech speed conversion device of FIG. 11 is configured by the computer of FIG. 12, the procedure of processing by the processor 91 will be described with reference to FIG.
The processing procedure shown in FIG. 20 is generally the same as the processing procedure shown in FIG. 16, but the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8. It differs in that.
The processing of steps ST18 and ST8e is the same as the processing of steps ST18 and ST8e described with reference to FIG.

Even when the speech speed conversion device of FIG. 11 is configured by the computer of FIG. 12, the same effect as described in the eighth embodiment can be obtained.

Various modifications are possible to the above embodiments.
For example, the modifications described with respect to Embodiment 1 are also applicable to Embodiments 2-8. The modifications described with respect to embodiment 5 are also applicable to embodiments 6-8.
Further, the modifications described with respect to the processing procedure of FIG. 17 can also be applied to the processing procedures of FIGS. 18 to 20.

In the above embodiment, the case of lowering the speaking speed in order to make the voice easier to hear has been described, but the configuration of the above embodiment can also be applied to the case of increasing the speaking speed. For example, when St / Sr calculated by the conversion rate determination unit 8 is larger than 1, the speaking speed conversion may be performed using the St / Sr as the speaking speed conversion rate Rc.

Although the speech speed conversion device has been described above, it is also possible to implement the speech speed conversion method by using the speech speed conversion device, and the computer is made to execute the processing in the speech speed conversion device or the speech speed conversion method by a program. It is also possible.

1 voice decoding unit, 2 frequency information generation unit, 3 information change amount calculation unit, 4 extreme value detection unit, 5 sound detection unit, 6 syllable transition determination unit, 7 speech speed calculation unit, 8 conversion rate determination unit, 9 episodes Speed conversion unit, 10 voice information extraction unit, 21 LSP coefficient extraction unit, 22 LPC coefficient extraction unit, 23 mer cepstrum calculation unit, 24, 24d thinning unit, 51 low level detection unit, 52 high level detection unit, 53 OR operation Unit, 54 hangover addition part, 55 noise level calculation part, 56 threshold setting part, 90 computer, 91 processor, 92 memory, 101 adaptive codebook vector decoding part, 102 gain decoding part, 103 fixed codebook vector decoding part, 104 adaptation Pre-filter unit, 105 predictive gain calculation unit, 106 excitation signal generation unit, 107 LSP coefficient decoding unit, 108 interpolation unit, 109 LPC coefficient conversion unit, 110 composite filter unit, 111 post-filter unit, 511,521 comparison unit, 513, 523 Judgment unit.

Claims

In a speech speed converter that converts speech speed in a voice communication device,
A voice decoding unit that decodes high-efficiency encoded voice code data and outputs a voice signal,
A frequency information generation unit that generates frequency information from information obtained in the process of decoding the voice code data in the voice decoding unit, and a frequency information generation unit.
An information change amount calculation unit that obtains the time change amount of the generated frequency information at regular time intervals as the information change amount, and
Based on the voice signal, a sound detection unit that determines whether the received voice represented by the voice code data is sound or no sound, and a sound detection unit.
A syllable transition determination unit that determines that a syllable of the received voice has transitioned when the amount of change in information while the received voice is determined to be sound by the sound detection unit satisfies a predetermined condition. When,
A speech speed calculation unit that calculates the speech speed based on the determination result by the syllable transition determination unit, and
A conversion rate determination unit that determines the conversion rate based on the speech speed calculated by the speech speed calculation unit,
A speech speed conversion device including a speech speed conversion unit that converts the speech speed of the audio signal at a conversion rate determined by the conversion rate determination unit.
The predetermined conditions are
The amount of change in information has a maximum value and a minimum value within a certain period of time.
The maximum value is larger than the predetermined first threshold value,
The speech speed conversion device according to claim 1, wherein the difference between the maximum value and the minimum value is larger than the second threshold value.
The voice decoding unit further includes a voiced information extraction unit that extracts information indicating whether or not the received voice is a voiced sound or an unvoiced sound from the information obtained in the process of decoding the voice code data.
Based on the information extracted by the voiced information extraction unit, the syllable transition determination unit determines that the syllable of the received voice has transitioned even when it is detected that the voiced state has changed from the unvoiced state, or The speech speed conversion device according to claim 1 or 2, wherein it is determined that the syllable of the received voice has changed even when it is detected that the voiced state has changed to the unvoiced state.
The frequency information generation unit extracts the LSP coefficient in the voice decoding unit at regular time intervals, and then extracts the LSP coefficient.
The information change amount calculation unit calculates the distance between the LSP coefficient vector composed of the latest extracted LSP coefficient and the LSP coefficient vector composed of the LSP coefficient extracted in the past as the information change amount. The speech speed conversion device according to any one of 1 to 3.
The frequency information generator thins out the LSP coefficient,
The speech speed conversion device according to claim 4, wherein the information change amount calculation unit calculates the information change amount based on an LSP coefficient vector composed of the LSP coefficient after thinning.
The frequency information generation unit extracts the LPC coefficient in the voice decoding unit at regular time intervals, converts the extracted LPC coefficient into LPC cepstrum or LPC cepstrum, and converts the extracted LPC coefficient into LPC cepstrum or LPC cepstrum.
The information change amount calculation unit is the distance between the LPC cepstrum vector composed of the latest converted LPC cepstrum and the LPC cepstrum vector composed of the LPC cepstrum converted in the past, or the latest converted LPC cepstrum. The story according to any one of claims 1 to 3, wherein the distance between the LPC mel cepstrum vector composed of the above and the LPC mel cepstrum vector composed of the LPC mel cepstrum converted in the past is calculated as the amount of change in the information. Speed converter.
The speech speed conversion device according to claim 6, wherein the frequency information generation unit thins out the LPC coefficient and converts the thinned-out LPC coefficient into the LPC cepstrum or the LPC cepstrum.
In the speech speed conversion method for converting the speech speed in the voice communication device,
Decodes highly efficient coded audio code data, outputs an audio signal,
Frequency information is generated from the information obtained in the process of decoding the voice code data.
The amount of time change of the generated frequency information is obtained as the amount of information change.
Based on the voice signal, it is determined whether the received voice represented by the voice code data is sounded or silent.
When the amount of change in information while the received voice is determined to be sound satisfies a predetermined condition, it is determined that the syllable of the received voice has changed.
The speaking speed is calculated based on the judgment result of the syllable transition, and the speaking speed is calculated.
The conversion rate is determined based on the calculated speech speed,
A speaking speed conversion method for converting the speaking speed of the audio signal at the determined conversion rate.
A program for causing a computer to execute the process according to the speech speed conversion method according to claim 8.
A computer-readable recording medium on which the program according to claim 9 is recorded.