WO2021166158A1 - Speaking speed conversion device, speaking speed conversion method, program, and storage medium - Google Patents


Info

Publication number
WO2021166158A1
WO2021166158A1 (PCT/JP2020/006780)
Authority
WO
WIPO (PCT)
Prior art keywords
unit
voice
information
speech speed
LPC
Prior art date
Application number
PCT/JP2020/006780
Other languages
French (fr)
Japanese (ja)
Inventor
茂明 鈴木
木村 勝
Original Assignee
三菱電機株式会社
Application filed by 三菱電機株式会社
Priority to JP2021570271A (patent JP7019117B2)
Priority to PCT/JP2020/006780
Priority to TW109129092A
Publication of WO2021166158A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/04 Time compression or expansion
    • G10L21/043 Time compression or expansion by changing speed

Definitions

  • This disclosure relates to a speech speed conversion device, a speech speed conversion method, a program, and a recording medium.
  • In voice communication that sends and receives high-efficiency encoded voice data, speech speed conversion technology that slows down or speeds up the playback speed without changing the voice quality has been developed to make the voice easier to hear. When converting the speaking speed in voice communication, the speaking speed of sounded sections is typically lowered to make them easier to hear, while part or all of each silent section is deleted, or its playback is sped up, to prevent an increase in delay.
  • Lowering the speaking speed of fast speech can improve ease of hearing, but lowering the speaking speed of already slow speech obscures the rhythm of the speech and may instead impair ease of listening. A mechanism for measuring the speech speed of the voice before the speech speed conversion is therefore required.
  • A technique for measuring speech speed by obtaining a spectral feature of the spoken voice has been disclosed (Patent Document 1).
  • In that technique, the speech is subjected to spectral analysis every 10 ms by all-pole-model linear predictive coding (LPC) or by fast Fourier transform (FFT), and a spectral feature vector is obtained based on the analysis result.
  • This disclosure is made to solve the above-mentioned problems. Its purpose is to reduce the amount of calculation, to accurately measure the speaking speed of the voice signal obtained by decoding the voice code data, and thereby to make it possible to perform an appropriate speech speed conversion according to that speaking speed.
  • The speech speed conversion device of the present disclosure converts the speech speed in a voice communication device and includes: a voice decoding unit that decodes high-efficiency encoded voice code data and outputs a voice signal; a frequency information generation unit that generates frequency information from information obtained in the process of decoding the voice code data in the voice decoding unit; an information change amount calculation unit that obtains, at regular time intervals, the time change of the generated frequency information as an information change amount; a sound detection unit that determines, based on the voice signal, whether the received voice represented by the voice code data is sounded or silent; a syllable transition determination unit that determines that a syllable of the received voice has transitioned when the information change amount, while the received voice is determined by the sound detection unit to be sounded, satisfies a predetermined condition; a speech speed calculation unit that calculates the speech speed based on the determination result of the syllable transition determination unit; a conversion rate determination unit that determines a conversion rate based on the calculated speech speed; and a speech speed conversion unit that converts the speaking speed of the voice signal at the determined conversion rate.
  • FIG. 1 is a block diagram showing the structure of the speech speed conversion device according to Embodiment 1.
  • FIG. 2 is a block diagram showing a structural example of the voice decoding unit of FIG. 1.
  • FIG. 3 is a block diagram showing a structural example of the sound detection unit of FIG. 1. FIGS. 4(a) to 4(h) are time charts showing signals appearing in each part of the sound detection unit of FIG. 3.
  • FIG. 8 is a block diagram showing the structure of the speech speed conversion device according to Embodiment 5.
  • Embodiment 1. FIG. 1 shows the configuration of the speech speed conversion device according to Embodiment 1.
  • The illustrated speech speed conversion device converts the speech speed of the received voice in a voice communication device, and has a voice decoding unit 1, a frequency information generation unit 2, an information change amount calculation unit 3, an extreme value detection unit 4, a sound detection unit 5, a syllable transition determination unit 6, a speech speed calculation unit 7, a conversion rate determination unit 8, and a speech speed conversion unit 9.
  • The voice code data Da includes, for each voice frame, pitch period information of the voice, information representing a fixed codebook vector, gain information, and information representing LSP coefficients. Voice frames are simply referred to as frames below.
  • the voice decoding unit 1 decodes the voice code data Da and generates a voice signal (decoded voice signal) Db representing a linear PCM (Pulse Code Modulation) code.
  • the frequency information generation unit 2 extracts and outputs the frequency information Fa from the information generated in the decoding process in the voice decoding unit 1 at regular intervals.
  • The frequency information Fa represents the vocal tract frequency characteristics when each phoneme is uttered.
  • the information change amount calculation unit 3 calculates the time change amount (information change amount) Vf of the frequency information Fa output from the frequency information generation unit 2 at regular time intervals.
  • the extreme value detection unit 4 detects the maximum value Mx and the minimum value Mn of the information change amount Vf calculated by the information change amount calculation unit 3.
  • The sound detection unit 5 determines, based on the voice signal Db output from the voice decoding unit 1, whether the voice (received voice) represented by the voice code data Da is sounded or silent, and outputs information indicating the result, that is, the sound/silence information Lm.
  • The syllable transition determination unit 6 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected by the extreme value detection unit 4 and the sound/silence information Lm output from the sound detection unit 5, and outputs the determination result Sy.
  • the speaking speed calculation unit 7 calculates the speaking speed Ss based on the determination result Sy of the syllable transition determination unit 6.
  • the speaking speed Ss is represented by the number of syllables per unit time.
  • the conversion rate determination unit 8 determines the speech speed conversion rate Rc of the received voice based on the speech speed Ss calculated by the speech speed calculation unit 7.
  • the speech speed conversion unit 9 performs a speech speed conversion process on the audio signal Db based on the speech speed conversion rate Rc determined by the conversion rate determination unit 8, and outputs the converted audio signal Dc.
  • The voice decoding unit 1 receives the high-efficiency encoded voice code data Da, decodes it into a linear PCM code, and outputs the voice signal (decoded voice signal) Db representing that code.
  • FIG. 2 shows a configuration example of the audio decoding unit 1 of FIG.
  • The voice decoding unit 1 shown in FIG. 2 conforms to the CS-ACELP (Conjugate-Structure Algebraic-Code-Excited Linear Prediction) coding method specified in ITU-T (International Telecommunication Union Telecommunication Standardization Sector) Recommendation G.729.
  • The voice decoding unit 1 shown in FIG. 2 has an adaptive codebook vector decoding unit 101, a gain decoding unit 102, a fixed codebook vector decoding unit 103, an adaptive prefilter unit 104, a predicted gain calculation unit 105, an excitation signal generation unit 106, an LSP coefficient decoding unit 107, an interpolation unit 108, an LPC coefficient conversion unit 109, a synthesis filter unit 110, and a post filter unit 111.
  • the adaptive codebook vector decoding unit 101 decodes the pitch period information of the voice from the voice code data Da of each received frame and generates the adaptive codebook vector.
  • the adaptive codebook vector represents an excitation signal generated in the past. Considering that the audio signal has a strong periodicity, it can be said that the excitation signal generated in the past is stored and reused based on the pitch period information.
  • The fixed codebook vector decoding unit 103 decodes the fixed codebook vector from the voice code data Da of each received frame.
  • the adaptive prefilter unit 104 emphasizes the pitch component of the decoded fixed codebook vector.
  • the gain decoding unit 102 decodes the gain information from the received voice code data Da of each frame, and outputs the gain of the adaptive codebook vector and the gain of the fixed codebook vector.
  • The predicted gain calculation unit 105 finds the predicted gain of the fixed codebook vector based on the gain of the fixed codebook vector of each frame output from the gain decoding unit 102 and the past fixed codebook vectors output from the adaptive prefilter unit 104.
  • The excitation signal generation unit 106 generates the excitation signal Se using the adaptive codebook vector of each frame output from the adaptive codebook vector decoding unit 101, the fixed codebook vector of each frame output from the adaptive prefilter unit 104, the gain of the adaptive codebook vector of each frame output from the gain decoding unit 102, and the predicted gain of the fixed codebook vector output from the predicted gain calculation unit 105.
  • the LSP coefficient decoding unit 107 decodes the LSP coefficient from the voice code data Da of each received frame.
  • In the CS-ACELP coding method, the frame length is 10 milliseconds, so the 10th-order LSP coefficients are decoded every 10 milliseconds.
  • the interpolation unit 108 uses the LSP coefficient of the current frame and the LSP coefficient of the previous frame to generate an LSP coefficient at an intermediate timing between them, that is, 5 milliseconds before the current frame by interpolation.
  • the LPC coefficient conversion unit 109 converts the LSP coefficient of the current frame and the LSP coefficient generated by interpolation into an LPC (Linear Predictive Coding) coefficient.
  • The synthesis filter unit 110 is an all-pole filter having the LPC coefficients output from the LPC coefficient conversion unit 109 as its filter coefficients, and generates the synthesized voice signal Sf with the excitation signal Se generated by the excitation signal generation unit 106 as its input.
  • The post filter unit 111 emphasizes the pitch component of the synthesized voice signal Sf generated by the synthesis filter unit 110 to improve the audible quality.
  • The post filter unit 111 is a cascade of a plurality of filters.
  • the long-term post filter among the plurality of filters is a filter that emphasizes the pitch component, and in this long-term post filter, a gain coefficient that controls the degree of emphasis of the pitch component is used.
  • The gain coefficient is generated in the processing of the long-term post filter. Specifically, the delay at which the autocorrelation of the synthesized signal output by the synthesis filter unit 110 becomes large is searched for; if the autocorrelation at that delay is small, the gain coefficient is set to 0, and otherwise a coefficient (greater than 0 and less than 1) is set so as to emphasize the delay component (pitch).
  • The output of the post filter unit 111 is the output of the voice decoding unit 1, that is, the decoded voice signal Db.
  • the frequency information generation unit 2 extracts and outputs the frequency information Fa from the information of each frame generated in the decoding process in the voice decoding unit 1.
  • the frequency information generation unit 2 shown in FIG. 1 includes an LSP coefficient extraction unit 21.
  • the LSP coefficient extraction unit 21 extracts the LSP (Line Spectral Pair) coefficient for each frame from the information generated by the decoding operation of the voice decoding unit 1, and outputs it as the frequency information Fa.
  • When the voice decoding unit 1 decodes by the CS-ACELP coding method, a 10th-order LSP coefficient set is decoded for each frame; the LSP coefficient extraction unit 21 of FIG. 1 extracts this information, that is, the 10th-order LSP coefficients output from the LSP coefficient decoding unit 107, for each frame and outputs them as the frequency information Fa.
  • The 10th-order LSP coefficients of each frame can be regarded as constituting one 10-dimensional LSP coefficient vector.
  • the information change amount calculation unit 3 calculates the distance (inter-vector distance) between the LSP coefficient vector of the current frame and the LSP coefficient vector one frame before as the information change amount Vf.
  • Let n (an integer) represent the current time (encoded frame number), let n-i (i an integer) represent the time i frames before the current time n, and let the LSP coefficient vector at time n be (f1(n), f2(n), ..., f10(n)). The inter-vector distance d(n) is then obtained by the following equation (1).
  • d(n) = {f1(n) - f1(n-1)}² + {f2(n) - f2(n-1)}² + ... + {f10(n) - f10(n-1)}²   ... (1)
  • Hereinafter, n indicates the current time.
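  • As an illustration, a minimal Python sketch of the distance computation of equation (1) might look as follows; the source of the decoded LSP vectors is assumed, and the sample values shown are arbitrary, not decoder output:

```python
import numpy as np

def lsp_distance(lsp_curr: np.ndarray, lsp_prev: np.ndarray) -> float:
    """Squared Euclidean distance between two 10-dimensional LSP
    coefficient vectors, as in equation (1)."""
    diff = lsp_curr - lsp_prev
    return float(np.dot(diff, diff))

# d(n) from the LSP vectors of the current frame and the previous frame
# (illustrative values only).
lsp_prev = np.array([0.03, 0.07, 0.12, 0.18, 0.25, 0.33, 0.40, 0.46, 0.51, 0.56])
lsp_curr = np.array([0.04, 0.08, 0.14, 0.19, 0.27, 0.34, 0.41, 0.47, 0.52, 0.57])
d_n = lsp_distance(lsp_curr, lsp_prev)
```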
  • the extreme value detection unit 4 detects the maximum value Mx and the minimum value Mn of the information change amount Vf within a certain period in the latest past.
  • The fixed period in the most recent past referred to here is the period of the most recent Na frames, that is, the period from the current time n back to the time n-Na+1, which is (Na-1) frames earlier (Na is an integer of 4 or more).
  • The extreme value detection unit 4 determines whether the inter-vector distance d(n-1) at time n-1, one frame before the current time n, is a maximum. If d(n-1) is a maximum, it then detects the minimum within the most recent Na frames.
  • If d(n) is smaller than d(n-1) and d(n-1) is larger than d(n-2), d(n-1) is determined to be a maximum; if this condition is not satisfied, d(n-1) is determined not to be a maximum.
  • When d(n-1) is a maximum, the minimum is identified next. Specifically, the latest (largest) time m satisfying d(m) larger than d(m-1), d(m-1) smaller than d(m-2), and n-Na+2 ≤ m ≤ n-1 is searched for. When such an m exists, d(m-1) is determined to be the minimum. When no such m exists, d(n-Na+1) is taken as the minimum for convenience; this minimum of convenience corresponds to the smallest value among d(n-Na+1), d(n-Na+2), ..., d(n-2).
  • The extreme value detection unit 4 takes the maximum and minimum detected as described above as the maximum value Mx and the minimum value Mn.
  • The value of Na is set so that Na times the frame length is equal to or greater than the maximum syllable length expected in normal utterance.
  • A syllable lasts several tens of milliseconds when short and at most about 200 milliseconds when long, so it is appropriate to set Na to a value corresponding to 200 milliseconds.
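  • A minimal sketch of this extremum search, assuming the most recent Na inter-vector distances are kept in an array whose last element is d(n) and that the search is restricted to indices available in the window:

```python
import numpy as np

def detect_extrema(d):
    """Detect the maximum Mx at d(n-1) and the accompanying minimum Mn.
    d holds the most recent Na distances d(n-Na+1) .. d(n), so d[-1]
    is d(n). Returns (Mx, Mn), or None when d(n-1) is not a maximum."""
    if not (d[-1] < d[-2] and d[-2] > d[-3]):   # d(n-1) must be a maximum
        return None
    mx = d[-2]
    # Search backwards for the latest m with d(m) > d(m-1) < d(m-2).
    for m in range(len(d) - 2, 1, -1):
        if d[m] > d[m - 1] and d[m - 1] < d[m - 2]:
            return mx, d[m - 1]
    # No minimum found: fall back to the smallest of the older values.
    return mx, float(np.min(d[:-2]))
```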
  • The sound detection unit 5 determines the sound/silence of the voice signal output from the voice decoding unit 1 and outputs information indicating the determination result, that is, the sound/silence information Lm. This determination is made every few milliseconds to several tens of milliseconds, for example every frame period or an integral multiple thereof. Hereinafter, this determination is described as being performed every frame period.
  • the sound detection unit 5 determines whether or not there is sound based on the amplitude of the voice signal Db output from the voice decoding unit 1.
  • FIG. 3 shows a configuration example of the sound detection unit 5.
  • The illustrated sound detection unit 5 includes a low level detection unit 51, a high level detection unit 52, an OR unit 53, a hangover addition unit 54, a noise level calculation unit 55, and a threshold setting unit 56.
  • The low level detection unit 51 compares the audio signal Db with the adaptive threshold D56 and outputs the signal D51 based on the comparison result.
  • The adaptive threshold D56 is supplied from the threshold setting unit 56.
  • the low level detection unit 51 includes a comparison unit 511 and a determination unit 513.
  • The comparison unit 511 compares the audio signal Db with the adaptive threshold D56 and outputs a signal D511 indicating the comparison result.
  • The signal D511 is High if the audio signal Db is greater than the threshold D56 (that is, if the absolute value of the sample value of the audio signal Db is greater than the threshold D56), and Low otherwise. The comparison is made every sample cycle.
  • the determination unit 513 outputs the signal D51 based on the signal D511.
  • the signal D51 becomes High when the signal D511 continues in the High state for a certain period of time or longer, and becomes Low immediately when the signal D511 becomes Low.
  • the high level detection unit 52 compares the audio signal Db with a predetermined threshold value D50, and outputs signals D52 and D521 based on the comparison result.
  • the threshold D50 is set to a value higher than the maximum background noise level normally expected.
  • the high level detection unit 52 includes a comparison unit 521 and a determination unit 523.
  • the comparison unit 521 compares the audio signal Db with the threshold value D50, and outputs a signal D521 indicating the comparison result.
  • the signal D521 is High if the audio signal Db is larger than the threshold value D50, and Low otherwise. The comparison is made for each sample cycle.
  • the determination unit 523 outputs the signal D52 based on the signal D521.
  • the signal D52 becomes High when the signal D521 continues to be in the High state for a certain period of time or longer, and becomes Low immediately when the signal D521 becomes Low.
  • the OR unit 53 performs an operation to obtain the OR of the signal D51 and the signal D52.
  • the output signal D53 of the OR unit 53 is High if at least one of the signals D51 or D52 is High, and Low otherwise.
  • the hangover addition unit 54 performs a hangover addition process on the output signal D53 of the OR unit 53, and outputs the signal obtained as a result as sound / silence information Lm.
  • The hangover process outputs a signal (Lm) that changes from Low to High immediately when the input signal (D53) changes from Low to High, and that changes from High to Low only after a certain delay time when the input signal (D53) changes from High to Low.
  • The noise level calculation unit 55 calculates, at regular intervals, the average D55a of the absolute values of the sample values of the audio signal Db, and obtains the noise level value D55 based on it; for example, the noise level value D55 is updated as a moving average of the averages D55a taken over relatively long periods. However, the average value D55a obtained during periods when the signal D521 is High is not used for the moving average; during such periods the previously calculated moving average is maintained.
  • The threshold setting unit 56 adjusts the adaptive threshold D56 according to the background noise level value D55 output from the noise level calculation unit 55.
  • The adaptive threshold D56 is adjusted to a value slightly larger than the calculated noise level value D55.
  • The adaptive threshold D56 is changed so as to follow changes in the calculated noise level value D55.
  • The operation of the sound detection unit 5 is described below with reference to FIGS. 4(a) to 4(h).
  • Assume that the threshold D50 is set to the value shown in FIG. 4(a) and that the audio signal Db changes as shown in FIG. 4(a).
  • While the audio signal Db exceeds the threshold D50, the output signal D521 of the comparison unit 521 becomes High as shown in FIG. 4(c), and the output signal D52 of the determination unit 523 becomes High with a slight delay, as shown in FIG. 4(d).
  • The average value D55a and the noise level value D55 calculated by the noise level calculation unit 55 change as shown in FIG. 4(b), and the adaptive threshold D56 calculated by the threshold setting unit 56 changes as shown in FIG. 4(a).
  • The noise level value D55 shown in FIG. 4(b) and the adaptive threshold D56 shown in FIG. 4(a) change according to the average value D55a, but during the period when the audio signal Db is larger than the threshold D50 (the period when the signal D521 is High) they do not change and are held at their immediately preceding values.
  • While the audio signal Db exceeds the adaptive threshold D56, the output signal D511 of the comparison unit 511 becomes High as shown in FIG. 4(e), and the output signal D51 of the determination unit 513 becomes High with a slight delay, as shown in FIG. 4(f).
  • When the audio signal Db falls below the adaptive threshold D56, the output signal D511 of the comparison unit 511 becomes Low as shown in FIG. 4(e), and the output signal D51 of the determination unit 513 also becomes Low, as shown in FIG. 4(f).
  • the output signal D53 of the OR unit 53 rises with the rising edge of the signal D51 and falls with the falling edge of the signal D51.
  • the output signal Lm of the hangover addition unit 54 rises with the rise of the signal D53 and falls with a slight delay from the fall of the signal D53.
  • the signal Lm indicates that there is sound when it is High, and that it is silent when it is Low.
  • In this way, the sound detection unit 5 generates the adaptive threshold D56 that follows the noise level value, determines that sound is present if the audio signal Db is larger than either the adaptive threshold D56 or the threshold D50, and determines silence otherwise, so it can appropriately determine whether sound is present or absent.
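  • The following Python sketch condenses this scheme to one decision per frame (the document compares per sample); the thresholds, smoothing factors, hold length, and hangover length are arbitrary assumptions, not values from the disclosure:

```python
import numpy as np

class SoundDetector:
    """Simplified per-frame sketch of the sound detection unit 5."""

    def __init__(self, d50=0.2, hold=3, hangover=20):
        self.d50 = d50            # fixed high threshold D50
        self.noise = 0.0          # noise level value D55
        self.hold = hold          # frames a comparison must stay High
        self.hangover = hangover  # frames Lm is held High after D53 falls
        self.low_run = self.high_run = self.hang = 0

    def process(self, frame: np.ndarray) -> bool:
        level = float(np.mean(np.abs(frame)))   # average |sample| (D55a)
        d521 = level > self.d50                 # comparison unit 521
        if not d521:                            # freeze noise estimate while D521 is High
            self.noise = 0.99 * self.noise + 0.01 * level
        d56 = 1.5 * self.noise + 1e-4           # adaptive threshold D56
        d511 = level > d56                      # comparison unit 511
        self.low_run = self.low_run + 1 if d511 else 0    # determination unit 513
        self.high_run = self.high_run + 1 if d521 else 0  # determination unit 523
        d53 = self.low_run >= self.hold or self.high_run >= self.hold  # OR unit 53
        self.hang = self.hangover if d53 else max(0, self.hang - 1)    # hangover unit 54
        return self.hang > 0                    # sound/silence information Lm
```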
  • The syllable transition determination unit 6 determines the presence or absence of a syllable transition, and outputs the determination result Sy, only when the maximum value Mx and the minimum value Mn have been detected by the extreme value detection unit 4 and sound has been detected by the sound detection unit 5.
  • When the maximum value Mx is larger than a predetermined threshold (first threshold) Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than a predetermined threshold (second threshold) Th2, it is determined that there is a syllable transition; otherwise it is determined that there is no syllable transition. A minimal sketch of this rule follows.
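  • The threshold values here are tuning parameters; the function below simply restates the rule (the default values of Th1 and Th2 are hypothetical and would have to be chosen empirically):

```python
def syllable_transition(mx: float, mn: float, is_sound: bool,
                        th1: float = 0.01, th2: float = 0.005) -> bool:
    """Rule of the syllable transition determination unit 6: a transition
    is declared only during sounded speech, when the local maximum Mx
    exceeds Th1 and the swing Mx - Mn exceeds Th2."""
    return is_sound and mx > th1 and (mx - mn) > th2
```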
  • the speech speed calculation unit 7 calculates the speech speed Ss at regular calculation cycles based on the determination result Sy of the syllable transition determination unit 6 and the sound / silence information Lm from the sound detection unit 5.
  • the constant calculation period is, for example, 1 second.
  • The number of syllables per unit time is calculated from the number of syllables counted during a certain period ending at the most recent sounded point and the length of that period, and is output as the speaking speed Ss.
  • the "constant period" is an integral multiple of the frame period, and is, for example, about 3 seconds to 10 seconds.
  • While silence continues, the previously calculated speaking speed may be used as it is.
  • the conversion rate determination unit 8 determines the speech speed conversion rate Rc based on the speech speed Ss calculated by the speech speed calculation unit 7.
  • The speaking speed conversion rate Rc is determined in the same cycle as the calculation of the speaking speed Ss; in other words, the conversion rate is recalculated each time the speaking speed is calculated. For example, when the target speech speed after conversion is St syllables/second and the speech speed calculated by the speech speed calculation unit 7 is Sr syllables/second, St/Sr is taken as the speech speed conversion rate Rc for voice in the sounded state. However, since the speaking speed must be lowered to make the voice easier to hear, the conversion rate Rc is set to 1 when St/Sr is larger than 1.
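  • As a sketch of these two steps (the target speed and the counts in the example are illustrative assumptions):

```python
def speaking_speed(syllable_count: int, period_sec: float) -> float:
    """Speaking speed Ss in syllables per second, from the syllable
    transitions counted over the most recent sounded period."""
    return syllable_count / period_sec

def conversion_rate(st_target: float, sr_measured: float) -> float:
    """Conversion rate Rc = St/Sr, clipped to 1 so that sounded
    sections are never sped up."""
    if sr_measured <= 0.0:
        return 1.0  # no measurement yet: leave the speed unchanged
    return min(1.0, st_target / sr_measured)

# Example: 8 syllables in a 2-second sounded period, target 3 syll/s
# -> Sr = 4.0, Rc = 0.75 (playback slowed to 75 % speaking speed).
rc = conversion_rate(3.0, speaking_speed(8, 2.0))
```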
  • the speech speed conversion unit 9 converts the speech speed according to the speech speed conversion rate Rc from the conversion rate determination unit 8, and outputs the converted audio signal Dc.
  • The conversion rate determined by the conversion rate determination unit 8 in each calculation cycle (for example, every second) is applied until the next conversion rate is determined.
  • the speech speed conversion can be realized by using, for example, a well-known algorithm such as PICOLA (Pointer Interval Control Overlap and Add) or TDHS (Time Domain Harmonic Scaling).
  • The speech speed conversion rate Rc input from the conversion rate determination unit 8 is 1 or less, but if the speech speed is always slowed down, the processing delay of the speech speed conversion keeps growing and the real-time nature of the voice communication cannot be maintained. Therefore, the sound/silence information Lm from the sound detection unit 5 is also input to the speech speed conversion unit 9, and for portions determined to be silent the speaking speed is increased or the silent portion is deleted, so that the audio signal is not delayed beyond a certain amount. Further, when the delay exceeds a certain value, the voice may be output without speech speed conversion even in sounded portions.
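  • PICOLA and TDHS splice waveform segments at pitch-synchronized points. As a rough illustration of time-scale modification without pitch change, here is a plain overlap-add stretch (not PICOLA itself; window and hop sizes are arbitrary assumptions for 8 kHz speech):

```python
import numpy as np

def ola_stretch(x: np.ndarray, rate: float, win: int = 320, hop_syn: int = 160) -> np.ndarray:
    """Multiply the speaking speed of x by `rate` (rate < 1 slows it).
    Windows taken every round(hop_syn*rate) samples are overlap-added
    every hop_syn samples with a Hann cross-fade."""
    hop_ana = max(1, int(round(hop_syn * rate)))
    w = np.hanning(win)
    n_frames = max(1, (len(x) - win) // hop_ana + 1)
    out = np.zeros(hop_syn * (n_frames - 1) + win)
    norm = np.zeros_like(out)
    for i in range(n_frames):
        seg = x[i * hop_ana : i * hop_ana + win]
        if len(seg) < win:
            break
        out[i * hop_syn : i * hop_syn + win] += w * seg
        norm[i * hop_syn : i * hop_syn + win] += w
    norm[norm < 1e-8] = 1.0   # avoid division by zero at the edges
    return out / norm
```

  • With rate = 0.75 the analysis hop is 120 samples against a synthesis hop of 160, so the output is about 4/3 as long; unlike simple resampling, the pitch of the voice is preserved.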
  • As described above, in Embodiment 1 the frequency information generation unit 2 extracts the LSP coefficients obtained in the voice decoding process of the voice decoding unit 1, the information change amount calculation unit 3 obtains the amount of change of the LSP coefficients at regular time intervals, the syllable transitions of the voice signal are detected based on this amount of change, and the speech speed is converted based on the detection result.
  • In the above description, the information change amount calculation unit 3 regards the 10th-order LSP coefficients as one 10-dimensional vector and obtains the inter-vector distance d(n). It is not always necessary to use all of the coefficients; for example, a vector consisting of only the LSP coefficients f1(n), f2(n), f3(n) may be used.
  • In that case, the inter-vector distance d(n) may be obtained by the following equation (2).
  • d(n) = {f1(n) - f1(n-1)}² + {f2(n) - f2(n-1)}² + {f3(n) - f3(n-1)}²   ... (2)
  • Among the LSP coefficients, low-order coefficients correspond to low-frequency components and high-order coefficients to high-frequency components, so the above calculation obtains the amount of change while focusing only on the low-frequency components. Since the change in vocal tract frequency characteristics during utterance is larger on the low-frequency side, syllable transitions can still be detected this way while the amount of calculation becomes smaller. There is also the advantage that the influence of background noise superimposed on the audio signal is easier to exclude.
  • In the above description, the information change amount calculation unit 3 obtains the distance between the latest LSP coefficient vector and the LSP coefficient vector one frame before it, but the distance from an LSP coefficient vector two or more frames before may be obtained instead. For example, when the distance from the LSP coefficient vector two frames before is used, the inter-vector distance d(n) is obtained by the following equation (3).
  • d(n) = {f1(n) - f1(n-2)}² + {f2(n) - f2(n-2)}² + ... + {f10(n) - f10(n-2)}²   ... (3)
  • The frame length of the voice decoding process in the CS-ACELP coding method is 10 milliseconds, which is sufficiently short relative to the time variation of the vocal tract frequency characteristics of the voice, so no problem arises even if the distance from a vector several frames before is used.
  • In short, the time difference between the two vectors for which the distance is calculated should be set shorter than the period of syllable transitions in the utterance.
  • As described above, in Embodiment 1 the frequency information generation unit 2 extracts the LSP coefficients obtained in the voice decoding process of the voice decoding unit 1, the information change amount calculation unit 3 obtains their amount of change at regular time intervals, and syllable transitions of the voice signal are detected based on this amount of change. It is therefore unnecessary to subject the decoded voice signal to spectral analysis by all-pole-model linear predictive coding (LPC) or by fast Fourier transform (FFT) and then to obtain a spectral feature vector, so the speech speed conversion can be performed with a small amount of calculation.
  • Moreover, the LSP coefficients used for detecting syllable transitions are calculated in the process of encoding the voice and are the information transmitted to the voice decoding unit 1; they are therefore closer to the vocal tract frequency characteristics of the voice represented by the signal before encoding than frequency characteristics obtained by spectral analysis of the decoded voice signal. The speaking speed can thus be measured with high accuracy even for a voice signal obtained by decoding voice code data.
  • Embodiment 2. In Embodiment 1, the LSP coefficients were extracted as the frequency information Fa, and syllable transitions were detected based on their time change amount. It is also possible to extract the LPC coefficients instead of the LSP coefficients, calculate the LPC mel cepstrum or LPC cepstrum from the extracted LPC coefficients, and detect syllable transitions using the calculated LPC mel cepstrum or LPC cepstrum.
  • FIG. 5 shows the configuration of the speech speed conversion device according to the second embodiment.
  • the speech speed conversion device shown in FIG. 5 is generally the same as the speech speed conversion device of the first embodiment described with reference to FIG. 1, but differs in the following points. That is, instead of the frequency information generation unit 2 and the information change amount calculation unit 3, the frequency information generation unit 2b and the information change amount calculation unit 3b are provided.
  • The frequency information generation unit 2b includes an LPC coefficient extraction unit 22 and a mel cepstrum calculation unit 23.
  • the LPC coefficient extraction unit 22 extracts the LPC coefficient for each frame from the information generated by the decoding operation of the voice decoding unit 1. For example, when the voice decoding unit 1 performs voice decoding by the CS-ACELP coding method, a part of the output of the LPC coefficient conversion unit 109 in FIG. 2 is extracted.
  • the LPC coefficient conversion unit 109 of the audio decoding unit 1 in FIG. 2 converts the LSP coefficient of each frame and the LSP coefficient generated by the interpolation by the interpolation unit 108 into an LPC coefficient and outputs the LPC coefficient.
  • the LPC coefficient extraction unit 22 extracts the LPC coefficient generated by converting the LSP coefficient of each frame. For example, the 10th-order LPC coefficient is extracted.
  • the mel cepstrum calculation unit 23 converts the LPC coefficient extracted by the LPC coefficient extraction unit 22 into an LPC mel cepstrum.
  • An LPC mel cepstrum of about the 10th to 25th order is generally used, but it is meaningless to make the order much larger than the 10th order of the original LPC coefficients. An order of about 10 to 15 is therefore appropriate for the LPC mel cepstrum generated by the mel cepstrum calculation unit 23.
  • Hereinafter, the order of the LPC mel cepstrum is described as 10.
  • the 10th-order LPC mel cepstrum of each frame generated by the mel cepstrum calculation unit 23 constitutes a 10-dimensional LPC mel cepstrum vector.
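  • For reference, the plain LPC cepstrum (which this embodiment also allows, see below) can be obtained from the LPC coefficients by the standard recursion sketched here; the mel cepstrum additionally applies a bilinear-transform frequency warping, which is omitted. The sign convention assumes A(z) = 1 + a1 z^-1 + ... + ap z^-p:

```python
import numpy as np

def lpc_to_cepstrum(a: np.ndarray, n_ceps: int = 10) -> np.ndarray:
    """Convert LPC coefficients a[0..p-1] (i.e. a1..ap of A(z)) into the
    LPC cepstrum coefficients c1..c_n_ceps of the all-pole model."""
    p = len(a)
    a1 = np.concatenate(([0.0], a))       # 1-based indexing: a1[1..p]
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a1[n] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a1[n - k]
        c[n] = acc
    return c[1:]
```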
  • The information change amount calculation unit 3b of FIG. 5 is the same as the information change amount calculation unit 3 of FIG. 1, but calculates the information change amount Vf based on the output of the frequency information generation unit 2b instead of the output of the frequency information generation unit 2 of FIG. 1.
  • The information change amount calculation unit 3b calculates, as the information change amount Vf, the distance between the LPC mel cepstrum vector of the current frame output from the mel cepstrum calculation unit 23 of the frequency information generation unit 2b and the LPC mel cepstrum vector one frame before.
  • the operation of the information change amount calculation unit 3b is basically the same as that of the information change amount calculation unit 3 of the first embodiment, except that the input is not the LSP coefficient vector but the LPC mel cepstrum vector.
  • the operation of the speech speed conversion device of the second embodiment is the same as the operation of the speech speed conversion device of the first embodiment.
  • In the above description, the information change amount calculation unit 3b considers the 10th-order LPC mel cepstrum as one 10-dimensional vector and obtains the inter-vector distance d(n). It is not always necessary to calculate the amount of change using all of the 10th-order LPC mel cepstrum coefficients. Also, instead of the distance between the latest LPC mel cepstrum vector and the vector one frame before it, the distance from an LPC mel cepstrum vector several frames before may be calculated.
  • In the above description, the LPC mel cepstrum is calculated, its amount of change at regular time intervals is obtained, and syllable transitions are detected based on that amount of change; alternatively, the LPC cepstrum may be calculated instead of the LPC mel cepstrum, its amount of change at regular time intervals obtained, and syllable transitions detected based on that amount of change.
  • As the amount of change of the LPC cepstrum at regular intervals, for example, the distance between vectors composed of the LPC cepstrum coefficients can be used.
  • As described above, in Embodiment 2 the frequency information generation unit 2b extracts the LPC coefficients obtained in the voice decoding process of the voice decoding unit 1 and calculates the LPC mel cepstrum or LPC cepstrum from them, and the information change amount calculation unit 3b obtains the amount of change of the LPC mel cepstrum or LPC cepstrum at regular time intervals and detects the syllable transitions of the voice signal based on this amount of change. It is therefore unnecessary to subject the decoded voice signal to spectral analysis by all-pole-model linear predictive coding (LPC), so the speech speed conversion can be performed with a small amount of calculation.
  • Moreover, the LPC mel cepstrum or LPC cepstrum used for detecting syllable transitions is calculated from the LSP coefficients obtained in the process of encoding the voice, and since these LSP coefficients are the information transmitted to the voice decoding unit 1, the result is closer to the vocal tract frequency characteristics of the voice represented by the signal before encoding than frequency characteristics obtained by spectral analysis of the decoded voice signal. The speaking speed can therefore be measured with high accuracy even for a voice signal obtained by decoding voice code data generated by high-efficiency coding.
  • The LPC mel cepstrum and LPC cepstrum are widely used in speech recognition, and they allow syllable transitions to be determined with higher accuracy than the determination using the LSP coefficients shown in Embodiment 1.
  • Embodiment 3. In Embodiment 1, the LSP coefficients were extracted as the frequency information of the voice, and syllable transitions were detected based on their time change amount. The extracted LSP coefficients may instead be thinned out first, with syllable transitions then detected based on the time change amount.
  • FIG. 6 shows the configuration of the speech speed conversion device according to the third embodiment.
  • the speech speed conversion device shown in FIG. 6 is generally the same as the speech speed conversion device of the first embodiment described with reference to FIG. 1, but differs in the following points. That is, instead of the frequency information generation unit 2, the information change amount calculation unit 3, and the extreme value detection unit 4, the frequency information generation unit 2c, the information change amount calculation unit 3c, and the extreme value detection unit 4c are provided.
  • the frequency information generation unit 2c includes an LSP coefficient extraction unit 21 and a thinning unit 24.
  • the LSP coefficient extraction unit 21 is the same as the LSP coefficient extraction unit 21 of the first embodiment, and operates in the same manner.
  • the thinning unit 24 thins out the LSP coefficient (frequency information) extracted by the LSP coefficient extracting unit 21 at a thinning rate M.
  • M is an integer of 2 or more.
  • For example, when the voice decoding unit 1 decodes the LSP coefficients every 10 milliseconds, the LSP coefficient extraction unit 21 extracts the LSP coefficients every 10 milliseconds, and the thinning unit 24 passes the extracted LSP coefficients only once every M frames, that is, every 10×M milliseconds.
  • The operation of the information change amount calculation unit 3c is basically the same as that of the information change amount calculation unit 3 of Embodiment 1, but the processing is performed not every frame but every M frames.
  • The information change amount calculation unit 3c likewise calculates the distance between LSP coefficient vectors output one after another from the thinning unit 24; as a result, it obtains the distance between the latest (current frame) LSP coefficient vector and the LSP coefficient vector M frames before.
  • The frame length of the voice decoding process in the CS-ACELP coding method is 10 milliseconds, which is sufficiently short relative to the time variation of the vocal tract frequency characteristics of the voice, so no problem occurs unless M becomes excessively large.
  • The value of M should be set so that 10×M milliseconds is shorter than the period of syllable transitions in the utterance; a thinning sketch follows.
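  • A minimal sketch of the thinning step, assuming the per-frame LSP vectors arrive as a list (M = 4 here is an arbitrary example value):

```python
def thin(frames: list, M: int = 4) -> list:
    """Thinning unit 24: pass only every M-th LSP coefficient vector,
    so the downstream distance and extremum units run once per
    10*M milliseconds instead of every 10 ms frame."""
    return frames[::M]

# Of 100 frames (1 second at 10 ms/frame), only 25 vectors reach the
# information change amount calculation unit 3c when M = 4.
```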
  • the operation of the extreme value detecting unit 4c is basically the same as that of the extreme value detecting unit 4 of the first embodiment. However, similarly to the information change amount calculation unit 3c, the processing is performed not for each frame but for each M frame.
  • the extreme value detection unit 4c detects the maximum value Mx and the minimum value Mn of the information change amount Vf of the latest past Nb frame input to the extreme value detection unit 4c.
  • Nb is smaller than Na in the description of the extremum detection unit 4 of the first embodiment. This is because the information change amount Vf is input to the extreme value detection unit 4c of the third embodiment not every frame but every M frame.
  • For example, Nb is set to a value equal to Na/M.
  • Instead of determining whether the information change amount Vf at time n-1, one frame before the current time n, is a maximum, the extreme value detection unit 4c determines whether the information change amount Vf at time n-M, M frames before, is a maximum.
  • If d(n) is smaller than d(n-M) and d(n-M) is larger than d(n-2M), d(n-M) is determined to be a maximum; if this condition is not satisfied, d(n-M) is determined not to be a maximum.
  • Likewise, the minimum is searched for as a time m at which d(m) is larger than d(m-M) and d(m-M) is smaller than d(m-2M).
  • the operation of the speech speed conversion device of the third embodiment is the same as the operation of the speech speed conversion device of the first embodiment.
  • The same effect as in Embodiment 1 can be obtained in Embodiment 3. Further, since the thinning unit 24 passes the LSP coefficients only once every M frames, the operation frequency of the information change amount calculation unit 3c and the extreme value detection unit 4c is reduced, and the amount of calculation can be reduced further than in Embodiment 1.
  • Embodiment 4. The same thinning as in Embodiment 3 may also be applied to the speech speed conversion device of Embodiment 2, which detects syllable transitions by obtaining the amount of change in the LPC mel cepstrum.
  • FIG. 7 shows the configuration of the speech speed conversion device according to the fourth embodiment.
  • the speech speed conversion device shown in FIG. 7 is generally the same as the speech speed conversion device of the second embodiment described with reference to FIG. 5, but differs in the following points. That is, instead of the frequency information generation unit 2b, the information change amount calculation unit 3b, and the extreme value detection unit 4, the frequency information generation unit 2d, the information change amount calculation unit 3d, and the extreme value detection unit 4c are provided.
  • The frequency information generation unit 2d includes an LPC coefficient extraction unit 22, a thinning unit 24d, and a mel cepstrum calculation unit 23d.
  • the LPC coefficient extraction unit 22 extracts the LPC coefficient from the information generated by the decoding operation of the voice decoding unit 1 in the same manner as the LPC coefficient extraction unit 22 of the second embodiment shown in FIG.
  • The thinning unit 24d is the same as the thinning unit 24 of FIG. 6, except that it thins out the output of the LPC coefficient extraction unit 22 instead of the output of the LSP coefficient extraction unit 21. In this thinning, the LPC coefficients output from the LPC coefficient extraction unit 22 every frame are passed only once every M frames (M is an integer of 2 or more).
  • For example, when the voice decoding unit 1 decodes the LSP coefficients every 10 milliseconds, the LPC coefficient extraction unit 22 extracts the LPC coefficients every 10 milliseconds, and the thinning unit 24d passes the extracted LPC coefficients only once every M frames, that is, every 10×M milliseconds.
  • the mel cepstrum calculation unit 23d converts the LPC coefficient output from the thinning unit 24d into an LPC mel cepstrum.
  • The operation of the mel cepstrum calculation unit 23d is basically the same as that of the mel cepstrum calculation unit 23 of FIG. 5, but the processing is performed not every frame but every M frames.
  • The information change amount calculation unit 3d of FIG. 7 is the same as the information change amount calculation unit 3b of FIG. 5, but calculates the information change amount Vf based on the output of the mel cepstrum calculation unit 23d instead of the output of the mel cepstrum calculation unit 23 of FIG. 5.
  • the information change amount calculation unit 3d considers the 10th-order LPC mel cepstrum output from the mel cepstrum calculation unit 23d of the frequency information generation unit 2d as one 10-dimensional vector, and obtains the inter-vector distance d (n).
  • the operation of the information change amount calculation unit 3d is basically the same as that of the information change amount calculation unit 3c of FIG. 6, except that the input is not the LSP coefficient but the LPC mel cepstrum. Further, the operation of the information change amount calculation unit 3d is basically the same as that of the information change amount calculation unit 3b in FIG. 5, except that the processing is performed not for each frame but for each M frame.
  • the extremum detection unit 4c of the fourth embodiment is the same as the extremum detection unit 4c of FIG. 6, and operates in the same manner.
  • the operation of the speech speed conversion device of the fourth embodiment is the same as the operation of the speech speed conversion device of the second embodiment.
  • The same effect as in Embodiment 2 can be obtained in Embodiment 4. Further, since the thinning unit 24d passes the LPC coefficients only once every M frames, the operation frequency of the mel cepstrum calculation unit 23d, the information change amount calculation unit 3d, and the extreme value detection unit 4c is reduced, and the amount of calculation can be reduced compared with Embodiment 2.
  • Embodiment 5. To the syllable transition detection of Embodiments 1 to 4, a function of detecting whether the voice is a voiced sound or an unvoiced sound can be added, and syllable transitions can then be detected using the voiced/unvoiced information in combination.
  • the fifth embodiment is a modification of the first embodiment in which voiced / unvoiced information is used in combination.
  • FIG. 8 shows the configuration of the speech speed conversion device according to the fifth embodiment.
  • the speech speed conversion device shown in FIG. 8 is generally the same as the speech speed conversion device of FIG. 1, but differs in the following points. That is, a voiced information extraction unit 10 is added, and a syllable transition determination unit 6e is provided instead of the syllable transition determination unit 6.
  • the configuration of the frequency information generation unit 2 in FIG. 8 is the same as that in FIG. 1, and the illustration thereof is omitted.
  • the voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing voice decoding in the voice decoding unit 1.
  • When the voice decoding unit 1 performs voice decoding by the CS-ACELP coding method, its configuration is, for example, as shown in FIG. 2, and the voiced/unvoiced information Vc is obtained from the post filter unit 111 of FIG. 2.
  • the long-term post filter in the post filter unit 111 is a filter that emphasizes the pitch component, and in this long-term post filter, a gain coefficient that controls the degree of emphasis of the pitch component is used.
  • When the voice is a voiced sound, emphasizing the pitch component improves the sound quality, but when it is not a voiced sound, the emphasis degrades the sound quality, so the gain coefficient is set to zero and the pitch component is not emphasized. The gain coefficient of this long-term post filter can therefore be used as the voiced/unvoiced information Vc: the decoded voice when this coefficient is not 0 can be regarded as voiced, and the decoded voice when this coefficient is 0 as unvoiced.
  • the voiced information extraction unit 10 extracts and outputs the above gain coefficient as voiced / unvoiced information Vc.
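  • A one-line sketch of this mapping (the way the gain value is exposed depends on the decoder implementation and is assumed here):

```python
def voiced_unvoiced_info(longterm_postfilter_gain: float) -> bool:
    """Voiced/unvoiced information Vc from the long-term post filter:
    the decoded frame is regarded as voiced when the pitch-emphasis
    gain coefficient is nonzero."""
    return longterm_postfilter_gain != 0.0
```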
  • Like the syllable transition determination unit 6 of FIG. 1, the syllable transition determination unit 6e determines the presence or absence of a syllable transition only when the maximum value Mx and the minimum value Mn have been detected by the extreme value detection unit 4 and sound has been detected by the sound detection unit 5. The syllable transition determination unit 6e determines that there is a syllable transition when the maximum value Mx is larger than the predetermined threshold Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold Th2.
  • The syllable transition determination unit 6e also determines that there is a syllable transition when the voiced/unvoiced information Vc input from the voiced information extraction unit 10 changes from indicating "unvoiced" to indicating "voiced". If none of these conditions applies, the syllable transition determination unit 6e determines that there is no syllable transition.
  • Alternatively, the syllable transition determination unit 6e may determine that there is a syllable transition when the voiced/unvoiced information Vc input from the voiced information extraction unit 10 changes from indicating "voiced" to indicating "unvoiced". That is, in addition to determining syllable transitions based on the maximum value Mx and the minimum value Mn, the syllable transition determination unit 6e may determine that there is a syllable transition when it detects from the voiced/unvoiced information Vc that the state has changed from unvoiced to voiced, or when it detects that the state has changed from voiced to unvoiced.
  • The determination of the presence or absence of a syllable transition based on the voiced/unvoiced information Vc in the syllable transition determination unit 6e may also be performed independently of the detection of the maximum value Mx and the minimum value Mn in the extreme value detection unit 4; that is, even when the maximum value Mx and the minimum value Mn are not detected by the extreme value detection unit 4, it may be determined based on the voiced/unvoiced information Vc that a syllable has transitioned.
  • The detection of syllable transitions by the maximum value Mx and the minimum value Mn in the extreme value detection unit 4, that is, detection based on the time change of the vocal tract frequency characteristics, may miss syllable transitions when the pronunciation is not clear.
  • This oversight can be compensated for by also using changes in the voiced/unvoiced information Vc output from the voiced information extraction unit 10, as sketched below.
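  • A sketch of the combined rule of unit 6e (the threshold defaults and the handling of a missing extremum pair are assumptions):

```python
def syllable_transition_6e(ext, is_sound: bool,
                           voiced_now: bool, voiced_prev: bool,
                           th1: float = 0.01, th2: float = 0.005) -> bool:
    """Unit 6e: the spectral condition of unit 6 is OR-ed with a change
    of the voiced/unvoiced state, to catch transitions that the
    spectral condition misses when pronunciation is unclear.
    ext is (Mx, Mn) from the extreme value detection unit, or None."""
    spectral = ext is not None and ext[0] > th1 and (ext[0] - ext[1]) > th2
    vc_change = voiced_now != voiced_prev   # unvoiced->voiced or voiced->unvoiced
    return is_sound and (spectral or vc_change)
```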
  • As described above, in Embodiment 5 the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the voice decoding process of the voice decoding unit 1, and the syllable transition determination unit 6e also determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, so the measurement accuracy of the speaking speed can be improved further than in Embodiment 1.
  • Embodiment 6. The syllable transition detection using the LPC mel cepstrum or LPC cepstrum shown in Embodiment 2 and the syllable transition detection using the voiced/unvoiced information Vc shown in Embodiment 5 can also be used in combination.
  • The speech speed conversion device shown in FIG. 9 is generally the same as the speech speed conversion device of FIG. 5, but differs in that a voiced information extraction unit 10 is added and a syllable transition determination unit 6e is provided instead of the syllable transition determination unit 6.
  • the configuration of the frequency information generation unit 2b in FIG. 9 is the same as that in FIG. 5, and the illustration thereof is omitted.
  • the voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 with respect to the fifth embodiment, and operate in the same manner.
  • In Embodiment 6, the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the voice decoding process of the voice decoding unit 1, and the syllable transition determination unit 6e also determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, so the measurement accuracy of the speaking speed can be improved further than in Embodiment 2.
  • Embodiment 7. The syllable transition detection using the voiced/unvoiced information shown in Embodiment 5 can also be combined with the speech speed conversion device of Embodiment 3 described with reference to FIG. 6.
  • The speech speed conversion device of FIG. 10 is generally the same as the speech speed conversion device of FIG. 6, but differs in that a voiced information extraction unit 10 is added and a syllable transition determination unit 6e is provided instead of the syllable transition determination unit 6.
  • the configuration of the frequency information generation unit 2c in FIG. 10 is the same as that in FIG. 6, and the illustration thereof is omitted.
  • the voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 with respect to the fifth embodiment, and operate in the same manner.
  • In Embodiment 7, the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the voice decoding process of the voice decoding unit 1, and the syllable transition determination unit 6e also determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, so the measurement accuracy of the speaking speed can be improved further than in Embodiment 3.
  • Embodiment 8. The syllable transition detection using the voiced/unvoiced information shown in Embodiment 5 can also be combined with the speech speed conversion device of Embodiment 4 described with reference to FIG. 7.
  • The speech speed conversion device shown in FIG. 11 is generally the same as the speech speed conversion device of FIG. 7, but differs in that a voiced information extraction unit 10 is added and a syllable transition determination unit 6e is provided instead of the syllable transition determination unit 6.
  • the configuration of the frequency information generation unit 2d in FIG. 11 is the same as that in FIG. 7, and the illustration thereof is omitted.
  • the voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 with respect to the fifth embodiment, and operate in the same manner.
  • the voiced information extraction unit 10 extracts the voiced / unvoiced information Vc obtained in the process of performing the voice decoding in the voice decoding unit 1, and the syllable transition determination unit 6e determines the presence / absence of the syllable transition based on the voiced / unvoiced information Vc. Since the determination is made, the measurement accuracy of the speaking speed can be further improved as compared with the fourth embodiment.
A part or all of each of the speech speed conversion devices of the first to eighth embodiments may be constituted by processing circuitry.
For example, the functions of the respective parts of the speech speed conversion device may be realized by separate processing circuits, or the functions of a plurality of parts may be collectively realized by a single processing circuit.
The processing circuitry may be implemented in hardware, or in software, that is, by a programmed computer. Of the functions of the respective parts of the speech speed conversion device, a part may be realized by hardware and the rest by software.
FIG. 12 shows the hardware configuration of a computer 90 that realizes all the functions of the speech speed conversion device.
The computer 90 has a processor 91 and a memory 92.
The memory 92 stores a program for realizing the functions of the respective parts of the speech speed conversion device.
As the processor 91, for example, a CPU (Central Processing Unit), a microprocessor, a microcontroller, or a DSP (Digital Signal Processor) is used.
As the memory 92, for example, a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), an EEPROM (Electrically Erasable Programmable Read Only Memory), or a magneto-optical disk is used.
The processor 91 and the memory 92 may be realized by an LSI (Large Scale Integration) into which they are integrated.
The processor 91 realizes the functions of the speech speed conversion device by executing the program stored in the memory 92.
The program may be provided over a network, or may be recorded and provided on a recording medium, for example a non-transitory recording medium. That is, the program may be provided, for example, as a program product.
The computer of FIG. 12 includes a single processor, but it may include two or more processors.
FIG. 13 shows the procedure of processing performed by the processor when the speech speed conversion device of FIG. 1 is implemented by the computer of FIG. 12. The process of FIG. 13 is started every time one frame of voice code data is received. Accordingly, when the speech speed conversion device processes voice code data encoded with high efficiency by the CS-ACELP coding method, the process shown in FIG. 13 is started every 10 milliseconds.
In step ST1, the processor 91 decodes the received voice code data and outputs the decoded voice signal.
The process of step ST1 has the same content as the process in the voice decoding unit 1 of FIG. 1; for example, the same processing as the operation of the voice decoding unit described with reference to FIG. 2 is performed.
In step ST2, the processor 91 determines whether the decoded voice signal is sounded or silent.
The process of step ST2 has the same content as the process in the sound detection unit 5 of FIG. 1.
The processing in the sound detection unit 5, for example the comparison with the threshold value D56 or D50, is performed at intervals shorter than the frame period; the calculation of the average value D55a, the update of the noise level value D55, and the update of the threshold value D56 are assumed to be performed separately from this.
In step ST3, the processor 91 generates the frequency information Fa at regular time intervals based on the information generated in the voice decoding process of step ST1. Specifically, the LSP coefficients generated by the voice decoding process are extracted; for example, the 10th-order LSP coefficients are extracted for each frame.
The process of step ST3 has the same content as the process in the frequency information generation unit 2 of FIG. 1.
In step ST4, the processor 91 calculates, at regular time intervals, the time change amount of the frequency information Fa generated in step ST3. Specifically, the 10th-order LSP coefficients are regarded as one vector, and the distance between the latest LSP coefficient vector and the LSP coefficient vector one frame before it is calculated as the information change amount.
The process of step ST4 has the same content as the process in the information change amount calculation unit 3 of FIG. 1.
In step ST5, the processor 91 refers to the determination result of step ST2; if the signal is not determined to be sounded (No in ST5), the process proceeds to step ST12, and if it is determined to be sounded (Yes in ST5), the process proceeds to step ST6.
In step ST6, the processor 91 detects the maximum value Mx and the minimum value Mn of the information change amount Vf calculated in step ST4. To do so, it first determines whether there are a maximum and a minimum within the most recent past Na-frame period. Specifically, it determines whether the information change amount Vf at time n-1, one frame before the current time n, is a maximum; if it is, that value (the maximum value) Mx is acquired, the minimum of the information change amount Vf over the most recent past Na frames is identified, and that value (the minimum value) Mn is acquired.
The process of step ST6 has the same content as the process in the extreme value detection unit 4 of FIG. 1.
In step ST7, the processor 91 determines whether a maximum and a minimum were detected in step ST6. If they were not detected, the process proceeds to step ST12; if they were detected, the process proceeds to step ST8.
In step ST8, the processor 91 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected in step ST6. For example, if the maximum value Mx is larger than a predetermined threshold value Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than a predetermined threshold value Th2, it determines that there is a syllable transition; otherwise, it determines that there is no syllable transition.
The process of step ST8 has the same content as the process in the syllable transition determination unit 6 of FIG. 1.
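As an illustration, the decision rule of step ST8 fits in a few lines of Python; the values of Th1 and Th2 below are placeholders, since the document leaves the thresholds implementation-defined.

```python
def syllable_transition(mx, mn, th1=0.5, th2=0.3):
    """Step ST8: a syllable transition is assumed when the local maximum
    of the information change amount is prominent enough.
    th1 and th2 are hypothetical values, not taken from the patent."""
    return mx > th1 and (mx - mn) > th2
```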
In step ST9, the processor 91 determines whether it is the timing for calculating the speaking speed. For example, when the speaking speed is calculated every fixed calculation cycle, it is determined in step ST9 whether a time corresponding to one calculation cycle has elapsed since the previous calculation. If it is not the calculation timing, the process proceeds to step ST12; if it is, the process proceeds to step ST10.
In step ST10, the processor 91 calculates the speaking speed Ss.
The number of syllables per unit time is obtained from the number of syllables in a certain period consisting of the most recent sounded time as of each time point and the length of that period, and this is taken as the speaking speed Ss.
The time during which the speaker is silent is not included in the averaging time; that is, the averaging is performed only over the time determined to be sounded in step ST2.
The process of step ST10 has the same content as the process in the speech speed calculation unit 7 of FIG. 1.
In step ST11, the processor 91 determines the speaking speed conversion rate Rc.
The speaking speed conversion rate is determined by obtaining the ratio St/Sr of the target speaking speed St to the speaking speed Sr calculated in step ST10.
The process of step ST11 has the same content as the process in the conversion rate determination unit 8 of FIG. 1.
In step ST12, the processor 91 performs speech speed conversion on the voice signal decoded in step ST1, using the speaking speed conversion rate obtained in step ST11.
The process of step ST12 has the same content as the process in the speech speed conversion unit 9 of FIG. 1.
If the result is No in step ST5, step ST7, or step ST9, the process proceeds to step ST12 without passing through step ST11 and the like; in these cases no new speaking speed conversion rate is calculated, and the speech speed conversion in step ST12 is performed using the latest speaking speed conversion rate calculated in the past. Further, as described above, by increasing the speaking speed of the voice signal determined to be silent, or by deleting part or all of it, the processing delay of the speech speed conversion can be prevented from continuing to increase. If the delay exceeds a certain value, the voice is output without speech speed conversion even if it is sounded.
FIG. 14 shows the procedure of processing performed by the processor when the speech speed conversion device of FIG. 5 is implemented by the computer of FIG. 12. The processing procedure of FIG. 14 is generally the same as that of FIG. 13, except that steps ST3 and ST4 are replaced by steps ST13 and ST4b, and the processing of step ST14 is performed after step ST13.
In step ST13, the processor 91 extracts the LPC coefficients generated by the voice decoding process of step ST1.
When voice decoding by the CS-ACELP coding method is performed in step ST1, the LPC coefficients obtained by conversion from the LSP coefficients are extracted.
The process of step ST13 has the same content as the process in the LPC coefficient extraction unit 22 of FIG. 5.
In step ST14, the processor 91 converts the LPC coefficients extracted in step ST13 into an LPC mel cepstrum; for example, a 10th-order LPC mel cepstrum is generated by the conversion. The LPC mel cepstrum is used as the frequency information Fa.
The process of step ST14 has the same content as the process in the mel cepstrum calculation unit 23 of FIG. 5.
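For reference, the sketch below shows a standard recursion for deriving cepstral coefficients from LPC coefficients of a filter of the form H(z) = 1 / (1 + Σ a_k z^(-k)); the mel-frequency warping that distinguishes the LPC mel cepstrum from the plain LPC cepstrum is omitted, so this is a simplified illustration rather than the exact computation of the mel cepstrum calculation unit 23.

```python
def lpc_to_cepstrum(a, n_ceps=10):
    """Plain LPC-to-cepstrum recursion (mel warping omitted).
    `a` holds the LPC coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^-k."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] would be ln(gain); 0 for unit gain
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = -acc
    return c[1:]
```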
The process of step ST4b in FIG. 14 is the same as the process of step ST4 in FIG. 13, except that the frequency information Fa used for calculating the information change amount Vf is different. That is, in step ST4b, the processor 91 regards the 10th-order LPC mel cepstrum as one vector, and calculates the distance between the latest LPC mel cepstrum vector and the LPC mel cepstrum vector one frame before it as the information change amount Vf.
The process of step ST4b has the same content as the process in the information change amount calculation unit 3b of FIG. 5.
FIG. 15 shows the procedure of processing performed by the processor when the speech speed conversion device of FIG. 6 is implemented by the computer of FIG. 12. The processing procedure of FIG. 15 is generally the same as that of FIG. 13, but differs in that the processing of step ST15 is performed after step ST3, the processing of step ST16 is performed before step ST4, and step ST6 is replaced by step ST6c.
In step ST15, the processor 91 determines whether it is the extraction timing of the thinning process. For example, suppose the thinning rate is M and extraction is performed only once every M frames; in that case, it is determined whether M frames have elapsed since the previous extraction. At the time of the first extraction after the processor 91 starts operating, it is determined to be the extraction timing even though M frames have not elapsed since a previous extraction.
If it is not the extraction timing (No in ST15), the process proceeds to step ST12; if it is the extraction timing (Yes in ST15), the process proceeds to step ST16.
In step ST16, the processor 91 extracts the frequency information generated in step ST3.
The processes of steps ST15 and ST16 have the same content as the process in the thinning unit 24 of FIG. 6. After step ST16, the process proceeds to step ST4.
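A minimal sketch of the thinning of steps ST15/ST16 (extraction once every M frames, with the first frame after start-up always extracted) might look like this:

```python
class Thinning:
    """Steps ST15/ST16: pass the frequency information through only once
    every M frames (M is the thinning rate)."""
    def __init__(self, m):
        self.m = m
        self.count = None  # None -> first frame is always extracted

    def extract(self, freq_info):
        if self.count is None or self.count >= self.m:
            self.count = 1
            return freq_info  # extraction timing (Yes in ST15)
        self.count += 1
        return None           # skip this frame (No in ST15)
```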
The processing of steps ST4 and ST5 in FIG. 15 is the same as that of steps ST4 and ST5 in FIG. 13, except that these processes are performed only once every M frames.
In step ST6c of FIG. 15, the processor 91 detects the maximum value Mx and the minimum value Mn from the information change amount Vf calculated in step ST4 over the most recent past Nb frames.
Nb is smaller than the Na in the description of step ST6 of FIG. 13, because in step ST4 of FIG. 15 the information change amount Vf is calculated not for every frame but once every M frames; for example, Nb is set to a value equal to Na/M.
The process of step ST6c has the same content as the process in the extreme value detection unit 4c of FIG. 6.
FIG. 16 shows the procedure of processing performed by the processor when the speech speed conversion device of FIG. 7 is implemented by the computer of FIG. 12. The processing procedure of FIG. 16 is generally the same as that of FIG. 14, but differs in that the processing of step ST15 is performed after step ST13, the processing of step ST17 is performed before step ST14, and step ST6 is replaced by step ST6c.
In step ST15, the processor 91 determines whether it is the extraction timing of the thinning process. If it is not the extraction timing (No in ST15), the process proceeds to step ST12; if it is the extraction timing (Yes in ST15), the process proceeds to step ST17. In step ST17, the processor 91 extracts the LPC coefficients obtained in step ST13. The processes of steps ST15 and ST17 have the same content as the process in the thinning unit 24d of FIG. 7. After step ST17, the process proceeds to step ST14.
The processing of steps ST14, ST4, and ST5 in FIG. 16 is the same as that of the corresponding steps in FIG. 14, except that these processes are performed only once every M frames.
In step ST6c of FIG. 16, the processor 91 detects the maximum value Mx and the minimum value Mn from the information change amount Vf calculated in step ST4 over the most recent past Nb frames.
This Nb is the same as the Nb in the description of step ST6c of FIG. 15, and is smaller than the Na in the description of step ST6 of FIG. 13.
The process of step ST6c has the same content as the process in the extreme value detection unit 4c of FIG. 7.
FIG. 17 shows the procedure of processing performed by the processor when the speech speed conversion device of FIG. 8 is implemented by the computer of FIG. 12. The processing procedure of FIG. 17 is generally the same as that of FIG. 13, but differs in that the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8.
In step ST18, the processor 91 determines whether the received voice is voiced or unvoiced from the information obtained in the course of the voice decoding process of step ST1; for example, when voice decoding by the CS-ACELP coding method is performed in step ST1, the determination is made using the voiced/unvoiced information obtained in the course of that decoding.
The process of step ST18 has the same content as the process in the voiced information extraction unit 10 of FIG. 8.
In step ST8e, the processor 91 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected in step ST6 and on the result of the voiced/unvoiced determination in step ST18. For example, when the maximum value Mx is larger than the predetermined threshold value Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold value Th2, or when the result of the determination in step ST18 changes from "unvoiced" to "voiced", it is determined that there is a syllable transition; if neither applies, it is determined that there is no syllable transition.
The process of step ST8e has the same content as the process in the syllable transition determination unit 6e of FIG. 8.
Instead of determining that there is a syllable transition when the result of the determination in step ST18 changes from "unvoiced" to "voiced", it may be determined that there is a syllable transition when the result changes from "voiced" to "unvoiced".
In the above description, the process proceeds to step ST8e only when it is determined in step ST7 that the maximum value Mx and the minimum value Mn have been detected; however, even when they have not been detected, the presence or absence of a syllable transition may be determined based on the result of the voiced/unvoiced determination in step ST18.
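Putting the pieces of step ST8e together, including the variation in which no extrema were detected, a sketch of the combined rule could be written as follows; the thresholds are again placeholders.

```python
def syllable_transition_st8e(mx, mn, voiced_now, voiced_prev,
                             th1=0.5, th2=0.3):
    """Step ST8e: either the extremum condition of step ST8 holds, or the
    voiced/unvoiced decision flips from unvoiced to voiced.  mx/mn may be
    None when step ST7 found no extrema.  th1/th2 are hypothetical."""
    extremum_cond = mx is not None and mx > th1 and (mx - mn) > th2
    onset_cond = voiced_now and not voiced_prev  # "unvoiced" -> "voiced"
    return extremum_cond or onset_cond
```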
FIG. 18 shows the procedure of processing performed by the processor when the speech speed conversion device of FIG. 9 is implemented by the computer of FIG. 12. The processing procedure of FIG. 18 is generally the same as that of FIG. 14, but differs in that the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8.
The processing of steps ST18 and ST8e is the same as that of steps ST18 and ST8e described with reference to FIG. 17.
The processing procedure of FIG. 19 is generally the same as that shown in FIG. 15, but differs in that the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8.
The processing of steps ST18 and ST8e is the same as that of steps ST18 and ST8e described with reference to FIG. 17.
The processing procedure of FIG. 20 is generally the same as that shown in FIG. 16, but differs in that the processing of step ST18 is performed after step ST2, and the processing of step ST8e is performed instead of step ST8.
The processing of steps ST18 and ST8e is the same as that of steps ST18 and ST8e described with reference to FIG. 17.
The configurations of the above embodiments can also be applied to the case of increasing the speaking speed. In that case, the speech speed conversion may be performed using St/Sr as the speaking speed conversion rate Rc even when St/Sr is larger than 1.
Although the speech speed conversion device has been described above, the speech speed conversion method implemented by using the speech speed conversion device can also be carried out, and it is also possible to cause a computer, by means of a program, to execute the processing of the speech speed conversion device or of the speech speed conversion method.

Abstract

This speech speed conversion device for converting the speaking speed in a voice communication device decodes voice code data and outputs a voice signal, generates frequency information from information obtained in the course of decoding, obtains the amount of variation over time of the frequency information as an information change amount, determines on the basis of the voice signal whether the received voice represented by the voice code data is sounded or silent, determines that a syllable has transitioned if the information change amount during the time the received voice is determined to be sounded satisfies a predetermined condition, calculates the speaking speed on the basis of the result of the syllable transition determination, determines a conversion rate on the basis of the speaking speed, and converts the speaking speed at the determined conversion rate. The invention makes it possible to reduce the amount of calculation and to carry out appropriate speech speed conversion in accordance with the speaking speed.

Description

Speaking speed conversion device, speaking speed conversion method, program, and recording medium
The present disclosure relates to a speech speed conversion device, a speech speed conversion method, a program, and a recording medium.
In voice communication that transmits and receives high-efficiency encoded voice data, speech speed conversion techniques that play back the voice data at a slower or faster speed without changing the voice quality have been developed in order to improve the ease of hearing the voice.
When speech speed conversion is performed in voice communication, it is common to lower the speaking speed of sounded sections so that they are easier to hear, and to delete part or all of the silent sections or raise their speaking speed so as to prevent an increase in delay.
In speech speed conversion, lowering the speaking speed of rapidly spoken voice can improve its ease of hearing, but lowering the speaking speed of slowly spoken voice makes the rhythm of the utterance hard to follow and may instead impair the ease of hearing.
A mechanism for measuring the speaking speed of the voice before speech speed conversion is therefore required.
Conventionally, a technique has been disclosed that measures the speaking speed by obtaining spectral features of the spoken voice (Patent Document 1). In this technique, the spoken voice is subjected to spectral analysis every 10 ms by the linear predictive coding (LPC) method based on an all-pole model or by the fast Fourier transform (FFT), and a spectral feature vector is obtained from the spectral analysis result.
Syllable transitions are then detected by observing changes in the spectral feature vector, and the speaking speed is measured.
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2005-331589
When such a speech speed conversion device is used to convert the speaking speed of received voice in a voice communication device that transmits and receives high-efficiency encoded voice code data, the decoding process and the speech speed conversion process must be performed at the same time, so there is a problem that the amount of calculation is large.
Furthermore, when the speaking speed is measured based on a voice signal obtained by decoding high-efficiency encoded voice code data, distortion is superimposed on the decoded voice signal, so there is a problem that the measurement accuracy of the speaking speed is low.
The present disclosure has been made to solve the above problems, and aims to make it possible to reduce the amount of calculation, to measure the speaking speed accurately even for a voice signal obtained by decoding voice code data, and to perform appropriate speech speed conversion in accordance with the speaking speed.
The speech speed conversion device of the present disclosure is a speech speed conversion device that converts the speaking speed in a voice communication device, and includes:
a voice decoding unit that decodes high-efficiency encoded voice code data and outputs a voice signal;
a frequency information generation unit that generates frequency information from information obtained in the process of decoding the voice code data in the voice decoding unit;
an information change amount calculation unit that obtains, as an information change amount, the time change amount of the generated frequency information at regular time intervals;
a sound detection unit that determines, based on the voice signal, whether the received voice represented by the voice code data is sounded or silent;
a syllable transition determination unit that determines that a syllable of the received voice has transitioned when the information change amount, while the received voice is determined to be sounded by the sound detection unit, satisfies a predetermined condition;
a speech speed calculation unit that calculates the speaking speed based on the determination result of the syllable transition determination unit;
a conversion rate determination unit that determines a conversion rate based on the speaking speed calculated by the speech speed calculation unit; and
a speech speed conversion unit that converts the speaking speed of the voice signal at the conversion rate determined by the conversion rate determination unit.
According to the present disclosure, it is possible to reduce the amount of calculation and to perform appropriate speech speed conversion in accordance with the speaking speed.
FIG. 1 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 1.
FIG. 2 is a block diagram showing a configuration example of the voice decoding unit of FIG. 1.
FIG. 3 is a block diagram showing a configuration example of the sound detection unit of FIG. 1.
FIGS. 4(a) to 4(h) are time charts showing signals appearing in respective parts of the sound detection unit of FIG. 3.
FIG. 5 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 2.
FIG. 6 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 3.
FIG. 7 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 4.
FIG. 8 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 5.
FIG. 9 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 6.
FIG. 10 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 7.
FIG. 11 is a block diagram showing the configuration of the speech speed conversion device according to Embodiment 8.
FIG. 12 is a block diagram showing the hardware configuration of a computer that realizes all the functions of the speech speed conversion device.
FIG. 13 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 1 is implemented by the computer of FIG. 12.
FIG. 14 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 5 is implemented by the computer of FIG. 12.
FIG. 15 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 6 is implemented by the computer of FIG. 12.
FIG. 16 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 7 is implemented by the computer of FIG. 12.
FIG. 17 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 8 is implemented by the computer of FIG. 12.
FIG. 18 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 9 is implemented by the computer of FIG. 12.
FIG. 19 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 10 is implemented by the computer of FIG. 12.
FIG. 20 is a flowchart showing the procedure of processing by the processor when the speech speed conversion device of FIG. 11 is implemented by the computer of FIG. 12.
Embodiment 1.
FIG. 1 shows the configuration of the speech speed conversion device according to Embodiment 1.
The illustrated speech speed conversion device converts the speaking speed of received voice within a voice communication device, and has a voice decoding unit 1, a frequency information generation unit 2, an information change amount calculation unit 3, an extreme value detection unit 4, a sound detection unit 5, a syllable transition determination unit 6, a speech speed calculation unit 7, a conversion rate determination unit 8, and a speech speed conversion unit 9.
First, the operation of each part is outlined.
High-efficiency encoded voice code data Da is input to the illustrated speech speed conversion device. The voice code data Da includes, for each voice frame, pitch period information of the voice, information representing a fixed codebook vector, gain information, and information representing LSP coefficients. A voice frame is hereinafter simply called a frame.
The voice decoding unit 1 decodes the voice code data Da and generates a voice signal (decoded voice signal) Db representing a linear PCM (Pulse Code Modulation) code.
The frequency information generation unit 2 extracts and outputs, at regular time intervals, frequency information Fa from the information generated in the course of decoding in the voice decoding unit 1. The frequency information Fa represents the vocal tract frequency characteristics with which each phoneme is uttered.
The information change amount calculation unit 3 calculates the time change amount (information change amount) Vf of the frequency information Fa output from the frequency information generation unit 2 at regular time intervals.
The extreme value detection unit 4 detects the maximum value Mx and the minimum value Mn of the information change amount Vf calculated by the information change amount calculation unit 3.
The sound detection unit 5 determines, based on the voice signal Db output from the voice decoding unit 1, whether the voice (received voice) represented by the voice code data Da is sounded or silent, and outputs information indicating the determination result, that is, sounded/silent information Lm.
The syllable transition determination unit 6 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected by the extreme value detection unit 4 and on the sounded/silent information Lm output from the sound detection unit 5, and outputs a determination result Sy.
The speech speed calculation unit 7 calculates the speaking speed Ss based on the determination result Sy of the syllable transition determination unit 6. The speaking speed Ss is expressed as the number of syllables per unit time.
The conversion rate determination unit 8 determines the speaking speed conversion rate Rc of the received voice based on the speaking speed Ss calculated by the speech speed calculation unit 7.
The speech speed conversion unit 9 performs speech speed conversion on the voice signal Db based on the speaking speed conversion rate Rc determined by the conversion rate determination unit 8, and outputs the converted voice signal Dc.
The operation of each part will now be described in more detail.
The voice decoding unit 1 receives the high-efficiency encoded voice code data Da, decodes it into a linear PCM code, and outputs a voice signal (decoded voice signal) Db representing the linear PCM code.
FIG. 2 shows a configuration example of the voice decoding unit 1 of FIG. 1. The voice decoding unit 1 shown in FIG. 2 conforms to the CS-ACELP (Conjugate Structure Algebraic Code Excited Linear Prediction) coding method specified in ITU-T (International Telecommunication Union Telecommunication Standardization Sector) Recommendation G.729.
The voice decoding unit 1 shown in FIG. 2 has an adaptive codebook vector decoding unit 101, a gain decoding unit 102, a fixed codebook vector decoding unit 103, an adaptive prefilter unit 104, a predicted gain calculation unit 105, an excitation signal generation unit 106, an LSP coefficient decoding unit 107, an interpolation unit 108, an LPC coefficient conversion unit 109, a synthesis filter unit 110, and a post filter unit 111.
The adaptive codebook vector decoding unit 101 decodes the pitch period information of the voice from the voice code data Da of each received frame and generates an adaptive codebook vector. The adaptive codebook vector represents an excitation signal generated in the past; considering that voice signals are strongly periodic, previously generated excitation signals are stored and reused based on the pitch period information.
The fixed codebook vector decoding unit 103 decodes the fixed codebook vector from the voice code data Da of each received frame.
The adaptive prefilter unit 104 emphasizes the pitch component of the decoded fixed codebook vector.
The gain decoding unit 102 decodes the gain information from the voice code data Da of each received frame, and outputs the gain of the adaptive codebook vector and the gain of the fixed codebook vector.
The predicted gain calculation unit 105 obtains the predicted gain of the fixed codebook vector based on the gain of the fixed codebook vector of each frame output from the gain decoding unit 102 and the past fixed codebook vectors output from the adaptive prefilter unit 104.
The excitation signal generation unit 106 generates an excitation signal Se using the adaptive codebook vector of each frame output from the adaptive codebook vector decoding unit 101, the fixed codebook vector of each frame output from the adaptive prefilter unit 104, the gain of the adaptive codebook vector of each frame output from the gain decoding unit 102, and the predicted gain of the fixed codebook vector output from the predicted gain calculation unit 105.
The LSP coefficient decoding unit 107 decodes the LSP coefficients from the voice code data Da of each received frame.
In the CS-ACELP coding method, the frame length is 10 milliseconds, and 10th-order LSP coefficients are decoded every 10 milliseconds.
The interpolation unit 108 uses the LSP coefficients of the current frame and those of the previous frame to generate, by interpolation, LSP coefficients for the intermediate timing between them, that is, 5 milliseconds before the current frame.
The LPC coefficient conversion unit 109 converts the LSP coefficients of the current frame and the LSP coefficients generated by interpolation into LPC (Linear Predictive Coding) coefficients.
The synthesis filter unit 110 is an all-pole filter whose filter coefficients are the LPC coefficients output from the LPC coefficient conversion unit 109, and generates a synthesized voice signal Sf by taking as input the excitation signal Se generated by the excitation signal generation unit 106.
The post filter unit 111 performs pitch component emphasis and the like on the synthesized voice signal Sf generated by the synthesis filter unit 110 to improve the perceived quality.
The post filter unit 111 is a cascade of a plurality of filters. Among them, the long-term post filter is a filter that emphasizes the pitch component, and a gain coefficient that controls the degree of emphasis of the pitch component is used in this long-term post filter.
The gain coefficient is generated in the processing by the above long-term post filter. Specifically, a delay at which the autocorrelation of the synthesized signal output by the synthesis filter unit 110 becomes large is searched for; if the autocorrelation at that delay is small, the gain coefficient is set to 0, and otherwise a coefficient (greater than 0 and at most 1) that emphasizes that delay component (pitch) is set.
The output of the post filter unit 111 is output as the output of the voice decoding unit 1, that is, as the decoded voice signal Db.
The frequency information generation unit 2 extracts and outputs the frequency information Fa from the information of each frame generated in the course of decoding in the voice decoding unit 1.
The frequency information generation unit 2 shown in FIG. 1 includes an LSP coefficient extraction unit 21.
The LSP coefficient extraction unit 21 extracts LSP (Line Spectral Pair) coefficients for each frame from the information generated by the decoding operation of the voice decoding unit 1, and outputs them as the frequency information Fa.
As described above, in voice decoding by the CS-ACELP coding method, 10th-order LSP coefficients are decoded for each frame; the LSP coefficient extraction unit 21 of FIG. 1 extracts this information, that is, the 10th-order LSP coefficients output from the LSP coefficient decoding unit 107, for each frame, and outputs them as the frequency information Fa. The 10th-order LSP coefficients of each frame can be viewed as constituting one 10-dimensional LSP coefficient vector.
The information change amount calculation unit 3 calculates the distance between the LSP coefficient vector of the current frame and the LSP coefficient vector of one frame before (the inter-vector distance) as the information change amount Vf.
For example, let n (n being an integer) denote the current time (coded frame number) and n-i (i being an integer) denote the time i frames before the current time n. When the LSP coefficient vector at time n is f1(n), f2(n), ..., f10(n), the information change amount calculation unit 3 obtains the inter-vector distance d(n) by the following equation (1).
d(n)
 = {f1(n) - f1(n-1)} × {f1(n) - f1(n-1)}
 + {f2(n) - f2(n-1)} × {f2(n) - f2(n-1)}
     ⋮
 + {f10(n) - f10(n-1)} × {f10(n) - f10(n-1)}    … (1)
In the following description as well, n denotes the current time.
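A direct transcription of equation (1), which as written is the squared Euclidean distance between the consecutive 10th-order LSP coefficient vectors, might look as follows in Python; NumPy is used only for brevity.

```python
import numpy as np

def information_change(lsp_now, lsp_prev):
    """Equation (1): squared Euclidean distance between the LSP coefficient
    vectors of the current frame and of one frame before."""
    diff = np.asarray(lsp_now, dtype=float) - np.asarray(lsp_prev, dtype=float)
    return float(np.dot(diff, diff))
```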
The extreme value detection unit 4 detects the maximum value Mx and the minimum value Mn of the information change amount Vf within a fixed period in the most recent past. The fixed period in the most recent past referred to here is the period of the most recent past Na frames, that is, the period from the current time n back to time n-Na+1, (Na-1) frames earlier (Na being an integer of 4 or more).
The procedure for obtaining the maximum value Mx and the minimum value Mn when the information change amount Vf is the inter-vector distance d(n) is described below.
The extreme value detection unit 4 first determines whether the inter-vector distance d(n-1) at time n-1, one frame before the current time n, is a maximum.
If the inter-vector distance d(n-1) is a maximum, it then detects a minimum within the most recent past Na frames.
For example, when d(n) is smaller than d(n-1) and d(n-1) is larger than d(n-2), d(n-1) is determined to be a maximum.
If this condition is not satisfied, it is determined that d(n-1) is not a maximum.
When d(n-1) is a maximum, the unit proceeds to identify a minimum.
For example, it searches for the latest time m (the largest value of m) such that d(m) is larger than d(m-1), d(m-1) is smaller than d(m-2), and n-Na+2 ≤ m ≤ n-1. If an m satisfying these conditions exists, d(m-1) is determined to be a minimum.
If no m satisfies these conditions, d(n-Na+1) is taken as the minimum for convenience. This minimum for convenience corresponds to the smallest value among d(n-Na+1), d(n-Na+2), ..., d(n-2).
The extreme value detection unit 4 acquires the maximum value (Mx) and the minimum value (Mn) detected as described above.
The value of Na is set so that Na times the frame length is at least the maximum syllable length expected in normal speech. Since a syllable is normally several tens of milliseconds when short and at most about 200 milliseconds when long, it is good to set Na to a value corresponding to 200 milliseconds. Specifically, since the frame length of the CS-ACELP coding method is 10 ms, Na = 200/10 = 20 or so is appropriate.
The sound detection unit 5 determines whether the voice signal output from the voice decoding unit 1 is sounded or silent, and outputs information indicating the determination result, that is, the sounded/silent information Lm.
This determination is made every few milliseconds to several tens of milliseconds, for example every frame period or an integer multiple thereof; in the following it is assumed that the determination is made every frame period.
The sound detection unit 5 judges whether the signal is sounded based on the amplitude of the voice signal Db output from the voice decoding unit 1.
FIG. 3 shows a configuration example of the sound detection unit 5.
The illustrated sound detection unit 5 has a low level detection unit 51, a high level detection unit 52, a logical OR operation unit 53, a hangover addition unit 54, a noise level calculation unit 55, and a threshold setting unit 56.
The low level detection unit 51 compares the voice signal Db with an adaptive threshold value D56 and outputs a signal D51 based on the comparison result. The adaptive threshold value D56 is supplied from the threshold setting unit 56. The low level detection unit 51 has a comparison unit 511 and a determination unit 513.
The comparison unit 511 compares the voice signal Db with the adaptive threshold value D56 and outputs a signal D511 indicating the comparison result. The signal D511 is High if the voice signal Db is larger than the threshold value D56 (that is, if the absolute value of the sample value of the voice signal Db is larger than the threshold value D56), and Low otherwise. The comparison is performed every sample period.
The determination unit 513 outputs the signal D51 based on the signal D511. The signal D51 becomes High when the signal D511 has remained High for a certain period or longer, and becomes Low immediately when the signal D511 becomes Low.
The high level detection unit 52 compares the voice signal Db with a predetermined threshold value D50 and outputs signals D52 and D521 based on the comparison result.
The threshold value D50 is set to a value higher than the maximum background noise level normally expected. The high level detection unit 52 has a comparison unit 521 and a determination unit 523.
The comparison unit 521 compares the voice signal Db with the threshold value D50 and outputs a signal D521 indicating the comparison result. The signal D521 is High if the voice signal Db is larger than the threshold value D50, and Low otherwise. The comparison is performed every sample period.
The determination unit 523 outputs the signal D52 based on the signal D521. The signal D52 becomes High when the signal D521 has remained High for a certain period or longer, and becomes Low immediately when the signal D521 becomes Low.
The logical OR operation unit 53 performs an operation to obtain the logical OR of the signals D51 and D52. Its output signal D53 is High if at least one of the signals D51 and D52 is High, and Low otherwise.
The hangover addition unit 54 performs hangover addition processing on the output signal D53 of the logical OR operation unit 53, and outputs the resulting signal as the sounded/silent information Lm.
The hangover processing outputs a signal (Lm) that changes from Low to High immediately when the input signal (D53) changes from Low to High, and changes from High to Low after a fixed delay time when the input signal (D53) changes from High to Low.
The noise level calculation unit 55 calculates an average D55a of the absolute values of the sample values of the voice signal Db for each fixed period, and obtains a noise level value D55 based on the calculated average value D55a. For example, a moving average of the calculated average values D55a over a relatively long period is updated as the noise level value D55. However, the average values D55a during periods in which the signal D521 is High are not used for calculating the moving average; during such periods the previously calculated moving average is maintained.
The threshold setting unit 56 adjusts the adaptive threshold value D56 in accordance with the background noise level value D55 output from the noise level calculation unit 55. The adaptive threshold value D56 is adjusted to a value slightly larger than the calculated noise level value D55, and is changed so as to follow changes in the calculated noise level value D55.
The operation of the sound detection unit 5 will be described below with reference to FIGS. 4(a) to 4(h).
In the illustrated example, the threshold value D50 is set to the value shown in FIG. 4(a), and it is assumed that the voice signal Db changes as shown in FIG. 4(a).
During the period in which the voice signal Db is larger than the threshold value D50, the output signal D521 of the comparison unit 521 becomes High as shown in FIG. 4(c), and the output signal D52 of the determination unit 523 becomes High with a slight delay, as shown in FIG. 4(d).
The average value D55a and the noise level value D55 calculated by the noise level calculation unit 55 change as shown in FIG. 4(b), and the adaptive threshold value D56 calculated by the threshold setting unit 56 changes as shown in FIG. 4(a).
The noise level value D55 shown in FIG. 4(b) and the adaptive threshold value D56 shown in FIG. 4(a) change in accordance with the average value D55a, but during the period in which the voice signal Db is larger than the threshold value D50 (the period in which the signal D521 is High) they do not change and are maintained at their immediately preceding values.
When the voice signal Db becomes larger than the threshold value D56 (time ta), the output signal D511 of the comparison unit 511 becomes High as shown in FIG. 4(e), and the output signal D51 of the determination unit 513 becomes High with a slight delay, as shown in FIG. 4(f).
When the voice signal Db becomes equal to or less than the threshold value D56 (time tb), the output signal D511 of the comparison unit 511 becomes Low as shown in FIG. 4(e), and the output signal D51 of the determination unit 513 also becomes Low as shown in FIG. 4(f).
Thereafter, when the voice signal Db again becomes larger than the threshold value D56 (time tc), the output signal D511 of the comparison unit 511 becomes High as shown in FIG. 4(e), and the output signal D51 of the determination unit 513 becomes High with a slight delay, as shown in FIG. 4(f).
 ハングオーバ付加部54の出力信号Lmは、図4(h)に示すように、信号D53の立ち上がりとともに立ち上がり、信号D53の立下りから若干遅れて立ち下がる。 As shown in FIG. 4H, the output signal Lm of the hangover addition unit 54 rises with the rise of the signal D53 and falls with a slight delay from the fall of the signal D53.
 信号Lmは、Highときに有音であることを示し、Lowであるときに無音であることを示す。 The signal Lm indicates that there is sound when it is High, and that it is silent when it is Low.
As described above, the sound detection unit 5 generates the adaptive threshold value D56, which changes according to the noise level value, judges the signal to be sounded if the voice signal Db is larger than the adaptive threshold value D56 or larger than the threshold value D50, and judges it to be silent otherwise, so that the sounded/silent determination can be made appropriately.
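A much-simplified, per-frame sketch of this behaviour is given below; it collapses the per-sample comparisons and hold times of units 511/513/521/523 into frame-level tests, and all constants (the fixed threshold, the margin above the noise level, the smoothing factor, and the hangover length) are illustrative assumptions, since the patent does not give concrete values.

```python
import numpy as np

class SoundDetector:
    """Simplified sketch of the sound detection unit 5 (frame-level)."""
    def __init__(self, d50=8000.0, margin=2.0, alpha=0.99, hangover=20):
        self.d50 = d50            # fixed high threshold D50
        self.margin = margin      # D56 = margin * noise level D55
        self.alpha = alpha        # smoothing factor for the noise level
        self.hangover = hangover  # frames to hold "sounded" after speech
        self.noise = 0.0          # noise level value D55
        self.hold = 0

    def frame_is_sounded(self, frame):
        level = float(np.mean(np.abs(frame)))    # average magnitude (D55a)
        loud = level > self.d50                  # high level detection
        if not loud:                             # freeze D55 during speech
            self.noise = self.alpha * self.noise + (1 - self.alpha) * level
        d56 = self.margin * self.noise           # adaptive threshold D56
        if loud or level > d56:                  # logical OR of D51/D52
            self.hold = self.hangover
            return True
        if self.hold > 0:                        # hangover period
            self.hold -= 1
            return True
        return False
```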
The syllable transition determination unit 6 determines the presence or absence of a syllable transition only when the maximum value Mx and the minimum value Mn have been detected by the extreme value detection unit 4 and the sound detection unit 5 has detected a sounded state, and outputs the determination result Sy.
The syllable transition determination unit 6 determines that there is a syllable transition when the maximum value Mx is larger than a predetermined threshold value (first threshold value) Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than a predetermined threshold value (second threshold value) Th2, and determines that there is no syllable transition otherwise.
The speech speed calculation unit 7 calculates the speaking speed Ss every fixed calculation cycle based on the determination result Sy of the syllable transition determination unit 6 and the sounded/silent information Lm from the sound detection unit 5.
The fixed calculation cycle is, for example, 1 second.
In calculating the speaking speed, the number of syllables per unit time is obtained from the number of syllables in a certain period consisting of the most recent sounded time as of each time point and the length of that period, and is output as the speaking speed Ss. Here, the "certain period" is an integer multiple of the frame period, for example about 3 to 10 seconds.
 「直近の過去の有音であった一定の期間」とは、現在時刻から有音と判定された時間のみを遡って一定の期間と言う意味であり、従って、直近の過去のうち無音であった期間は除外される。
 例えば一定の期間が3秒であり、有音の時間と無音の時間が半々であれば過去6秒まで遡り、その6秒間の音節数が30ならば、話速は30÷3=10音節/秒となる。
"A certain period of time that was sounded in the latest past" means that only the time determined to be sounded from the current time is traced back to a certain period of time, and therefore, there is no sound in the latest past. Period is excluded.
For example, if a certain period is 3 seconds, if the sounded time and the silent time are half and half, it goes back to the past 6 seconds, and if the number of syllables in the 6 seconds is 30, the speaking speed is 30/3 = 10 syllables / It will be seconds.
 Averaging is performed only over the time determined to contain sound in this way because, if the time during which the speaker is silent (not speaking) were included in the averaging time, a speech speed different from the actual one would be obtained.
 Once the speech speed has been calculated, the calculated speech speed may continue to be used as-is as long as the silence continues.
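 A minimal sketch of this calculation, assuming a per-frame history of (sound, syllable-transition) flags and a 10 ms frame as in the CS-ACELP example; the function and constant names are illustrative. It reproduces the 30 ÷ 3 = 10 example above.

    # Minimal sketch of the speech speed calculation over sounded time only.
    FRAME_SEC = 0.01        # 10 ms frames, as in the CS-ACELP example
    SOUND_PERIOD_SEC = 3.0  # the "fixed period" of sounded time (3 to 10 s)

    def speech_speed(history):
        """history: list of (is_sound, is_transition) per frame, newest last.
        Walks back over sounded frames only, counting syllable transitions,
        until SOUND_PERIOD_SEC of sounded time has been accumulated."""
        sound_time = 0.0
        syllables = 0
        for is_sound, is_transition in reversed(history):
            if not is_sound:
                continue                   # silent frames are excluded
            sound_time += FRAME_SEC
            syllables += int(is_transition)
            if sound_time >= SOUND_PERIOD_SEC:
                break
        if sound_time < SOUND_PERIOD_SEC:
            return None                    # not enough sounded time yet
        return syllables / sound_time      # syllables per second (Ss)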
 The conversion rate determination unit 8 determines the speech speed conversion rate Rc based on the speech speed Ss calculated by the speech speed calculation unit 7.
 The speech speed conversion rate Rc is determined in the same cycle as the calculation of the speech speed Ss; in other words, the conversion rate is calculated each time the speech speed is calculated.
 For example, when the target speech speed after conversion is St syllables/second and the speech speed calculated by the speech speed calculation unit 7 is Sr syllables/second, St/Sr is taken as the speech speed conversion rate Rc for speech in the sound state.
 However, since it is necessary to lower the speech speed to make the voice easier to hear, the speech speed conversion rate Rc is set to 1 when St/Sr is larger than 1.
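 The rule reduces to clipping the ratio at 1; a one-line sketch follows, where the guard against a zero measured speed is an added assumption, not part of the description:

    # Minimal sketch of the conversion rate rule: Rc = St/Sr, clipped to 1.
    def conversion_rate(target_speed_st, measured_speed_sr):
        if measured_speed_sr <= 0:
            return 1.0  # assumed guard; not specified above
        return min(1.0, target_speed_st / measured_speed_sr)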
 The speech speed conversion unit 9 converts the speech speed according to the speech speed conversion rate Rc from the conversion rate determination unit 8 and outputs the converted audio signal Dc.
 In the conversion, the conversion rate determined by the conversion rate determination unit 8 at each calculation cycle (for example, every second) is applied until the next calculation cycle, or until the next new conversion rate is determined.
 The speech speed conversion can be realized using, for example, well-known algorithms such as PICOLA (Pointer Interval Controlled Overlap and Add) or TDHS (Time Domain Harmonic Scaling).
 Here, as described above, the speech speed conversion rate Rc input from the conversion rate determination unit 8 is 1 or less, but if the speech speed is always lowered, the processing delay of the speech speed conversion keeps increasing and the real-time nature of the voice communication cannot be maintained.
 Therefore, the sound/silence information Lm from the sound detection unit 5 is input to the speech speed conversion unit 9, and for portions determined to be silent the speech speed is raised, or the portions determined to be silent are deleted, so that the audio signal is not delayed beyond a certain amount.
 Further, when the delay exceeds a certain value, the voice may be output without speech speed conversion even if sound is present.
 As described above, in the speech speed conversion device of FIG. 1, the frequency information generation unit 2 extracts the LSP coefficients obtained in the voice decoding process in the voice decoding unit 1, the information change amount calculation unit 3 obtains the amount of change of the LSP coefficients at fixed time intervals, syllable transitions of the voice signal are detected based on this amount of change, and speech speed conversion is performed based on the detection result.
 Various modifications of the above speech speed conversion device are possible.
 For example, in the above example, the information change amount calculation unit 3 regards the 10th-order LSP coefficients as one 10-dimensional vector and obtains the inter-vector distance d(n). However, it is not always necessary to calculate the amount of change using all of the 10th-order LSP coefficients.
 For example, among the LSP coefficients f1(n), f2(n), ..., f10(n) at time n, a vector consisting of only the LSP coefficients f1(n), f2(n), f3(n) may be used, and the inter-vector distance d(n) may be obtained by the following equation (2).

    d(n) = {f1(n) − f1(n−1)} × {f1(n) − f1(n−1)}
         + {f2(n) − f2(n−1)} × {f2(n) − f2(n−1)}
         + {f3(n) − f3(n−1)} × {f3(n) − f3(n−1)}   …(2)
 Of the LSP coefficients, the low-order coefficients correspond to low-frequency components and the high-order coefficients correspond to high-frequency components, so the above makes it possible to obtain an amount of change that focuses only on the variation of the low-frequency components.
 Since the change in vocal tract frequency characteristics during utterance is larger on the low-frequency side, this method can still detect syllable transitions, and it is advantageous in that the amount of calculation is smaller.
 It also has the advantage that the influence of background noise superimposed on the audio signal is easier to exclude.
 Further, in the above example, the information change amount calculation unit 3 obtains the distance between the latest LSP coefficient vector and the LSP coefficient vector one frame before it, but the distance to an LSP coefficient vector a plurality of frames before may be obtained instead.
 For example, when obtaining the distance to the LSP coefficient vector two frames before, the inter-vector distance d(n) is obtained by the following equation (3).

    d(n) = {f1(n) − f1(n−2)} × {f1(n) − f1(n−2)}
         + {f2(n) − f2(n−2)} × {f2(n) − f2(n−2)}
         + ···
         + {f10(n) − f10(n−2)} × {f10(n) − f10(n−2)}   …(3)
 When detecting the temporal change of the vocal tract frequency characteristics of the voice by the inter-vector distance d(n), if the time difference between the two vectors for which the distance is calculated is too long, the above temporal change becomes difficult to detect.
 However, as described above, the frame length of the voice decoding process in the CS-ACELP coding method is 10 milliseconds, which is sufficiently short relative to the temporal change of the vocal tract frequency characteristics of the voice, so no problem arises even if the distance to a vector a plurality of frames before is used. Specifically, the time difference between the two vectors for which the distance is calculated need only be set to a value shorter than the period of syllable transitions in utterance.
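 Equations (2) and (3) differ only in the coefficient subset and the frame lag, so both can be covered by one hedged sketch; the function name and argument layout are assumptions:

    # Minimal sketch of the inter-vector distance of equations (2) and (3):
    # squared differences summed over a chosen subset of LSP coefficients,
    # against the vector "lag" frames before.
    def vector_distance(lsp_history, n, order=10, lag=1):
        """lsp_history[t] is the LSP coefficient vector at frame t.
        order=3, lag=1 gives equation (2); order=10, lag=2 gives equation (3)."""
        cur, prev = lsp_history[n], lsp_history[n - lag]
        return sum((cur[i] - prev[i]) ** 2 for i in range(order))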
 As described above, according to the first embodiment, the frequency information generation unit 2 extracts the LSP coefficients obtained in the voice decoding process in the voice decoding unit 1, and the information change amount calculation unit 3 obtains the amount of change of the LSP coefficients at fixed time intervals and detects syllable transitions of the voice signal based on this amount of change. This eliminates the need to perform spectral analysis of the decoded voice signal by the linear prediction method (LPC) based on an all-pole model or by the fast Fourier transform (FFT) and then obtain a spectral feature vector. Therefore, speech speed conversion can be performed with a small amount of calculation.
 Further, the LSP coefficients used for detecting syllable transitions are calculated in the process of encoding the voice and are the information transmitted to the voice decoding unit 1. They are therefore closer to the vocal tract frequency characteristics of the voice represented by the voice signal before encoding than frequency characteristics obtained by performing spectral analysis on the voice signal after voice decoding. Consequently, the speech speed can be measured with high accuracy even for a voice signal obtained by decoding voice code data.
Embodiment 2.
 In the first embodiment, the LSP coefficients were extracted as the frequency information Fa, and syllable transitions were detected based on their temporal change. It is also possible to extract LPC coefficients instead of LSP coefficients, calculate an LPC mel-cepstrum or LPC cepstrum from the extracted LPC coefficients, and detect syllable transitions using the calculated LPC mel-cepstrum or LPC cepstrum.
 FIG. 5 shows the configuration of the speech speed conversion device according to the second embodiment. The speech speed conversion device shown in FIG. 5 is generally the same as the speech speed conversion device of the first embodiment described with reference to FIG. 1, but differs in the following points. Namely, it includes a frequency information generation unit 2b and an information change amount calculation unit 3b in place of the frequency information generation unit 2 and the information change amount calculation unit 3. The frequency information generation unit 2b includes an LPC coefficient extraction unit 22 and a mel-cepstrum calculation unit 23.
 The LPC coefficient extraction unit 22 extracts the LPC coefficients, frame by frame, from the information generated by the decoding operation of the voice decoding unit 1.
 For example, when the voice decoding unit 1 performs voice decoding by the CS-ACELP coding method, it extracts part of the output of the LPC coefficient conversion unit 109 in FIG. 2.
 Specifically, the LPC coefficient conversion unit 109 of the voice decoding unit 1 in FIG. 2 converts the LSP coefficients of each frame and the LSP coefficients generated by interpolation in the interpolation unit 108 into LPC coefficients and outputs them, and the LPC coefficient extraction unit 22 extracts the LPC coefficients generated by converting the LSP coefficients of each frame. For example, 10th-order LPC coefficients are extracted.
 The mel-cepstrum calculation unit 23 converts the LPC coefficients extracted by the LPC coefficient extraction unit 22 into an LPC mel-cepstrum.
 In voice analysis-synthesis processing, an LPC mel-cepstrum of the 10th to 25th order is generally used, but there is no point in making the order much larger than the 10th order of the original LPC coefficients. Therefore, an order of about 10 to 15 is appropriate for the LPC mel-cepstrum generated by the mel-cepstrum calculation unit 23. In the following description, the order of the LPC mel-cepstrum is assumed to be 10.
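 For reference, the unwarped LPC cepstrum mentioned later in this embodiment as an alternative can be obtained from the LPC coefficients by the standard recursion; the mel-warped variant additionally requires frequency warping, which is omitted here. This sketch assumes the synthesis-filter convention 1/(1 − Σ a_k z^(−k)) and drops the gain term; it is an illustration, not the device's prescribed implementation.

    # Minimal sketch of the standard LPC-to-cepstrum recursion.
    # Convention assumed: a = [a1, ..., ap] of 1/(1 - sum_k a[k] z^-k).
    def lpc_to_cepstrum(a, n_ceps):
        p = len(a)
        c = [0.0] * (n_ceps + 1)  # c[0] unused (gain term omitted)
        for n in range(1, n_ceps + 1):
            acc = a[n - 1] if n <= p else 0.0
            for k in range(max(1, n - p), n):
                acc += (k / n) * c[k] * a[n - k - 1]  # (k/n) c_k a_{n-k}
            c[n] = acc
        return c[1:]  # [c1, ..., c_{n_ceps}]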
 The 10th-order LPC mel-cepstrum of each frame generated by the mel-cepstrum calculation unit 23 can be regarded as constituting a 10-dimensional LPC mel-cepstrum vector.
 The information change amount calculation unit 3b of FIG. 5 is similar to the information change amount calculation unit 3 of FIG. 1, but calculates the information change amount Vf based on the output of the frequency information generation unit 2b instead of the output of the frequency information generation unit 2 of FIG. 1.
 That is, the information change amount calculation unit 3b calculates, as the information change amount Vf, the distance between the LPC mel-cepstrum vector of the current frame output from the mel-cepstrum calculation unit 23 of the frequency information generation unit 2b and the LPC mel-cepstrum vector one frame before.
 The operation of the information change amount calculation unit 3b is basically the same as that of the information change amount calculation unit 3 of the first embodiment, except that the input is an LPC mel-cepstrum vector rather than an LSP coefficient vector.
 In other respects, the operation of the speech speed conversion device of the second embodiment is the same as that of the speech speed conversion device of the first embodiment.
 In the above description, the information change amount calculation unit 3b regards the 10th-order LPC mel-cepstrum as one 10-dimensional vector and obtains the inter-vector distance d(n); however, as in the first embodiment, it is not always necessary to calculate the amount of change using all of the 10th-order LPC mel-cepstrum.
 Also, instead of the distance between the latest LPC mel-cepstrum vector and the LPC mel-cepstrum vector one frame before it, the distance between the latest LPC mel-cepstrum vector and an LPC mel-cepstrum vector a plurality of frames before it may be obtained.
 Further, in the above description, the LPC mel-cepstrum is calculated, the amount of change of the LPC mel-cepstrum at fixed time intervals is obtained, and syllable transitions are detected based on that amount of change; however, the LPC cepstrum may be calculated instead of the LPC mel-cepstrum, the amount of change of the LPC cepstrum at fixed time intervals may be obtained, and syllable transitions may be detected based on that amount of change. As the amount of change of the LPC cepstrum at fixed time intervals, for example, the distance between LPC cepstrum vectors each composed of an LPC cepstrum can be used.
 As described above, according to the second embodiment, the frequency information generation unit 2b extracts the LPC coefficients obtained in the voice decoding process in the voice decoding unit 1 and calculates the LPC mel-cepstrum or LPC cepstrum from the LPC coefficients, and the information change amount calculation unit 3b obtains the amount of change of the LPC mel-cepstrum or LPC cepstrum at fixed time intervals and detects syllable transitions of the voice signal based on this amount of change. This eliminates the need to perform spectral analysis of the decoded voice signal by the linear prediction method (LPC) based on an all-pole model. Therefore, speech speed conversion can be performed with a small amount of calculation.
 Further, the LPC mel-cepstrum or LPC cepstrum used for detecting syllable transitions is calculated based on the LSP coefficients calculated in the process of encoding the voice, and these LSP coefficients are the information transmitted to the voice decoding unit 1. They are therefore closer to the vocal tract frequency characteristics of the voice represented by the voice signal before encoding than frequency characteristics obtained by performing spectral analysis on the voice signal after voice decoding.
 Consequently, the speech speed can be measured with high accuracy even for a voice signal obtained by decoding voice code data generated by high-efficiency coding.
 In addition, the LPC cepstrum and LPC mel-cepstrum are commonly used in speech recognition, and syllable transitions can be determined with higher accuracy than when using the LSP coefficients shown in the first embodiment.
Embodiment 3.
 In the first embodiment, the LSP coefficients were extracted as the frequency information of the voice, and syllable transitions were detected based on their temporal change. The extracted LSP coefficients may be thinned out, and syllable transitions may then be detected based on the temporal change of the thinned-out coefficients.
 FIG. 6 shows the configuration of the speech speed conversion device according to the third embodiment. The speech speed conversion device shown in FIG. 6 is generally the same as the speech speed conversion device of the first embodiment described with reference to FIG. 1, but differs in the following points. Namely, it includes a frequency information generation unit 2c, an information change amount calculation unit 3c, and an extreme value detection unit 4c in place of the frequency information generation unit 2, the information change amount calculation unit 3, and the extreme value detection unit 4. The frequency information generation unit 2c includes an LSP coefficient extraction unit 21 and a thinning unit 24.
 The LSP coefficient extraction unit 21 is the same as the LSP coefficient extraction unit 21 of the first embodiment and operates in the same manner.
 The thinning unit 24 thins out the LSP coefficients (frequency information) extracted by the LSP coefficient extraction unit 21 at a thinning rate M. In this thinning, the LSP coefficients output from the LSP coefficient extraction unit 21 for each frame are extracted only once every M frames (M is an integer of 2 or more).
 For example, as in the first embodiment, when one frame length is 10 milliseconds, the voice decoding unit 1 decodes the LSP coefficients every 10 milliseconds, and the LSP coefficient extraction unit 21 extracts the LSP coefficients every 10 milliseconds.
 The thinning unit 24 extracts the LSP coefficients extracted by the LSP coefficient extraction unit 21 every M frames, that is, every 10 × M milliseconds.
 The operation of the information change amount calculation unit 3c is basically the same as that of the information change amount calculation unit 3 of the first embodiment, but processing is performed every M frames rather than every frame.
 The information change amount calculation unit 3c also calculates the distance between LSP coefficient vectors output successively from the thinning unit 24. As a result, the information change amount calculation unit 3c obtains the distance between the latest (current frame) LSP coefficient vector and the LSP coefficient vector M frames before it.
 As described above, the frame length of the voice decoding process in the CS-ACELP coding method is 10 milliseconds, which is sufficiently short relative to the temporal change of the vocal tract frequency characteristics of the voice, so no problem arises unless M becomes excessively large. Specifically, the value of M need only be set so that 10 × M milliseconds is shorter than the period of syllable transitions in utterance.
 The operation of the extreme value detection unit 4c is basically the same as that of the extreme value detection unit 4 of the first embodiment.
 However, like the information change amount calculation unit 3c, it performs processing every M frames rather than every frame.
 The extreme value detection unit 4c detects the maximum value Mx and the minimum value Mn of the information change amounts Vf of the most recent past Nb frames input to the extreme value detection unit 4c. Nb is smaller than Na in the description of the extreme value detection unit 4 of the first embodiment. This is because the information change amount Vf is input to the extreme value detection unit 4c of the third embodiment every M frames rather than every frame. For example, Nb is set to a value equal to Na/M. When Nb is set to a value corresponding to 200 milliseconds, Nb = Na/M = 20/M.
 In detecting a maximum, instead of determining whether the information change amount Vf at the time (n−1) one frame before the current time n is a maximum, the extreme value detection unit 4c determines whether the information change amount Vf at the time (n−M), M frames before the current frame, is a maximum.
 For example, when d(n) is smaller than d(n−M) and d(n−M) is larger than d(n−2M), d(n−M) is determined to be a maximum.
 If this condition is not satisfied, d(n−M) is determined not to be a maximum.
 In identifying a minimum, for example, the latest time m (the largest value) satisfying d(m) larger than d(m−M), d(m−M) smaller than d(m−2M), and n−Nb+2M ≤ m ≤ n−M is searched for. If an m satisfying these conditions exists, d(m−M) is determined to be a minimum.
 If no m satisfying these conditions exists, d(n−Nb+M) is taken as the minimum for convenience. This minimum for convenience corresponds to the smallest value among d(n−Nb+M), d(n−Nb+2M), ..., d(n−2M).
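 The maximum and minimum search on the thinned sequence can be sketched as follows; the indexing follows the inequalities quoted above literally, and the unit bookkeeping for Nb (frames versus thinned samples) and the return convention are assumptions for illustration:

    # Minimal sketch of extremum detection on the thinned sequence.
    # d is a list of information change amounts indexed by frame time;
    # only times spaced M frames apart are examined.
    def find_extrema(d, n, M, Nb):
        """Return (Mx, Mn), or None when d(n-M) is not a maximum."""
        # Maximum test at time n - M.
        if not (d[n] < d[n - M] and d[n - M] > d[n - 2 * M]):
            return None
        mx = d[n - M]
        # Search the latest m with n-Nb+2M <= m <= n-M such that
        # d(m) > d(m-M) and d(m-M) < d(m-2M).
        for m in range(n - M, n - Nb + 2 * M - 1, -M):
            if d[m] > d[m - M] and d[m - M] < d[m - 2 * M]:
                return mx, d[m - M]
        # No true minimum found: use d(n-Nb+M) as the minimum for convenience.
        return mx, d[n - Nb + M]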
 In other respects, the operation of the speech speed conversion device of the third embodiment is the same as that of the speech speed conversion device of the first embodiment.
 The third embodiment also provides the same effects as the first embodiment.
 Further, since the thinning unit 24 extracts the LSP coefficients only once every M frames, the operation frequency of the information change amount calculation unit 3c and the extreme value detection unit 4c is reduced, making it possible to reduce the amount of calculation even further than in the first embodiment.
Embodiment 4.
 The same thinning as in the third embodiment may also be applied to the speech speed conversion device of the second embodiment, which detects syllable transitions by obtaining the amount of change of the LPC mel-cepstrum.
 FIG. 7 shows the configuration of the speech speed conversion device according to the fourth embodiment. The speech speed conversion device shown in FIG. 7 is generally the same as the speech speed conversion device of the second embodiment described with reference to FIG. 5, but differs in the following points. Namely, it includes a frequency information generation unit 2d, an information change amount calculation unit 3d, and an extreme value detection unit 4c in place of the frequency information generation unit 2b, the information change amount calculation unit 3b, and the extreme value detection unit 4. The frequency information generation unit 2d includes an LPC coefficient extraction unit 22, a thinning unit 24d, and a mel-cepstrum calculation unit 23d.
 The LPC coefficient extraction unit 22 extracts the LPC coefficients from the information generated by the decoding operation of the voice decoding unit 1, like the LPC coefficient extraction unit 22 of the second embodiment shown in FIG. 5.
 The thinning unit 24d is similar to the thinning unit 24 of FIG. 6, except that it thins out the output of the LPC coefficient extraction unit 22 rather than the output of the LSP coefficient extraction unit 21. In this thinning, the LPC coefficients output from the LPC coefficient extraction unit 22 for each frame are extracted only once every M frames (M is an integer of 2 or more).
 For example, as in the first embodiment, when one frame length is 10 milliseconds, the voice decoding unit 1 decodes the LSP coefficients every 10 milliseconds, and the LPC coefficient extraction unit 22 extracts the LPC coefficients every 10 milliseconds.
 The thinning unit 24d extracts the LPC coefficients extracted by the LPC coefficient extraction unit 22 every M frames, that is, every 10 × M milliseconds.
 The mel-cepstrum calculation unit 23d converts the LPC coefficients output from the thinning unit 24d into an LPC mel-cepstrum. The operation of the mel-cepstrum calculation unit 23d is basically the same as that of the mel-cepstrum calculation unit 23 of FIG. 5, but processing is performed every M frames rather than every frame.
 The information change amount calculation unit 3d of FIG. 7 is similar to the information change amount calculation unit 3b of FIG. 5, but calculates the information change amount Vf based on the output of the mel-cepstrum calculation unit 23d instead of the output of the mel-cepstrum calculation unit 23 of FIG. 5.
 That is, the information change amount calculation unit 3d regards the 10th-order LPC mel-cepstrum output from the mel-cepstrum calculation unit 23d of the frequency information generation unit 2d as one 10-dimensional vector and obtains the inter-vector distance d(n).
 The operation of the information change amount calculation unit 3d is basically the same as that of the information change amount calculation unit 3c of FIG. 6, except that the input is the LPC mel-cepstrum rather than the LSP coefficients.
 The operation of the information change amount calculation unit 3d is also basically the same as that of the information change amount calculation unit 3b of FIG. 5, except that processing is performed every M frames rather than every frame.
 The extreme value detection unit 4c of the fourth embodiment is the same as the extreme value detection unit 4c of FIG. 6 and operates in the same manner.
 In other respects, the operation of the speech speed conversion device of the fourth embodiment is the same as that of the speech speed conversion device of the second embodiment.
 The fourth embodiment also provides the same effects as the second embodiment.
 Further, since the thinning unit 24d extracts the LPC coefficients only once every M frames, the operation frequency of the mel-cepstrum calculation unit 23d, the information change amount calculation unit 3d, and the extreme value detection unit 4c is reduced, and the amount of calculation can be made smaller than in the second embodiment.
Embodiment 5.
 To the syllable transition detection of the first to fourth embodiments, a function of detecting whether the voice is a voiced sound or an unvoiced sound may be added, and syllable transitions may be detected using the voiced/unvoiced information in combination.
 The fifth embodiment is a modification of the first embodiment in which voiced/unvoiced information is used in combination.
 FIG. 8 shows the configuration of the speech speed conversion device according to the fifth embodiment. The speech speed conversion device shown in FIG. 8 is generally the same as the speech speed conversion device of FIG. 1, but differs in the following points. Namely, a voiced information extraction unit 10 is added, and a syllable transition determination unit 6e is provided in place of the syllable transition determination unit 6. The configuration of the frequency information generation unit 2 in FIG. 8 is the same as in FIG. 1, and its illustration is omitted.
 The voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the process of voice decoding in the voice decoding unit 1.
 As described above, when the voice decoding unit 1 performs voice decoding by the CS-ACELP coding method, its configuration is, for example, as shown in FIG. 2, and the voiced/unvoiced information Vc is obtained from the post-filter unit 111 in FIG. 2.
 As described above, the long-term post-filter in the post-filter unit 111 is a filter that emphasizes the pitch component, and this long-term post-filter uses a gain coefficient that controls the degree of emphasis of the pitch component.
 When the voice is a voiced sound, the sound quality can be improved by emphasizing the pitch component, but when it is not a voiced sound, the sound quality would deteriorate, so the gain coefficient is set to zero and no pitch component emphasis is performed. Therefore, the gain coefficient of this long-term post-filter can be used as the voiced/unvoiced information Vc. That is, the decoded voice can be regarded as voiced when this coefficient is not 0 and as unvoiced when it is 0.
 The voiced information extraction unit 10 extracts and outputs the above gain coefficient as the voiced/unvoiced information Vc.
 Like the syllable transition determination unit 6 of FIG. 1, the syllable transition determination unit 6e determines the presence or absence of a syllable transition only when the extreme value detection unit 4 has detected a maximum value Mx and a minimum value Mn and the sound detection unit 5 has detected sound.
 The syllable transition determination unit 6e determines that a syllable transition has occurred when the maximum value Mx is larger than the predetermined threshold Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold Th2.
 The syllable transition determination unit 6e also determines that a syllable transition has occurred when the voiced/unvoiced information Vc input from the voiced information extraction unit 10 changes from indicating "unvoiced" to indicating "voiced".
 If neither of these applies, the syllable transition determination unit 6e determines that no syllable transition has occurred.
 Instead of determining that a syllable transition has occurred when the voiced/unvoiced information Vc input from the voiced information extraction unit 10 changes from indicating "unvoiced" to indicating "voiced", the syllable transition determination unit 6e may determine that a syllable transition has occurred when the voiced/unvoiced information Vc changes from indicating "voiced" to indicating "unvoiced".
 In short, in addition to determining that a syllable transition has occurred based on the maximum value Mx and the minimum value Mn, the syllable transition determination unit 6e may determine that a syllable transition has occurred when a change from the unvoiced state to the voiced state is detected based on the voiced/unvoiced information Vc, or may determine that a syllable transition has occurred when a change from the voiced state to the unvoiced state is detected.
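 Combining the two criteria, a hedged sketch of the 6e decision follows, treating Vc as the long-term post-filter gain (nonzero meaning voiced) and reusing the illustrative thresholds introduced earlier:

    # Minimal sketch of the combined decision in determination unit 6e:
    # the extrema-based test OR an unvoiced-to-voiced change of Vc.
    # Threshold values and the Vc encoding are illustrative assumptions.
    def syllable_transition_6e(mx, mn, vc_prev, vc_cur, th1=0.02, th2=0.01):
        by_extrema = mx is not None and mx > th1 and (mx - mn) > th2
        by_voicing = (vc_prev == 0.0) and (vc_cur != 0.0)  # unvoiced -> voiced
        return by_extrema or by_voicing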
 The determination of the presence or absence of a syllable transition based on the voiced/unvoiced information Vc in the syllable transition determination unit 6e may be performed independently of the detection of the maximum value Mx and the minimum value Mn in the extreme value detection unit 4.
 That is, even if the extreme value detection unit 4 has not detected a maximum value Mx and a minimum value Mn, it may be determined that a syllable has transitioned based on the voiced/unvoiced information Vc.
 The detection of syllable transitions by the maximum value Mx and the minimum value Mn in the extreme value detection unit 4, that is, detection based on the temporal change of the vocal tract frequency characteristics, may miss syllable transitions when, for example, the pronunciation is not clear.
 The combined use of changes in the voiced/unvoiced information Vc output from the voiced information extraction unit 10 can compensate for such misses.
 The fifth embodiment also provides the same effects as the first embodiment.
 Further, since the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the process of voice decoding in the voice decoding unit 1 and the syllable transition determination unit 6e determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, the measurement accuracy of the speech speed can be further improved compared with the first embodiment.
Embodiment 6.
 The detection of syllable transitions using the LPC mel-cepstrum or LPC cepstrum shown in the second embodiment and the detection of syllable transitions using the voiced/unvoiced information Vc shown in the fifth embodiment can also be used in combination.
 The speech speed conversion device shown in FIG. 9 is generally the same as the speech speed conversion device of FIG. 5, but differs in that a voiced information extraction unit 10 is added and a syllable transition determination unit 6e is provided in place of the syllable transition determination unit 6. The configuration of the frequency information generation unit 2b in FIG. 9 is the same as in FIG. 5, and its illustration is omitted.
 The voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 for the fifth embodiment and operate in the same manner.
 The sixth embodiment also provides the same effects as the second embodiment.
 Further, the additional effect described for the fifth embodiment is obtained. That is, since the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the process of voice decoding in the voice decoding unit 1 and the syllable transition determination unit 6e determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, the measurement accuracy of the speech speed can be further improved compared with the second embodiment.
Embodiment 7.
 The syllable transition detection using the voiced/unvoiced information shown in the fifth embodiment can also be used in combination with the speech speed conversion device of the third embodiment described with reference to FIG. 6.
 The speech speed conversion device of FIG. 10 is generally the same as the speech speed conversion device of FIG. 6, but differs in that a voiced information extraction unit 10 is added and a syllable transition determination unit 6e is provided in place of the syllable transition determination unit 6. The configuration of the frequency information generation unit 2c in FIG. 10 is the same as in FIG. 6, and its illustration is omitted.
 The voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 for the fifth embodiment and operate in the same manner.
 The seventh embodiment also provides the same effects as the third embodiment.
 Further, the same additional effect as in the fifth embodiment is obtained. That is, since the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the process of voice decoding in the voice decoding unit 1 and the syllable transition determination unit 6e determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, the measurement accuracy of the speech speed can be further improved compared with the third embodiment.
Embodiment 8.
 The syllable transition detection using the voiced/unvoiced information shown in the fifth embodiment can also be used in combination with the speech speed conversion device of the fourth embodiment described with reference to FIG. 7.
 The speech speed conversion device shown in FIG. 11 is generally the same as the speech speed conversion device of FIG. 7, but differs in that a voiced information extraction unit 10 is added and a syllable transition determination unit 6e is provided in place of the syllable transition determination unit 6. The configuration of the frequency information generation unit 2d in FIG. 11 is the same as in FIG. 7, and its illustration is omitted.
 The voiced information extraction unit 10 and the syllable transition determination unit 6e are the same as those described with reference to FIG. 8 for the fifth embodiment and operate in the same manner.
 The eighth embodiment also provides the same effects as the fourth embodiment.
 Further, the same additional effect as in the fifth embodiment is obtained. That is, since the voiced information extraction unit 10 extracts the voiced/unvoiced information Vc obtained in the process of voice decoding in the voice decoding unit 1 and the syllable transition determination unit 6e determines the presence or absence of syllable transitions based on the voiced/unvoiced information Vc, the measurement accuracy of the speech speed can be further improved compared with the fourth embodiment.
 Each of the speech speed conversion devices of the first to eighth embodiments may be configured partly or entirely by processing circuitry.
 For example, the functions of the respective parts of the speech speed conversion device may each be realized by a separate processing circuit, or the functions of a plurality of parts may be realized together by a single processing circuit.
 The processing circuitry may be configured by hardware, or by software, that is, by a programmed computer.
 Some of the functions of the respective parts of the speech speed conversion device may be realized by hardware and the rest by software.
 FIG. 12 shows the hardware configuration of a computer 90 that realizes all the functions of the speech speed conversion device.
 In the illustrated example, the computer 90 has a processor 91 and a memory 92.
 The memory 92 stores programs for realizing the functions of the respective parts of the speech speed conversion device.
 The processor 91 is, for example, a CPU (Central Processing Unit), a microprocessor, a microcontroller, or a DSP (Digital Signal Processor).
 The memory 92 is, for example, a semiconductor memory such as a RAM (Random Access Memory), a ROM (Read Only Memory), a flash memory, an EPROM (Erasable Programmable Read Only Memory), or an EEPROM (Electrically Erasable Programmable Read Only Memory), a magnetic disk, an optical disc, or a magneto-optical disc.
 The processor 91 and the memory 92 may be realized as an LSI (Large Scale Integration) in which they are integrated with each other.
 The processor 91 realizes the functions of the speech speed conversion device by executing the programs stored in the memory 92.
 The programs may be provided through a network, or may be provided recorded on a recording medium, for example a non-transitory recording medium. That is, the programs may be provided, for example, as a program product.
 The computer of FIG. 12 includes a single processor, but may include two or more processors.
 The procedure of processing by the processor 91 when the speech speed conversion device of FIG. 1 is configured by the computer of FIG. 12 will be described with reference to FIG. 13.
 The process of FIG. 13 is started each time one frame of voice code data is received.
 Therefore, when the speech speed conversion device processes voice code data encoded with high efficiency by the CS-ACELP coding method, the process shown in FIG. 13 is started every 10 milliseconds.
 In step ST1, the processor 91 performs voice decoding of the received voice code data and outputs the decoded voice signal. The processing of step ST1 has the same content as the processing in the voice decoding unit 1 of FIG. 1. For example, processing with the same content as the operation of the voice decoding unit described with reference to FIG. 2 is performed.
 In step ST2, the processor 91 determines whether the decoded voice signal contains sound or silence. The processing of step ST2 has the same content as the processing in the sound detection unit 5 of FIG. 1.
 Note that the processing in the sound detection unit 5 includes processing performed at intervals shorter than the frame period, for example comparison with the threshold D56 or D50, calculation of the average value D55a, updating of the noise level value D55, and updating of the threshold D56, but these are assumed to be performed separately.
 In step ST3, the processor 91 generates the frequency information Fa at fixed time intervals based on the information generated in the voice decoding process of step ST1. Specifically, it extracts the LSP coefficients generated in the voice decoding process. For example, 10th-order LSP coefficients are extracted for each frame. The processing of step ST3 has the same content as the processing in the frequency information generation unit 2 of FIG. 1.
 In step ST4, the processor 91 calculates the amount of temporal change of the frequency information Fa generated in step ST3 at fixed time intervals. Specifically, it regards the 10th-order LSP coefficients as one vector and calculates the distance between the latest LSP coefficient vector and the LSP coefficient vector one frame before it as the amount of frequency change. The processing of step ST4 has the same content as the processing in the information change amount calculation unit 3 of FIG. 1.
 In step ST5, the processor 91 refers to the determination result of step ST2; if it has not been determined that sound is present (No in ST5), the process proceeds to step ST12, and if it has been determined that sound is present (Yes in ST5), the process proceeds to step ST6.
 In step ST6, the processor 91 detects the maximum value Mx and the minimum value Mn of the information change amounts Vf calculated in step ST4.
 To do so, it first determines whether a maximum and a minimum exist in the most recent past Na-frame period. Specifically, it determines whether the information change amount Vf at the time n−1, one frame before the current time n, is a maximum; if so, it acquires that value (the maximum value) Mx, identifies the minimum among the information change amounts Vf of the most recent past Na frames, and acquires that value (the minimum value) Mn.
 The processing of step ST6 has the same content as the processing in the extreme value detection unit 4 of FIG. 1.
 In step ST7, the processor 91 determines whether a maximum and a minimum were detected in step ST6.
 If a maximum and a minimum have not been detected, the process proceeds to step ST12.
 If a maximum and a minimum have been detected, the process proceeds to step ST8.
 In step ST8, the processor 91 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected in step ST6.
 For example, it determines that a syllable transition has occurred when the maximum value Mx is larger than the predetermined threshold Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold Th2; otherwise, it determines that no syllable transition has occurred.
 The processing of step ST8 has the same content as the processing in the syllable transition determination unit 6 of FIG. 1.
 In step ST9, the processor 91 determines whether it is the timing for calculating the speech speed. For example, when the speech speed is calculated at each fixed calculation cycle, in step ST9 it is determined whether the time corresponding to the fixed calculation cycle has elapsed since the previous calculation. If it is not the calculation timing, the process proceeds to step ST12. If it is the calculation timing, the process proceeds to step ST10.
 Next, in step ST10, the processor 91 calculates the speech speed Ss.
 For example, the number of syllables per unit time is obtained from the number of syllables during the most recent fixed period of sound at that time point and the length of that period, and this is taken as the speech speed Ss.
 Here, as described above, the time during which the speaker is silent is not included in the averaging time. Therefore, averaging is performed only over the time determined in step ST2 to contain sound.
 The processing of step ST10 has the same content as the processing in the speech speed calculation unit 7 of FIG. 1.
 Next, in step ST11, the processor 91 determines the speech speed conversion rate Rc. The speech speed conversion rate is determined by obtaining the ratio St/Sr of the target speech speed St to the speech speed Sr calculated in step ST10. The processing of step ST11 has the same content as the processing in the conversion rate determination unit 8 of FIG. 1.
In step ST12, the processor 91 converts the speech speed of the audio signal decoded in step ST1, using the speech speed conversion rate obtained in step ST11. The process in step ST12 has the same content as the process in the speech speed conversion unit 9 of FIG. 1.
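The patent does not prescribe a particular time-scale modification algorithm for step ST12. As one hedged illustration, a naive overlap-add (OLA) stretch changes duration while keeping each segment's waveform, and hence its pitch, intact; a deployed system would more likely use a waveform-similarity or pitch-synchronous variant (WSOLA/PSOLA) to avoid artifacts. The `scale` parameter is the duration ratio; with Rc = St/Ss as above, the duration scale would be 1/Rc, though conventions vary:

```python
import numpy as np

def ola_time_stretch(x, scale, frame=320, hop=160):
    """Naive overlap-add time-scale modification (illustration only).

    scale > 1 lengthens the signal (slower speech), scale < 1
    shortens it. `x` is a mono float array.
    """
    if len(x) < frame:
        return x.copy()                      # too short to process
    out_hop = max(1, int(round(hop * scale)))
    n_frames = (len(x) - frame) // hop + 1
    out = np.zeros(frame + out_hop * (n_frames - 1))
    norm = np.zeros_like(out)
    win = np.hanning(frame)
    for i in range(n_frames):
        seg = x[i * hop : i * hop + frame]   # analysis grid: fixed hop
        pos = i * out_hop                    # synthesis grid: scaled hop
        out[pos : pos + frame] += seg * win
        norm[pos : pos + frame] += win
    return out / np.maximum(norm, 1e-8)      # compensate window overlap
```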
As described above, when the result is No in step ST5, ST7, or ST9, the process proceeds to step ST12 without passing through step ST11 and the related steps. In these cases, no new speech speed conversion rate is calculated; the speech speed conversion in step ST12 is then performed using the latest speech speed conversion rate calculated in the past.
Further, as described above, for an audio signal determined to be silent, the speech speed is raised, or part or all of the signal is deleted, so that the processing delay of the speech speed conversion does not keep increasing.
If the delay exceeds a certain value, the voice is output without speech speed conversion even while it contains sound.
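The silence and delay handling above can be summarized as a small per-frame policy; the delay bookkeeping and the 500 ms cap are illustrative assumptions, not values from the patent:

```python
def choose_action(is_sound, delay_ms, rc_latest, max_delay_ms=500):
    """Frame-level policy from the flowchart notes.

    Returns the behaviour for the current frame:
    - passthrough when accumulated delay is too large,
    - delete/speed up silence to drain the delay,
    - otherwise apply the latest conversion rate.
    """
    if delay_ms > max_delay_ms:
        return "passthrough"         # output as-is even during sound
    if not is_sound:
        return "drop_or_speed_up"    # shrink silence to reduce delay
    return ("convert", rc_latest)    # normal speech speed conversion
```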
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 5 is configured by the computer of FIG. 12 will be described with reference to FIG. 14.
The processing procedure of FIG. 14 is generally the same as that of FIG. 13, except that steps ST3 and ST4 are replaced by steps ST13 and ST4b, and the process of step ST14 is performed after step ST13.
In step ST13, the processor 91 extracts the LPC coefficients generated by the voice decoding process in step ST1.
When voice decoding by the CS-ACELP coding method is performed in step ST1, there are LPC coefficients converted from the LSP coefficients of each frame and LPC coefficients converted from LSP coefficients generated by interpolation; the LPC coefficients converted from the LSP coefficients of each frame are extracted.
The process of step ST13 has the same content as the process in the LPC coefficient extraction unit 22 of FIG. 5.
In step ST14, the processor 91 converts the LPC coefficients extracted in step ST13 into an LPC mel cepstrum. For example, a 10th-order LPC mel cepstrum is generated by the conversion. The LPC mel cepstrum is used as the frequency information Fa.
The process of step ST14 has the same content as the process in the mel cepstrum calculation unit 23 of FIG. 5.
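One common route for this conversion, shown purely as an assumption since the patent does not give the formula, is the classic LPC-to-cepstrum recursion followed by an SPTK-style all-pass frequency warping; sign conventions for the LPC polynomial and the warping constant alpha vary, so treat both as parameters (alpha near 0.31 is often quoted for 8 kHz speech):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """LPC -> cepstrum recursion, assuming A(z) = 1 + sum a[k] z^-k.
    a: coefficients a_1..a_p; c[0] (the gain term) is left at zero."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc -= (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c

def freqt(c, order, alpha=0.31):
    """All-pass frequency warping of a cepstrum to a mel cepstrum
    (SPTK-style recursion)."""
    g = np.zeros(order + 1)
    for i in range(len(c) - 1, -1, -1):   # feed coefficients in reverse
        d = g.copy()
        g[0] = c[i] + alpha * d[0]
        if order >= 1:
            g[1] = (1 - alpha**2) * d[0] + alpha * d[1]
        for j in range(2, order + 1):
            g[j] = d[j - 1] + alpha * (d[j] - g[j - 1])
    return g
```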
The process of step ST4b in FIG. 14 is the same as the process of step ST4 in FIG. 13.
However, the frequency information Fa used for calculating the information change amount Vf is different.
That is, in step ST4b, the processor 91 regards the 10th-order LPC mel cepstrum as one vector, and calculates the distance between the latest LPC mel cepstrum vector and the LPC mel cepstrum vector one frame before it as the information change amount Vf.
The process of step ST4b has the same content as the process in the information change amount calculation unit 3b of FIG. 5.
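The passage does not fix the distance measure; Euclidean distance is a natural reading and is what this sketch uses:

```python
import numpy as np

def info_change(mcep_now, mcep_prev):
    """Step ST4b: Vf as the distance between consecutive 10th-order
    LPC mel cepstrum vectors (Euclidean distance assumed)."""
    return float(np.linalg.norm(np.asarray(mcep_now) - np.asarray(mcep_prev)))
```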
Even when the speech speed conversion device of FIG. 5 is configured by the computer of FIG. 12, the same effects as described in the second embodiment can be obtained.
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 6 is configured by the computer of FIG. 12 will be described with reference to FIG. 15.
The processing procedure of FIG. 15 is generally the same as that of FIG. 13, except that the process of step ST15 is performed after step ST3, the process of step ST16 is performed before step ST4, and step ST6 is replaced by step ST6c.
In step ST15, the processor 91 determines whether it is the extraction timing in the thinning process. For example, suppose the thinning rate is M and extraction is performed only once every M frames. In that case, it is determined whether M frames have elapsed since the previous extraction. For the first extraction after the processor 91 starts operating, it is judged to be the extraction timing even if M frames have not elapsed since a previous extraction.
If it is not the extraction timing in step ST15 (No in ST15), the process proceeds to step ST12. If it is the extraction timing (Yes in ST15), the process proceeds to step ST16.
In step ST16, the processor 91 extracts the frequency information generated in step ST3. The processes of steps ST15 and ST16 have the same content as the process in the thinning unit 24 of FIG. 6.
After step ST16, the process proceeds to step ST4.
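A minimal frame-counter sketch of this decimation, with the counter handling assumed; the first frame always passes, matching the note about the first extraction:

```python
class Thinner:
    """Pass one frame of frequency information through every M frames
    (steps ST15/ST16)."""
    def __init__(self, m):
        self.m = m
        self.count = 0

    def take(self, frame_info):
        hit = (self.count % self.m == 0)   # extraction timing?
        self.count += 1
        return frame_info if hit else None
```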
The processes of steps ST4 and ST5 of FIG. 15 are the same as those of FIG. 13, except that they are performed only once every M frames.
In step ST6c of FIG. 15, the processor 91 detects the maximum value Mx and the minimum value Mn from the information change amounts Vf calculated in step ST4 for the most recent Nb frames.
Nb is smaller than Na in the description of step ST6 in FIG. 13. This is because, in step ST4 of FIG. 15, the information change amount Vf is calculated not every frame but once every M frames. For example, Nb is set to a value equal to Na/M.
The process of step ST6c has the same content as the process in the extreme value detection unit 4c of FIG. 6.
Even when the speech speed conversion device of FIG. 6 is configured by the computer of FIG. 12, the same effects as described in the third embodiment can be obtained.
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 7 is configured by the computer of FIG. 12 will be described with reference to FIG. 16.
The processing procedure of FIG. 16 is generally the same as that of FIG. 14, except that the process of step ST15 is performed after step ST3, the process of step ST17 is performed before step ST14, and step ST6 is replaced by step ST6c.
In step ST15, the processor 91 determines whether it is the extraction timing in the thinning process.
If it is not the extraction timing (No in ST15), the process proceeds to step ST12. If it is the extraction timing (Yes in ST15), the process proceeds to step ST17.
In step ST17, the processor 91 extracts the LPC coefficients obtained in step ST13. The processes of steps ST15 and ST17 have the same content as the process in the thinning unit 24d of FIG. 7.
After step ST17, the process proceeds to step ST14.
The processes of steps ST14, ST4, and ST5 of FIG. 16 are the same as those of FIG. 14, except that they are performed only once every M frames.
In step ST6c of FIG. 16, the processor 91 detects the maximum value Mx and the minimum value Mn from the information change amounts Vf calculated in step ST4 for the most recent Nb frames.
Nb is the same as the Nb in the description of step ST6c in FIG. 15, and is smaller than Na in the description of step ST6 in FIG. 13.
The process of step ST6c has the same content as the process in the extreme value detection unit 4c of FIG. 7.
Even when the speech speed conversion device of FIG. 7 is configured by the computer of FIG. 12, the same effects as described in the fourth embodiment can be obtained.
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 8 is configured by the computer of FIG. 12 will be described with reference to FIG. 17.
The processing procedure shown in FIG. 17 is generally the same as that shown in FIG. 13, except that the process of step ST18 is performed after step ST2, and the process of step ST8e is performed instead of step ST8.
In step ST18, the processor 91 determines whether the received voice is voiced or unvoiced from information obtained in the course of the voice decoding process in step ST1.
When voice decoding by the CS-ACELP coding method is performed in step ST1, the voiced/unvoiced determination is made based on the gain coefficient used to control the degree of emphasis of the pitch component in the long-term postfilter processing.
The process of step ST18 has the same content as the process in the voiced information extraction unit 10 of FIG. 8.
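A hedged sketch of such a decision is a single threshold on the long-term postfilter's pitch-emphasis gain; the threshold value is an assumption, taken neither from the patent nor from the codec specification:

```python
def is_voiced(pitch_gain, threshold=0.3):
    """Step ST18: treat a frame as voiced when the pitch-emphasis gain
    from the long-term postfilter exceeds a tunable threshold."""
    return pitch_gain > threshold
```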
In step ST8e, the processor 91 determines the presence or absence of a syllable transition based on the maximum value Mx and the minimum value Mn detected in step ST6, as well as the result of the voiced/unvoiced determination in step ST18.
For example, when the maximum value Mx is larger than the predetermined threshold Th1 and the difference between the maximum value Mx and the minimum value Mn is larger than the predetermined threshold Th2, or when the result of the determination in step ST18 changes from "unvoiced" to "voiced", it is determined that a syllable transition has occurred.
If neither of these applies, it is determined that no syllable transition has occurred.
The process of step ST8e has the same content as the process in the syllable transition determination unit 6e of FIG. 8.
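Combining the two criteria is a logical OR; this sketch reuses the hypothetical helpers above:

```python
def syllable_transition_e(mx, mn, th1, th2, voiced_now, voiced_prev):
    """Step ST8e: spectral peak test OR an unvoiced-to-voiced change."""
    spectral = mx > th1 and (mx - mn) > th2
    onset = voiced_now and not voiced_prev   # "unvoiced" -> "voiced"
    return spectral or onset
```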
Instead of determining that a syllable transition has occurred when the result of the determination in step ST18 changes from "unvoiced" to "voiced", it may be determined that a syllable transition has occurred when the result changes from "voiced" to "unvoiced".
Further, in the processing procedure of FIG. 17, the process proceeds to step ST8e only when it is determined in step ST7 that the maximum value Mx and the minimum value Mn have been detected; however, even when they are not detected in step ST7, whether a syllable transition has occurred may be determined based on the result of the voiced/unvoiced determination in step ST18.
Even when the speech speed conversion device of FIG. 8 is configured by the computer of FIG. 12, the same effects as described in the fifth embodiment can be obtained.
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 9 is configured by the computer of FIG. 12 will be described with reference to FIG. 18.
The processing procedure shown in FIG. 18 is generally the same as that shown in FIG. 14, except that the process of step ST18 is performed after step ST2, and the process of step ST8e is performed instead of step ST8.
The processes of steps ST18 and ST8e are the same as those described with reference to FIG. 17.
Even when the speech speed conversion device of FIG. 9 is configured by the computer of FIG. 12, the same effects as described in the sixth embodiment can be obtained.
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 10 is configured by the computer of FIG. 12 will be described with reference to FIG. 19.
The processing procedure shown in FIG. 19 is generally the same as that shown in FIG. 15, except that the process of step ST18 is performed after step ST2, and the process of step ST8e is performed instead of step ST8.
The processes of steps ST18 and ST8e are the same as those described with reference to FIG. 17.
Even when the speech speed conversion device of FIG. 10 is configured by the computer of FIG. 12, the same effects as described in the seventh embodiment can be obtained.
Next, the procedure of processing by the processor 91 when the speech speed conversion device of FIG. 11 is configured by the computer of FIG. 12 will be described with reference to FIG. 20.
The processing procedure shown in FIG. 20 is generally the same as that shown in FIG. 16, except that the process of step ST18 is performed after step ST2, and the process of step ST8e is performed instead of step ST8.
The processes of steps ST18 and ST8e are the same as those described with reference to FIG. 17.
Even when the speech speed conversion device of FIG. 11 is configured by the computer of FIG. 12, the same effects as described in the eighth embodiment can be obtained.
Various modifications are possible to the above embodiments.
For example, the modifications described with respect to the first embodiment are also applicable to the second to eighth embodiments. The modifications described with respect to the fifth embodiment are also applicable to the sixth to eighth embodiments.
Further, the modifications described with respect to the processing procedure of FIG. 17 are also applicable to the processing procedures of FIGS. 18 to 20.
In the above embodiments, the case of lowering the speech speed to make the voice easier to hear has been described, but the above configurations are also applicable to the case of raising the speech speed. For example, when the ratio St/Ss calculated by the conversion rate determination unit 8 is larger than 1, the speech speed conversion may be performed using that St/Ss as the speech speed conversion rate Rc.
Although the speech speed conversion device has been described above, a speech speed conversion method can also be implemented using the speech speed conversion device, and the processing in the speech speed conversion device or the speech speed conversion method can also be executed by a computer under a program.
1 voice decoding unit, 2 frequency information generation unit, 3 information change amount calculation unit, 4 extreme value detection unit, 5 sound detection unit, 6 syllable transition determination unit, 7 speech speed calculation unit, 8 conversion rate determination unit, 9 speech speed conversion unit, 10 voiced information extraction unit, 21 LSP coefficient extraction unit, 22 LPC coefficient extraction unit, 23 mel cepstrum calculation unit, 24, 24d thinning unit, 51 low level detection unit, 52 high level detection unit, 53 logical OR operation unit, 54 hangover addition unit, 55 noise level calculation unit, 56 threshold setting unit, 90 computer, 91 processor, 92 memory, 101 adaptive codebook vector decoding unit, 102 gain decoding unit, 103 fixed codebook vector decoding unit, 104 adaptive prefilter unit, 105 prediction gain calculation unit, 106 excitation signal generation unit, 107 LSP coefficient decoding unit, 108 interpolation unit, 109 LPC coefficient conversion unit, 110 synthesis filter unit, 111 postfilter unit, 511, 521 comparison unit, 513, 523 determination unit.

Claims (10)

  1.  A speech speed conversion device that converts a speech speed in a voice communication device, the speech speed conversion device comprising:
     a voice decoding unit that decodes highly efficiently encoded voice code data and outputs a voice signal;
     a frequency information generation unit that generates frequency information from information obtained in the process of decoding the voice code data in the voice decoding unit;
     an information change amount calculation unit that obtains, as an information change amount, an amount of time change of the generated frequency information at fixed time intervals;
     a sound detection unit that determines, based on the voice signal, whether received voice represented by the voice code data contains sound or is silent;
     a syllable transition determination unit that determines that a syllable of the received voice has transitioned when the information change amount, while the sound detection unit determines that the received voice contains sound, satisfies a predetermined condition;
     a speech speed calculation unit that calculates a speech speed based on a determination result of the syllable transition determination unit;
     a conversion rate determination unit that determines a conversion rate based on the speech speed calculated by the speech speed calculation unit; and
     a speech speed conversion unit that converts the speech speed of the voice signal at the conversion rate determined by the conversion rate determination unit.
  2.  The speech speed conversion device according to claim 1, wherein the predetermined condition is that
     the information change amount has a maximum value and a minimum value within a fixed period,
     the maximum value is larger than a predetermined first threshold, and
     the difference between the maximum value and the minimum value is larger than a second threshold.
  3.  The speech speed conversion device according to claim 1 or 2, further comprising a voiced information extraction unit that extracts, from information obtained in the process of decoding the voice code data in the voice decoding unit, information indicating whether the received voice is a voiced sound or an unvoiced sound,
     wherein the syllable transition determination unit, based on the information extracted by the voiced information extraction unit, also determines that a syllable of the received voice has transitioned when a change from an unvoiced state to a voiced state is detected, or alternatively also determines that a syllable of the received voice has transitioned when a change from a voiced state to an unvoiced state is detected.
  4.  The speech speed conversion device according to any one of claims 1 to 3, wherein
     the frequency information generation unit extracts LSP coefficients in the voice decoding unit at fixed time intervals, and
     the information change amount calculation unit calculates, as the information change amount, the distance between an LSP coefficient vector composed of the latest extracted LSP coefficients and an LSP coefficient vector composed of previously extracted LSP coefficients.
  5.  The speech speed conversion device according to claim 4, wherein
     the frequency information generation unit thins out the LSP coefficients, and
     the information change amount calculation unit calculates the information change amount based on LSP coefficient vectors composed of the thinned-out LSP coefficients.
  6.  The speech speed conversion device according to any one of claims 1 to 3, wherein
     the frequency information generation unit extracts LPC coefficients in the voice decoding unit at fixed time intervals and converts the extracted LPC coefficients into an LPC cepstrum or an LPC mel cepstrum, and
     the information change amount calculation unit calculates, as the information change amount, the distance between an LPC cepstrum vector composed of the latest converted LPC cepstrum and an LPC cepstrum vector composed of a previously converted LPC cepstrum, or the distance between an LPC mel cepstrum vector composed of the latest converted LPC mel cepstrum and an LPC mel cepstrum vector composed of a previously converted LPC mel cepstrum.
  7.  The speech speed conversion device according to claim 6, wherein the frequency information generation unit thins out the LPC coefficients and converts the thinned-out LPC coefficients into the LPC cepstrum or the LPC mel cepstrum.
  8.  A speech speed conversion method for converting a speech speed in a voice communication device, the method comprising:
     decoding highly efficiently encoded voice code data and outputting a voice signal;
     generating frequency information from information obtained in the process of decoding the voice code data;
     obtaining, as an information change amount, an amount of time change of the generated frequency information;
     determining, based on the voice signal, whether received voice represented by the voice code data contains sound or is silent;
     determining that a syllable of the received voice has transitioned when the information change amount, while the received voice is determined to contain sound, satisfies a predetermined condition;
     calculating a speech speed based on the determination result of the syllable transition;
     determining a conversion rate based on the calculated speech speed; and
     converting the speech speed of the voice signal at the determined conversion rate.
  9.  A program for causing a computer to execute the processing in the speech speed conversion method according to claim 8.
  10.  A computer-readable recording medium on which the program according to claim 9 is recorded.
PCT/JP2020/006780 2020-02-20 2020-02-20 Speaking speed conversion device, speaking speed conversion method, program, and storage medium WO2021166158A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
JP2021570271A JP7019117B2 (en) 2020-02-20 2020-02-20 Speech speed converter, speech velocity conversion method, program and recording medium
PCT/JP2020/006780 WO2021166158A1 (en) 2020-02-20 2020-02-20 Speaking speed conversion device, speaking speed conversion method, program, and storage medium
TW109129092A TW202133149A (en) 2020-02-20 2020-08-26 Speaking speed conversion device, speaking speed conversion method, program, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/006780 WO2021166158A1 (en) 2020-02-20 2020-02-20 Speaking speed conversion device, speaking speed conversion method, program, and storage medium

Publications (1)

Publication Number Publication Date
WO2021166158A1 true WO2021166158A1 (en) 2021-08-26

Family

ID=77392239

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/006780 WO2021166158A1 (en) 2020-02-20 2020-02-20 Speaking speed conversion device, speaking speed conversion method, program, and storage medium

Country Status (3)

Country Link
JP (1) JP7019117B2 (en)
TW (1) TW202133149A (en)
WO (1) WO2021166158A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7065485B1 (en) * 2002-01-09 2006-06-20 At&T Corp Enhancing speech intelligibility using variable-rate time-scale modification
JP2010026323A (en) * 2008-07-22 2010-02-04 Panasonic Electric Works Co Ltd Speech speed detection device
JP2014167525A (en) * 2013-02-28 2014-09-11 Mitsubishi Electric Corp Audio decoding device
JP2018180482A (en) * 2017-04-21 2018-11-15 富士通株式会社 Speech detection apparatus and speech detection program

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YOSHIHIRO ADACHI, KENNOBU MAEJIMA, TATSUO YOTSUKURA, SHIGEO MORISHIMA: "Construction of the Intonation Conversion System by Audio Parameter Conversion", IEICE TECHNICAL REPORT; HCS, vol. 102, no. 734 (HCS2002-47), 17 April 2003 (2003-04-17), JP, pages 1 - 6, XP009530600 *

Also Published As

Publication number Publication date
JP7019117B2 (en) 2022-02-14
JPWO2021166158A1 (en) 2021-08-26
TW202133149A (en) 2021-09-01


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20920039; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021570271; Country of ref document: JP; Kind code of ref document: A)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20920039; Country of ref document: EP; Kind code of ref document: A1)