JP2005275010A

JP2005275010A - Voice extension device, voice extension method and program

Info

Publication number: JP2005275010A
Application number: JP2004088533A
Authority: JP
Inventors: Hiroyasu Ide; 博康井手
Original assignee: Casio Computer Co Ltd
Current assignee: Casio Computer Co Ltd
Priority date: 2004-03-25
Filing date: 2004-03-25
Publication date: 2005-10-06

Abstract

PROBLEM TO BE SOLVED: To reduce the deterioration of a voice waveform caused by voice extension processing. SOLUTION: A voice processor 100 extends a voice signal by a magnification specified via an input device 12. A control part 110 extends the voice signal frame by frame. The control part 110 first calculates the (auto)correlation between a voice frame of interest and the voice frames immediately before and after it. The calculated correlation coefficients are compared, and whichever of the preceding and following voice frames has the larger correlation coefficient is selected. A new voice frame is generated by weighted addition of the selected voice frame and the voice frame of interest using predetermined weighting coefficients. The generated voice frame is inserted between the voice frame of interest and the selected voice frame. COPYRIGHT: (C)2006, JPO&NCIPI

Description

The present invention relates to an audio expansion device, an audio expansion method, and a program that expand an audio signal by processing it in the time domain.

One way of transforming an audio signal is to extend its length (number of samples) without changing its amplitude or frequency characteristics. This processing is useful, for example, for slowly replaying a hard-to-hear portion of English conversation teaching material. One such method is TDHS (Time Domain Harmonic Scaling) (see, for example, Patent Document 1).

In the TDHS method, when the audio signal to be processed is expanded by a factor of m/n (m and n are natural numbers), the waveform of a section of length mT starting at the current processing position (T is the length of one unit of a waveform section) and the waveform of a section of length mT starting (n-m)T from the current position are combined by weighted addition, and the resulting waveform of length mT replaces the portion of length nT starting at the current position.

In this case, of the two waveform sections subject to weighted addition, the waveform of the section on the past side in time is multiplied by the weight W(k), and the waveform of the section on the future side is multiplied by the weight 1-W(k). Here, W(k) changes from 0 to 1 from the first sample position of the waveform section to the last. Using the weighting coefficients W(k) and 1-W(k) expands the waveform while preserving its continuity.
Patent Document 1: JP-A-8-146993 (pages 3-5, FIGS. 12-15)

If the audio signal contains only a stationary waveform, that waveform repeats with a certain period T, so the expansion processing merely increases the number of repetitions of the original stationary waveform. The waveform of the expanded audio signal therefore does not deteriorate. In portions that do not contain a stationary waveform, on the other hand, the waveform generated by weighted addition is degraded relative to the audio signal being processed, and noise occurs during reproduction. In the example of slowly replaying a hard-to-hear portion of English conversation teaching material, noise occurs during the transition from one vowel to another.

The present invention has been made in view of the above problem, and its object is to provide an audio expansion device, an audio expansion method, and a program that reduce the deterioration of an audio waveform caused by audio expansion.

An audio expansion device according to a first aspect of the present invention is
an audio expansion device that, for a plurality of waveform sections in an audio waveform, generates a waveform for each waveform section based on the waveform of that section and the waveform of a neighboring waveform section, and outputs an expanded waveform based on the generated waveforms, the device comprising:
determining means for determining, for each waveform section, which of the waveform sections immediately before and immediately after it has the waveform more similar to the waveform of that section;
generating means for generating a waveform by a predetermined weighted addition of the waveform of a section including the waveform section and the waveform of a section including the waveform section determined by the determining means to have the more similar waveform;
waveform connecting means for inserting the waveform generated by the generating means between the waveform section and the waveform section determined to have the more similar waveform; and
means for outputting the expanded waveform,
wherein the section including the waveform section and the section including the waveform section determined to have the more similar waveform adjoin each other at the boundary between the waveform section and the waveform section determined to have the more similar waveform.

According to the present invention, of the waveform sections immediately before and after a waveform section, the one whose waveform is more similar to the waveform of that section is determined, and a weighted addition of the waveform of the section determined to be more similar and the waveform of the waveform section is obtained. Deterioration of the generated waveform can therefore be reduced.

In the above audio expansion device,
the determining means preferably
obtains correlation coefficients between the waveform section and the waveform sections immediately before and after it, and,
based on the obtained correlation coefficients, determines which of the waveform sections immediately before and after the waveform section has the waveform more similar to that of the waveform section.

In the above audio expansion device,
the generating means, for example,
determines the temporal order of the waveform section and the waveform section determined to have the more similar waveform, and
performs the weighted addition using a weighting coefficient that starts at 1 and ends at 0 for the temporally later (future) waveform and a weighting coefficient that starts at 0 and ends at 1 for the temporally earlier (past) waveform.

The above audio expansion device
may further comprise audio dividing means for dividing the audio waveform to be processed into the plurality of waveform sections.

The above audio expansion device
may further comprise receiving means for receiving a designation of the portion of the audio waveform to be expanded.
In this case, the audio expansion device expands only the portion designated via the receiving means and does not expand the other portions.

An audio expansion method according to a second aspect of the present invention is
an audio expansion method that, for a plurality of waveform sections in an audio waveform, generates a waveform for each waveform section based on the waveform of that section and the waveform of a neighboring waveform section, and outputs an expanded waveform based on the generated waveforms, the method comprising:
a determining step of determining, for each waveform section, which of the waveform sections immediately before and immediately after it has the waveform more similar to the waveform of that section;
a generating step of generating a waveform by a predetermined weighted addition of the waveform of a section including the waveform section and the waveform of a section including the waveform section determined in the determining step to have the more similar waveform;
a waveform connecting step of inserting the waveform generated in the generating step between the waveform section and the waveform section determined to have the more similar waveform; and
a step of outputting the expanded waveform,
wherein the section including the waveform section and the section including the waveform section determined to have the more similar waveform adjoin each other at the boundary between the waveform section and the waveform section determined to have the more similar waveform.

A program according to a third aspect of the present invention
causes a computer to function as an audio expansion device that:
determines, for each waveform section in an audio waveform, which of the waveform sections immediately before and immediately after it has the waveform more similar to the waveform of that section;
generates a waveform by a predetermined weighted addition of the waveform of a section including the waveform section and the waveform of a section that includes the waveform section determined to have the more similar waveform and that adjoins the former section at the boundary between the waveform section and the waveform section determined to have the more similar waveform;
inserts the generated waveform between the waveform section and the waveform section determined to have the more similar waveform; and
outputs the expanded waveform.

According to the present invention, when an audio waveform is expanded, deterioration of the expanded audio waveform is reduced.

An embodiment of the present invention is described below with reference to the drawings. FIG. 1 is a block diagram showing the configuration of an audio processing device according to the embodiment of the present invention.

As shown in FIG. 1, the audio processing device 100 is, for example, an information processing device such as a computer. An input device 12, an output device 13, and a recording medium 17 are connected to the audio processing device 100. On receiving an instruction from the input device 12, the audio processing device 100 expands the audio waveform data read from the recording medium 17 to the specified multiple of its length and outputs the result to the recording medium 17.

Here, the audio waveform data is sample value data in which analog audio has been quantized at a predetermined sampling frequency (for example, 8 kHz).

The recording medium 17 is, for example, a CD-RW (Compact Disk ReWritable) disk, and stores the audio waveform data.

The audio processing device 100 includes a control unit 110, an input control unit 120, an output control unit 130, a program storage unit 140, a storage unit 150, and a data recording unit 170.

The control unit 110 includes, for example, a CPU (Central Processing Unit) and a RAM (Random Access Memory). Based on an operation program stored in advance in the program storage unit 140, it controls each unit of the audio processing device 100, reads the audio waveform data stored on the recording medium 17 via the data recording unit 170, writes the expanded audio waveform data to the recording medium 17, and executes the waveform expansion processing described later.

The control unit 110 performs the waveform expansion processing on the audio waveform data temporarily stored in the storage unit 150 and stores the expanded audio waveform data in the storage unit 150. In the waveform expansion processing, the control unit 110 divides the audio waveform data into a number of parts (hereinafter referred to as audio frames) in units of repetition and, for each part, generates an audio frame based on that part and one of the parts before and after it, and inserts the generated frame so that the result has the specified multiple of the original length.

When generating an audio frame, the control unit 110 determines which of the preceding and following audio frames has the higher correlation with the audio frame of interest. A high correlation means that the two audio frames are similar. The more similar the audio frames from which a new audio frame is generated, the more the deterioration of the resulting expanded waveform can be suppressed.

Here, let the sample value sequence of the audio frame of interest be {x_k, x_{k+1}, ..., x_{k+N-1}}, that of the preceding audio frame be {x_{k-N}, x_{k-N+1}, ..., x_{k-1}}, and that of the following audio frame be {x_{k+N}, x_{k+N+1}, ..., x_{k+2N-1}}. The correlation coefficient c_a between the audio frame of interest and the audio frame of the preceding section is then obtained using Equation 1, and the correlation coefficient c_b between the audio frame of interest and the audio frame of the following section is obtained using Equation 2. The control unit 110 determines that the audio frame corresponding to the larger of c_a and c_b has a higher correlation than the other audio frame.

Equations 1 and 2 take the cross-correlation of two audio frames, but these audio frames were originally taken from the same audio waveform data. In effect, therefore, Equations 1 and 2 take the autocorrelation of the audio waveform data.
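The comparison of c_a and c_b can be sketched as follows. Equations 1 and 2 are not reproduced in this text, so the `correlation` function below is an assumption: an inner product normalized by the neighboring frame's energy, chosen because it is consistent with the lower bound on c_a stated later in the description. The function names and the treatment of a silent neighbor are likewise illustrative.

```python
import math

def correlation(frame, neighbor):
    """Assumed stand-in for Equations 1 and 2: inner product of the two
    frames, normalized by the square root of the neighbor's energy."""
    energy = sum(v * v for v in neighbor)
    if energy == 0.0:
        return 0.0  # assumption: a silent neighbor counts as uncorrelated
    return sum(a * b for a, b in zip(frame, neighbor)) / math.sqrt(energy)

def more_similar_side(samples, k, n):
    """Return 'past' or 'future' for the frame samples[k:k+n]."""
    cur = samples[k:k + n]
    c_a = correlation(cur, samples[k - n:k])          # preceding frame
    c_b = correlation(cur, samples[k + n:k + 2 * n])  # following frame
    # A tie goes to the future side, matching the c_a <= c_b case in the text.
    return "past" if c_a > c_b else "future"

# The middle frame repeats the first frame and is the negated reverse of the last.
samples = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0, -4.0, -3.0, -2.0, -1.0]
side = more_similar_side(samples, 4, 4)
```

In this example the frame of interest is identical to the preceding frame, so the past side is selected.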

The input control unit 120 is connected to the input device 12, such as a keyboard or a pointing device, receives instructions for the control unit 110 entered from the input device 12, and passes them to the control unit 110.

The output control unit 130 is connected to the output device 13, such as a display or a speaker, and outputs the processing results of the control unit 110 to the output device 13 as necessary.

The program storage unit 140 is composed of a ROM (Read Only Memory) or the like, and stores the program executed by the control unit 110.

The storage unit 150 is composed of a storage device such as a hard disk device or a RAM (Random Access Memory), and temporarily stores the audio waveform data sent from the data recording unit 170 and the audio waveform data after the waveform expansion processing. The storage unit 150 sends the temporarily stored audio waveform data to the data recording unit 170 or the control unit 110.

The data recording unit 170 is, for example, a CD-RW drive, and reads the audio waveform data stored on the recording medium 17 in accordance with instructions from the control unit 110. It also writes the expanded audio waveform data to the recording medium 17.

The waveform expansion processing is described below with reference to the drawings. FIG. 2 is a flowchart of the waveform expansion processing. The description takes as an example the case where the input device 12 has instructed that the audio waveform data be expanded to twice its length. The control unit 110 therefore outputs audio waveform data with twice the number of samples.

First, the control unit 110 divides the audio waveform data into audio frames of N samples each (step S101). The first audio frame is then taken as the audio frame of interest.

Next, the control unit 110 calculates, using Equation 1, the correlation coefficient c_a between the audio frame of interest (with sample value sequence {x_k, x_{k+1}, ..., x_{k+N-1}}) and the audio frame of the preceding section (with sample value sequence {x_{k-N}, x_{k-N+1}, ..., x_{k-1}}), and calculates, using Equation 2, the correlation coefficient c_b between the audio frame of interest and the audio frame of the following section (with sample value sequence {x_{k+N}, x_{k+N+1}, ..., x_{k+2N-1}}) (FIG. 2: step S102).

The control unit 110 then compares c_a and c_b calculated in step S102 and determines which audio frame has the higher correlation with the audio frame of interest (step S103).

When the correlation of the past-side audio frame is higher than that of the future-side audio frame (step S103: past side (before)), the control unit 110 generates an audio frame according to Equation 3 below (step S104).
(Equation 3)
s_i = (i/(N-1)) × x_{k-N+i} + ((N-1-i)/(N-1)) × x_{k+i}
(i = 0, ..., N-1)

Equation 3 expresses a weighted addition of the sample values of the past-side audio frame and the audio frame of the section of interest. The weighting coefficient of the past-side audio frame, i/(N-1), starts at 0 and ends at 1. The weighting coefficient of the audio frame of the section of interest, (N-1-i)/(N-1), starts at 1 and ends at 0.

Next, the control unit 110 connects the generated audio frame between the preceding audio frame and the audio frame of interest (step S105), and proceeds to step S108.

The sequence of sample values of the audio waveform obtained by steps S104 and S105 is therefore {..., x_{k-1}, s_0, s_1, ..., s_{N-1}, x_k, x_{k+1}, ..., x_{k+N-1}, ...}.
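As a minimal sketch of steps S104 and S105, the following computes Equation 3 for a toy signal and splices the result in front of the frame of interest. The function name and the sample values are illustrative assumptions; N must be at least 2 for the weights to be defined.

```python
def blend_with_past(samples, k, n):
    """Equation 3: s_i = (i/(N-1)) * x[k-N+i] + ((N-1-i)/(N-1)) * x[k+i]."""
    return [
        (i / (n - 1)) * samples[k - n + i] + ((n - 1 - i) / (n - 1)) * samples[k + i]
        for i in range(n)
    ]

# Preceding frame followed by the frame of interest (N = 4, k = 4).
x = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0, 12.0, 13.0]
n, k = 4, 4
s = blend_with_past(x, k, n)

# Step S105: {..., x_{k-1}, s_0, ..., s_{N-1}, x_k, ...}
stretched = x[:k] + s + x[k:]
```

Because the weights run from 0 to 1 on the past frame, s_0 equals x_k and s_{N-1} equals x_{k-1}, so each junction in the spliced sequence reproduces a sample-to-sample step that already existed in the original waveform.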

On the other hand, when the correlation of the future-side audio frame is higher than that of the past-side audio frame (FIG. 2: step S103: future side (after)), the control unit 110 generates an audio frame according to Equation 4 below (step S106).
(Equation 4)
s_i = (i/(N-1)) × x_{k+i} + ((N-1-i)/(N-1)) × x_{k+N+i}
(i = 0, ..., N-1)

Next, the control unit 110 connects the generated audio frame between the audio frame of interest and the following audio frame (step S107).

The sequence of sample values of the audio waveform obtained by steps S106 and S107 is therefore {..., x_k, x_{k+1}, ..., x_{k+N-1}, s_0, s_1, ..., s_{N-1}, x_{k+N}, ...}.
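The future-side case (steps S106 and S107) can be sketched the same way. Here the frame of interest plays the role of the past waveform and the following frame the role of the future waveform. The function name and the data are illustrative assumptions.

```python
def blend_with_future(samples, k, n):
    """Equation 4: s_i = (i/(N-1)) * x[k+i] + ((N-1-i)/(N-1)) * x[k+N+i]."""
    return [
        (i / (n - 1)) * samples[k + i] + ((n - 1 - i) / (n - 1)) * samples[k + n + i]
        for i in range(n)
    ]

# Frame of interest followed by the next frame (N = 4, k = 0).
x = [10.0, 11.0, 12.0, 13.0, 20.0, 21.0, 22.0, 23.0]
n, k = 4, 0
s = blend_with_future(x, k, n)

# Step S107: {..., x_{k+N-1}, s_0, ..., s_{N-1}, x_{k+N}, ...}
stretched = x[:k + n] + s + x[k + n:]
```

Here s_0 equals x_{k+N} and s_{N-1} equals x_{k+N-1}, mirroring the past-side case.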

Finally, the control unit 110 determines whether any audio frame has not yet undergone the waveform expansion processing (step S108). If it determines that the waveform has been expanded for all audio frames (step S108: YES), the waveform expansion processing ends. If it determines that there is an audio frame that has not yet undergone the waveform expansion processing (step S108: NO), it changes the audio frame of interest to the next audio frame and returns to step S102.
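Putting steps S101 to S108 together, a doubling pass might look like the sketch below. Several details are assumptions rather than part of the description: the `correlation` form (Equations 1 and 2 are not reproduced in this text), copying leftover samples that do not fill a whole frame, and returning the input unchanged when fewer than two frames exist. The frame length N must be at least 2.

```python
import math

def correlation(cur, neighbor):
    # Assumed form: inner product normalized by the neighbor's energy.
    energy = sum(v * v for v in neighbor)
    if energy == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(cur, neighbor)) / math.sqrt(energy)

def crossfade(past, future):
    # Past frame weighted 0 -> 1, future frame weighted 1 -> 0 (Equations 3 and 4).
    n = len(past)
    return [(i / (n - 1)) * past[i] + ((n - 1 - i) / (n - 1)) * future[i]
            for i in range(n)]

def stretch_2x(samples, n):
    count = len(samples) // n
    frames = [samples[j * n:(j + 1) * n] for j in range(count)]  # step S101
    if count < 2:
        return list(samples)  # assumption: nothing to blend with
    out = []
    for j, cur in enumerate(frames):                              # loop closed by S108
        # Steps S102-S103: the first frame has no past, the last has no future.
        use_past = j == count - 1 or (
            j > 0 and correlation(cur, frames[j - 1]) > correlation(cur, frames[j + 1])
        )
        if use_past:
            out += crossfade(frames[j - 1], cur) + cur            # steps S104-S105
        else:
            out += cur + crossfade(cur, frames[j + 1])            # steps S106-S107
    return out + list(samples[count * n:])  # assumption: leftover samples pass through

period = [0.0, 1.0, 0.0, -1.0]
doubled = stretch_2x(period * 3, 4)
```

For this stationary input every generated frame is a blend of two identical frames, so the output is simply the same period repeated twice as many times, which is the behavior the background section describes for stationary waveforms.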

When the audio frame of interest is the first or last audio frame, only one of the preceding and following audio frames exists. In this case, the control unit 110 can only generate an audio frame from the audio frame of interest and the one audio frame with which a correlation coefficient can be taken. Taking the case where the audio frame of interest is the first frame, an audio frame is generated from the first audio frame and the audio frame that follows it. In such a case, the control unit 110 can make the determination in step S103 without calculating one of the correlation coefficients in step S102, on the grounds that the sample values do not exist.

However, a correlation coefficient can also be assigned as appropriate when the corresponding sample values do not exist. For example, when the audio frame of interest is the first frame, c_a cannot be obtained from Equation 1 because the required sample values do not exist. If, however, c_a is set to a suitable value satisfying c_a ≤ -√(x_0^2 + x_1^2 + ... + x_{N-1}^2), then c_a ≤ c_b holds in step S103, and the following (future-side) audio frame can be determined to have the higher correlation. This is because a value of c_a calculated with Equation 1 is never less than -√(x_0^2 + x_1^2 + ... + x_{N-1}^2).
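The sentinel idea can be sketched as follows. The `correlation` form is an assumption (Equations 1 and 2 are not reproduced in this text); with that form the Cauchy-Schwarz inequality gives the stated lower bound. The sketch places the sentinel strictly below the bound by subtracting 1, which is an illustrative choice so that the same value also resolves the last frame toward the past side.

```python
import math

def correlation(cur, neighbor):
    # Assumed form: inner product normalized by the neighbor's energy,
    # which can never fall below -sqrt(sum of cur[i]^2).
    energy = sum(v * v for v in neighbor)
    if energy == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(cur, neighbor)) / math.sqrt(energy)

def side_for_frame(frames, j):
    cur = frames[j]
    # Any value at or below -sqrt(x_0^2 + ... + x_{N-1}^2) works for the first frame.
    sentinel = -math.sqrt(sum(v * v for v in cur)) - 1.0
    c_a = correlation(cur, frames[j - 1]) if j > 0 else sentinel
    c_b = correlation(cur, frames[j + 1]) if j + 1 < len(frames) else sentinel
    return "past" if c_a > c_b else "future"

# Worst case: each neighbor is the exact negative of the frame of interest,
# so the real correlation sits exactly on the lower bound.
frames = [[1.0, 2.0], [-1.0, -2.0], [1.0, 2.0]]
first_side = side_for_frame(frames, 0)
last_side = side_for_frame(frames, 2)
```

Even in this worst case the first frame resolves to the future side and the last frame to the past side, so step S103 needs no special branch for the boundary frames.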

With this configuration, an audio frame is generated from the audio frame of interest and whichever of the past-side and future-side audio frames has the higher correlation with it, and the generated frame is inserted between those two audio frames. Deterioration of the waveform can therefore be reduced, particularly when reproducing an audio waveform in a transition period. In addition, of the two audio frames, the past-side audio frame is multiplied by a weighting coefficient that starts at 0 and ends at 1, and the future-side audio frame is multiplied by a weighting coefficient that starts at 1 and ends at 0. The generated audio frame is therefore connected to these two audio frames with the continuity of the waveform preserved.

The present invention is not limited to the above embodiment, and various modifications and applications are possible.
For example, although the above embodiment describes doubling the audio signal, the signal can be expanded by any integer multiple, such as 3 or 4 times. When expanding by a factor of m, the control unit 110 selects the partial audio waveforms used to generate the expanded waveform as follows. Here, the start position of the partial audio waveform of interest is taken as 0, and the length of the processing unit is N.
1) When the past-side correlation is larger than the future-side correlation:
the partial audio waveform from (1-m)N to 0 and the partial audio waveform from 0 to (m-1)N
2) When the future-side correlation is larger than the past-side correlation:
the partial audio waveform from (2-m)N to N and the partial audio waveform from N to mN

The weighting coefficients in Equations 3 and 4 then become i/((m-1)N-1) instead of i/(N-1), and ((m-1)N-1-i)/((m-1)N-1) instead of (N-1-i)/(N-1). It is desirable to obtain the correlation coefficient c_a from the partial audio waveform from (1-m)N to 0 and the partial audio waveform from 0 to (m-1)N, and the correlation coefficient c_b from the partial audio waveform from (2-m)N to N and the partial audio waveform from N to mN.
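The segment boundaries and weights for an m-fold expansion can be tabulated with a short sketch. The function names are illustrative assumptions, and positions are expressed relative to the start of the partial audio waveform of interest (position 0), as in the description. How the blended waveform is spliced in for m greater than 2 is not spelled out in the text, so the sketch stops at selecting segments and weights.

```python
def segments_for_m(m, n, side):
    """Return the two (start, stop) ranges to blend, relative to position 0."""
    if side == "past":
        return ((1 - m) * n, 0), (0, (m - 1) * n)
    return ((2 - m) * n, n), (n, m * n)

def weights_for_m(m, n):
    """Weights i/((m-1)N-1) and ((m-1)N-1-i)/((m-1)N-1) for i = 0 .. (m-1)N-1."""
    length = (m - 1) * n
    rising = [i / (length - 1) for i in range(length)]
    falling = [(length - 1 - i) / (length - 1) for i in range(length)]
    return rising, falling

doubling = segments_for_m(2, 4, "past")
tripling = segments_for_m(3, 4, "future")
rising, falling = weights_for_m(3, 4)
```

With m = 2 the ranges reduce to the preceding frame and the frame of interest (past side) or the frame of interest and the following frame (future side), which matches Equations 3 and 4.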

The weighting coefficients used in Equations 3 and 4 above are only an example. Any sequence of N values a_k that starts at 0 and ends at 1 (i/(N-1) in the above embodiment) and any sequence of N values b_k that starts at 1 and ends at 0 ((N-1-i)/(N-1) in the above embodiment) may be used. However, the relationship a_k + b_k = 1 must hold for each k (from 0 to N-1).
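As one illustration of an alternative weight pair, the sketch below uses a raised-cosine ramp. This particular shape is an assumption chosen for the example and is not named in the description; it is included only because it starts at 0, ends at 1, and satisfies a_k + b_k = 1 by construction.

```python
import math

def raised_cosine_weights(n):
    """Return (a, b) with a rising 0 -> 1, b falling 1 -> 0, and a[k] + b[k] = 1."""
    a = [0.5 * (1.0 - math.cos(math.pi * k / (n - 1))) for k in range(n)]
    b = [1.0 - v for v in a]
    return a, b

def blend(past, future, a, b):
    """Weighted addition with the past frame on a and the future frame on b."""
    return [a[k] * past[k] + b[k] * future[k] for k in range(len(past))]

a, b = raised_cosine_weights(8)
generated = blend([1.0] * 8, [3.0] * 8, a, b)
```

Compared with the linear weights of the embodiment, this ramp changes more slowly near both ends of the generated frame.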

Instead of expanding all of the audio waveform data to be processed, only part of it may be expanded. In this case, the audio processing device 100 receives, via the input device 12, a designation of the portion to be expanded together with the expansion multiple, and the control unit 110 expands only the designated portion by the designated multiple and leaves the remaining portions unchanged. The processing result is then stored in the storage unit 150 or stored on the recording medium 17 via the data recording unit 170.

The audio processing device 100 may also perform the waveform expansion processing only on intermittent waveform sections. In this case, the control unit 110, for example, counts the position of each waveform section from the beginning and performs the waveform expansion processing only when the counter value is divisible by 2.

In the above embodiment, the waveform section is an audio frame of length N, but the waveform section need not have a fixed length. For example, the pitch of the audio waveform may be extracted by a known technique such as spectrum analysis using the fast Fourier transform or the cepstrum method, and the audio waveform data may be divided into waveform sections whose length corresponds to that pitch before the waveform expansion processing described above is performed. In this case, the control unit 110 extracts the pitch from the audio waveform data stored in the storage unit 150 and performs the waveform expansion processing using the length corresponding to the frequency of the extracted pitch as the length of the waveform section. When, for example, the pitch changes partway through the audio signal, the control unit 110 performs the waveform expansion processing separately for each portion in which the same pitch continues.
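To illustrate choosing the waveform section length from the pitch, the sketch below estimates the pitch period by a plain time-domain autocorrelation search. This is a substitute technique picked for brevity, not the FFT-based spectrum analysis or cepstrum method named above, and the function name, the lag range, and the test tone are all assumptions.

```python
import math

def estimate_period(samples, min_lag, max_lag):
    """Return the lag (in samples) whose autocorrelation is largest."""
    def autocorr(lag):
        return sum(samples[t] * samples[t + lag] for t in range(len(samples) - lag))
    return max(range(min_lag, max_lag + 1), key=autocorr)

fs = 8000  # sampling frequency used as the example in the embodiment (8 kHz)
# A 400 Hz tone has a period of fs / 400 = 20 samples.
tone = [math.sin(2.0 * math.pi * 400.0 * t / fs) for t in range(200)]
frame_length = estimate_period(tone, 5, 60)
```

The estimated period would then serve as the waveform section length N for the portion of the signal over which that pitch persists.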

The audio processing apparatus 100 may also accept analog audio input. In this case, the audio processing apparatus 100 further includes an audio sampling unit that samples the analog audio data by a method such as PCM (Pulse Code Modulation). The audio processing apparatus 100 may also D/A-convert the expanded audio signal and output it.
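For illustration only, the quantization stage of linear PCM can be modelled in software as below. The 16-bit depth, the normalized input range of -1.0 to 1.0 and the names `to_pcm16` and `from_pcm16` are assumptions; the text specifies neither a bit depth nor a sampling rate, and an actual audio sampling unit would do this in hardware.

```python
def to_pcm16(values):
    """Quantize samples in [-1.0, 1.0] to signed 16-bit integer codes, clipping out-of-range input."""
    return [int(round(max(-1.0, min(1.0, v)) * 32767)) for v in values]


def from_pcm16(codes):
    """Map 16-bit integer codes back to floating-point samples, as a D/A stage would."""
    return [c / 32767 for c in codes]


codes = to_pcm16([0.0, 1.0, -1.0, 2.0])  # the last value is clipped to full scale
restored = from_pcm16(codes)
```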

The audio processing apparatus 100 may further include a communication control unit that communicates with other devices via a communication network such as the Internet, and may transmit the expanded audio waveform data to another device via this communication control unit. It may also receive audio waveform data from another device via this communication control unit and expand it.

Note that the information processing apparatus for realizing the audio processing apparatus 100 according to the embodiment of the present invention can be realized using an ordinary computer system rather than a dedicated system. For example, a program for executing the above-described operations may be stored on a computer-readable recording medium (FD, CD-ROM, DVD, etc.) and distributed, and the program may be installed on a general-purpose computer to configure the audio processing apparatus 100 that executes the above-described processing. Alternatively, the program may be stored on a disk device of a server device on a communication network such as the Internet and, for example, downloaded to a computer.

Further, when the OS takes over part of the above-described processing, or when the OS constitutes part of the constituent elements of the present invention, a program excluding that part may be stored on the recording medium and distributed, or may be downloaded to a computer. In this case too, the recording medium stores a program for executing each function or each step to be executed by the computer.

A block diagram of the audio processing apparatus according to the embodiment of the present invention.

A flowchart for explaining the waveform expansion process according to the embodiment of the present invention.

Explanation of symbols

100 ... audio processing apparatus, 110 ... control unit, 120 ... input control unit, 12 ... input device, 130 ... output control unit, 13 ... output device, 140 ... program storage unit, 150 ... storage unit, 170 ... data recording unit, 17 ... recording medium

Claims

1. An audio expansion device that, for each of a plurality of waveform sections in an audio waveform, generates a waveform based on the waveform of that waveform section and of a waveform section in its vicinity, and outputs an expanded waveform based on the generated waveform, the device comprising:
discriminating means for determining, for each waveform section, which of the waveform sections immediately before and immediately after that waveform section has a waveform more similar to the waveform of that waveform section;
generating means for generating a waveform by computing a predetermined weighted addition of the waveform of a section including the waveform section and the waveform of a section including the waveform section determined by the discriminating means to have the more similar waveform;
waveform connecting means for inserting the waveform generated by the generating means between the waveform section and the waveform section determined to have the more similar waveform; and
means for outputting the expanded waveform,
wherein the section including the waveform section and the section including the waveform section determined by the discriminating means to have the more similar waveform adjoin each other at the boundary between the waveform section and the waveform section determined by the discriminating means to have the more similar waveform.

2. The audio expansion device according to claim 1, wherein the discriminating means
obtains correlation coefficients between the waveform section and each of the waveform sections immediately before and after it, and
determines, based on the obtained correlation coefficients, which of the waveform sections immediately before and after the waveform section has the waveform more similar to that of the waveform section.

3. The audio expansion device according to claim 1 or 2, wherein the generating means
determines the temporal order of the waveform section and the waveform section determined to have the more similar waveform, and
performs the weighted addition using a weighting coefficient that starts at 1 and ends at 0 for the temporally later waveform and a weighting coefficient that starts at 0 and ends at 1 for the temporally earlier waveform.

4. The audio expansion device according to claim 1, further comprising audio dividing means for dividing the audio waveform to be processed into the plurality of waveform sections.

5. The audio expansion device according to any one of claims 1 to 4,
further comprising accepting means for accepting a designation of a portion of the audio waveform to be expanded,
wherein only the portion designated via the accepting means is expanded and the other portions are not expanded.

6. An audio expansion method for generating, for each of a plurality of waveform sections in an audio waveform, a waveform based on the waveform of that waveform section and of a waveform section in its vicinity, and outputting an expanded waveform based on the generated waveform, the method comprising:
a discriminating step of determining, for each waveform section, which of the waveform sections immediately before and immediately after that waveform section has a waveform more similar to the waveform of that waveform section;
a generating step of generating a waveform by computing a predetermined weighted addition of the waveform of a section including the waveform section and the waveform of a section including the waveform section determined in the discriminating step to have the more similar waveform;
a waveform connecting step of inserting the waveform generated in the generating step between the waveform section and the waveform section determined to have the more similar waveform; and
a step of outputting the expanded waveform,
wherein the section including the waveform section and the section including the waveform section determined in the discriminating step to have the more similar waveform adjoin each other at the boundary between the waveform section and the waveform section determined in the discriminating step to have the more similar waveform.

7. A program for causing a computer to function as an audio expansion device that:
determines, for each waveform section in an audio waveform, which of the waveform sections immediately before and immediately after that waveform section has a waveform more similar to the waveform of that waveform section;
generates a waveform by computing a predetermined weighted addition of the waveform of a section that includes the waveform section and the waveform of a section that includes the waveform section determined to have the more similar waveform and that adjoins the former section at the boundary between the waveform section and the waveform section determined to have the more similar waveform;
inserts the generated waveform between the waveform section and the waveform section determined to have the more similar waveform; and
outputs the expanded waveform.