JP5182112B2

JP5182112B2 - Decoding device and speech coding method estimation method

Info

Publication number: JP5182112B2
Application number: JP2009007496A
Authority: JP
Inventors: 道也鈴木
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2009-01-16
Filing date: 2009-01-16
Publication date: 2013-04-10
Anticipated expiration: 2029-01-16
Also published as: JP2010164809A

Description

The present invention relates to a decoding device and a method for estimating a speech coding scheme.

In recent years, a variety of speech coding schemes targeting the voice band have been developed, and speech data encoded with them (hereinafter, encoded data) is used in communications. Under Japan's Radio Act, however, transmitting communication waves that carry encoded data in a speech coding scheme not notified in advance is prohibited. When communication waves are monitored to prevent terrorism, crime, and the like, the speech coding scheme of a transmitted wave must be estimated so that it can be determined whether that scheme has been notified.

One way to estimate the speech coding scheme is to decode the encoded data in turn with the decoders of the speech coding schemes whose standards are public, and to judge by whether speech is output.

Patent Document 1 proposes a method that builds, for the encoded data, a distribution based on the characteristics of each speech coding scheme and automatically identifies the scheme from the features of that distribution.

Patent Document 1: JP 2007-243650 A

Decoding the encoded data in turn with the decoders of every publicly standardized speech coding scheme and judging whether speech was output requires a circuit that evaluates the output level or frequency band of the audio, which complicates the decoding device. Judging from the output level or output frequency band can also produce errors caused by background noise and similar effects. An operator could instead judge by ear, but that demands considerable effort. Moreover, about 20 speech coding schemes for the voice band currently have standards published by bodies such as ITU-T (International Telecommunication Union Telecommunication Standardization Sector) and ETSI (European Telecommunications Standards Institute). Decoding with all of them takes a long time before the scheme can be estimated.

In Patent Document 1, moreover, the processing used to identify the encoded data differs for each speech coding scheme. Because different processing must be run for each scheme, identification takes time.

An object of the present invention is to provide a decoding device and a speech coding scheme estimation method that can estimate the speech coding scheme of encoded data in a short time.

To achieve this object, the decoding device of the present invention has: a data input unit that receives, from outside, encoded data encoded with an unknown speech coding scheme; a decoder unit comprising a plurality of decoders, each able to decode a different speech coding scheme; and a data processing unit that estimates the frame length of a speech frame by frequency analysis of the encoded data received from the data input unit, identifies the speech coding scheme of the encoded data based on the estimated frame length, selects from the decoder unit the decoder corresponding to the identified scheme, and has that decoder decode the encoded data.

According to the present invention, the speech coding scheme of encoded data can be estimated in a short time.

FIG. 1 is a block diagram showing the configuration of the decoding device of the first embodiment.
FIG. 2 is a flowchart showing the processing procedure of the data processing unit shown in FIG. 1.
FIG. 3 is a waveform diagram of the frequency spectrum of G.729 CS-ACELP encoded data.
FIG. 4 is a distribution diagram of the run-length histogram of G.729 CS-ACELP (8.0 kbps) encoded data.
FIG. 5 is a distribution diagram of the run-length histogram of G.729 CS-ACELP (6.4 kbps) encoded data.
FIG. 6 is a distribution diagram of the run-length histogram of G.729 CS-ACELP (11.8 kbps) encoded data.
FIG. 7 is a distribution diagram of the run-length histogram of ACELP (6.7 kbps) encoded data.
FIG. 8 is a distribution diagram of the run-length histogram of LD-CELP (16 kbps) encoded data.
FIG. 9 is a distribution diagram of the run-length histogram of DoD MELP (54 kbps) encoded data.
FIG. 10 is a distribution diagram of the run-length histogram of G.726 ADPCM (16 kbps) encoded data.
FIG. 11 is a distribution diagram of the run-length histogram of G.726 ADPCM (24 kbps) encoded data.
FIG. 12 is a distribution diagram of the run-length histogram of G.726 ADPCM (32 kbps) encoded data.
FIG. 13 is a distribution diagram of the run-length histogram of G.726 ADPCM (40 kbps) encoded data.
FIG. 14 is a block diagram showing the configuration of the decoding device of the second embodiment.

The present invention is now described in detail with reference to the drawings.

Speech coding schemes for the voice band whose standards are currently public include G.729 CS-ACELP (Conjugate Structure Algebraic Code Excited Linear Prediction), G.728 LD-CELP (Low Delay Code Excited Linear Prediction), ACELP (Algebraic Code Excited Linear Prediction) specified by the Association of Radio Industries and Businesses (ARIB), MELP (Mixed Excitation Linear Prediction) specified by the United States Department of Defense (DoD), G.726 ADPCM (Adaptive Differential Pulse Code Modulation), and G.722 SB-ADPCM (Sub-Band Adaptive Differential Pulse Code Modulation).

Original speech data is encoded one speech frame at a time, with the frame size defined by each speech coding scheme. With respect to the continuity of data values, encoded data has characteristics that differ for each combination of speech coding scheme and bit rate (hereinafter, coding specification). Within one coding specification, the bit positions in the speech frame at which the data value changes from 0 to 1 or from 1 to 0, and the run length, that is, the number of consecutive identical data values starting at that bit position, tend to be similar.

Because encoded data takes similar values at the period at which speech frames are generated, frequency analysis of the encoded data with a fast Fourier transform (hereinafter, FFT) shows stronger components at integer multiples of the frame generation frequency. This can be used to estimate the frame length of the speech frame. Since the frame length is defined for each speech coding scheme and bit rate, the speech coding scheme of the encoded data can be identified from the estimated frame length.

The decoding device of the present invention estimates the frame length of the speech frame by frequency analysis of encoded data received from outside (hereinafter, received data) and identifies the speech coding scheme of the received data based on the estimated frame length.

The decoding device of the present invention also builds a histogram of run lengths for each bit position in the speech frame of the received data and identifies the speech coding scheme of the received data by comparing it with run-length histograms prepared in advance for each coding specification.

(First embodiment)
FIG. 1 is a block diagram showing the configuration of the decoding device according to the first embodiment of the present invention.

As shown in FIG. 1, the decoding device 1 of the first embodiment has a data input unit 10, a data processing unit 11, a decoder unit 12, a DA converter 13, and a database 14.

The data input unit 10 outputs to the data processing unit 11 the received data input from an external communication device or from a storage device such as a hard disk (neither shown).

The data processing unit 11 performs frequency analysis on the received data from the data input unit 10, estimates the frame length of the speech frame, and identifies the speech coding scheme of the received data based on the estimated frame length.

The data processing unit 11 also prepares in advance, for each coding specification, a histogram of run lengths for each bit position in the speech frame, computed over encoded data of a predetermined size (hereinafter, reference histogram). The data processing unit 11 extracts from the received data a block of the same size as that used to build the reference histograms and builds a histogram of run lengths for each bit position in the speech frame (hereinafter, run-length histogram). The data processing unit 11 compares the run-length histogram with the reference histograms and identifies the speech coding scheme of the received data.

By comparing the run-length histogram with a reference histogram, the data processing unit 11 also estimates the bit-position offset within the speech frame, discards the data corresponding to the estimated offset, and thereby compensates for the loss of speech frame synchronization.

Furthermore, the data processing unit 11 switches the input switch 15 and the output switch 16 of the decoder unit 12 according to the identified speech coding scheme and outputs the received data to the decoder in the decoder unit 12 for that scheme.

The decoder unit 12 comprises a plurality of decoders, each able to decode a different speech coding scheme. Examples include a CS-ACELP decoder, an LD-CELP decoder, an ACELP decoder, a MELP decoder, an ADPCM decoder, and an SB-ADPCM decoder. The decoder unit 12 also has an input switch 15 and an output switch 16 for selecting the decoder that performs decoding. The decoder unit 12 decodes the received data from the data processing unit 11 with the decoder selected by the data processing unit 11 and outputs the result (hereinafter, decoded data) to the DA converter 13.

The DA converter 13 converts the decoded data received from the decoder unit 12 into analog values and outputs them.

The database 14 is a storage device that stores the reference histograms.

The data processing unit 11 can be implemented, for example, as an LSI made up of various logic circuits.

The processing procedure of the data processing unit 11 of the decoding device 1 shown in FIG. 1 is described next with reference to the flowchart of FIG. 2.

When processing starts, the data processing unit 11 first waits until it receives received data from the data input unit 10 (step S1).

On receiving the received data from the data input unit 10, the data processing unit 11 estimates the frame length of the speech frame of the received data (step S2). Within one coding specification, the bit positions in the speech frame at which the data value changes from 0 to 1 or from 1 to 0, and the run length of identical data values from that position onward, tend to be similar. Because the encoded data takes similar values at the frame generation period, a frequency spectrum obtained by transforming the encoded data onto the frequency axis shows stronger components at integer multiples of the frame generation frequency. A method of estimating the frame length from this property is described using the G.729 CS-ACELP speech coding scheme as an example. FIG. 3 is a waveform diagram of the frequency spectrum of G.729 CS-ACELP encoded data. The horizontal axis of FIG. 3 shows frequency, and the vertical axis shows the strength of the component at each frequency. Of the speech coding schemes for the voice band whose standards are currently public, G.722 SB-ADPCM uses a sampling frequency of 16 kHz and the others use 8 kHz. In the present invention, the highest frequency extracted by the FFT is 4 kHz, half the lowest of the assumed sampling frequencies. In G.729 CS-ACELP, the sampling frequency is 8 kHz and the frame length of the speech frame is 80 bits, so speech frames are generated at 100 Hz. The spectrum in FIG. 3 shows stronger components at multiples of 100 Hz. In other words, the frame length of the speech frame of the encoded data can be estimated by measuring the interval at which the spectral strength rises.

Accordingly, the data processing unit 11 first uses the FFT to transform the received data onto the frequency axis and obtain a frequency spectrum.
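The effect this step relies on can be seen in a minimal sketch. It is not the patent's implementation: the 8-bit frame pattern and the perfectly periodic toy stream are assumptions chosen so the result is easy to check, and a plain DFT (which yields the same values as an FFT) is used to keep the example dependency-free.

```python
import cmath

def magnitude_spectrum(bits):
    """Magnitude of the DFT of a 0/1 sequence for bins 0..n/2, with the mean removed."""
    n = len(bits)
    mean = sum(bits) / n
    centered = [b - mean for b in bits]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                  for t, v in enumerate(centered))
        spectrum.append(abs(acc))
    return spectrum

# Toy stream (assumption): one 8-bit frame repeated 16 times, 128 bits in total.
FRAME = [1, 1, 0, 1, 0, 0, 0, 0]
stream = FRAME * 16

spectrum = magnitude_spectrum(stream)
# Only bins at multiples of the frame repetition rate carry energy.
strong_bins = [k for k, m in enumerate(spectrum) if m > 1e-6]
bin_spacing = strong_bins[1] - strong_bins[0]
frame_length = len(stream) // bin_spacing
```

Real encoded data is only approximately periodic, so the components between the harmonics are small rather than zero, which is why a thresholded peak search is needed afterwards.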

Next, the data processing unit 11 extracts the peaks of spectral strength. It scans the spectrum over a preset range from low to high frequency, comparing the strength at each frequency with a preset upper threshold and lower threshold. Between the point where the strength reaches or exceeds the upper threshold and the point where it falls to or below the lower threshold, the location of maximum strength is extracted as a peak.
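The two-threshold scan described above can be sketched as follows. The function name `extract_peaks` and the choice to drop a region that is still open when the scan ends are assumptions; the text does not say how that case is handled.

```python
def extract_peaks(spectrum, upper, lower):
    """Return the index of the maximum in each region that starts when the
    strength reaches `upper` and ends when it falls to `lower` or below."""
    peaks = []
    in_region = False
    best = None
    for i, value in enumerate(spectrum):
        if not in_region:
            if value >= upper:
                in_region = True
                best = i
        elif value <= lower:
            peaks.append(best)
            in_region = False
        elif value > spectrum[best]:
            best = i
    # A region still open at the end of the scan is dropped (assumption).
    return peaks

example = [0, 1, 5, 9, 6, 1, 0, 7, 8, 2, 0, 3, 0]
found = extract_peaks(example, upper=5, lower=2)
```

Using two thresholds rather than one keeps small ripples near a single threshold from being counted as separate peaks.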

The data processing unit 11 then measures the spacing between adjacent peaks and calculates the mean of all peak spacings. Because the spectrum is laid out on the frequency axis, this mean spacing is a frequency. The encoded data is considered to take similar values at the period given by the reciprocal of that frequency, so speech frames can be presumed to be generated at that frequency. The sampling frequency divided by that frequency is therefore the estimated frame length of the speech frame. In the spectrum of G.729 CS-ACELP encoded data shown in FIG. 3, the mean peak spacing is 100 Hz and the sampling frequency is 8 kHz, so the frame length is estimated as 8000 / 100 = 80 bits. Because the frame length is defined by the coding specification, the coding specification can be estimated from the estimated frame length. At step S2 the sampling frequency of the received data is not yet known, so the frame length is calculated on the assumption that it is 8 kHz. The frame length calculated this way is called the specified frame length. When the speech coding scheme is G.722 SB-ADPCM, the sampling frequency is 16 kHz and the specified frame length is half the actual frame length.
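The arithmetic of this step can be written out as a short sketch. It follows the worked example in the text (8 kHz and a 100 Hz peak spacing giving 80); the function name and the rounding to the nearest integer are assumptions.

```python
def estimate_frame_length(peak_freqs_hz, assumed_fs_hz=8000):
    """Estimate the specified frame length from the peak frequencies of the spectrum."""
    spacings = [b - a for a, b in zip(peak_freqs_hz, peak_freqs_hz[1:])]
    if not spacings:
        raise ValueError("at least two peaks are needed")
    mean_spacing_hz = sum(spacings) / len(spacings)
    # Sampling frequency divided by the frame generation frequency.
    return round(assumed_fs_hz / mean_spacing_hz)

g729_like = estimate_frame_length([100, 200, 300, 400])
```

Averaging over all spacings reduces the effect of a single peak being located one bin off.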

Having estimated the specified frame length, the data processing unit 11 compares it with the specified frame lengths of all coding specifications whose standards are public (step S3). Because the specified frame length was calculated at step S2 on the assumption of an 8 kHz sampling frequency, for a speech coding scheme with a 16 kHz sampling frequency, half the actual frame length of that coding specification is used as its specified frame length. When the estimated specified frame length matches that of a public coding specification, the data processing unit 11 picks that coding specification as a candidate for the coding specification of the received data.
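The comparison at step S3 amounts to a table lookup, sketched below. The table is illustrative only: the 80-bit G.729 entry comes from the text, the 64-bit and 118-bit entries are derived from the bit rate times a 10 ms frame (an assumption not stated in the text), and the 16 kHz entry is entirely hypothetical and is there only to show the halving rule.

```python
ASSUMED_FS_HZ = 8000

SPEC_TABLE = [
    # (scheme, bit rate in kbps, sampling frequency in Hz, actual frame length in bits)
    ("G.729 CS-ACELP", 8.0, 8000, 80),
    ("G.729 CS-ACELP", 6.4, 8000, 64),      # assumed: 6.4 kbps x 10 ms
    ("G.729 CS-ACELP", 11.8, 8000, 118),    # assumed: 11.8 kbps x 10 ms
    ("hypothetical 16 kHz scheme", 64.0, 16000, 160),
]

def specified_frame_length(actual_bits, fs_hz):
    """Frame length as it appears when 8 kHz sampling is assumed."""
    return actual_bits * ASSUMED_FS_HZ // fs_hz

def pick_candidates(estimated_bits, table=SPEC_TABLE):
    """Return every (scheme, bit rate) whose specified frame length matches."""
    return [(scheme, kbps) for scheme, kbps, fs_hz, bits in table
            if specified_frame_length(bits, fs_hz) == estimated_bits]

candidates = pick_candidates(80)
```

With this table an estimate of 80 yields two candidates, which is the situation where the run-length histogram comparison is needed to decide between them.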

If none of the public coding specifications is picked as a candidate for the coding specification of the received data, the data processing unit 11 judges the speech coding scheme of the received data to be unknown and moves to step S9, where it notifies the user that the speech coding scheme is unknown (step S9) and ends processing.

If one or more coding specifications are picked as candidates, the data processing unit 11 moves to step S4 and creates the run-length histogram of the received data (step S4).

FIG. 4 is a distribution diagram of the run-length histogram of G.729 CS-ACELP (8.0 kbps) encoded data. The left plot of FIG. 4 shows the histogram of runs of 0s and the right plot shows the histogram of runs of 1s. The Y axis shows the bit position within the speech frame, the X axis shows the run length, and the Z axis shows the count of each run length on a logarithmic scale. The maximum value on the Y axis is the specified frame length estimated at step S2.

An example procedure for creating the run-length histogram distribution of FIG. 4 follows. The data processing unit 11 scans the received data from the beginning. When the data value changes from 1 to 0, it takes the bit position within the speech frame at which the change occurred as the Y value and the number of consecutive 0s starting at that position as the X value, and adds the point to the 0-run histogram. Likewise, when the data value changes from 0 to 1, it takes the bit position of the change as the Y value and the number of consecutive 1s starting there as the X value, and adds the point to the 1-run histogram. The length of a run of consecutive 0s is the 0 run length, and the length of a run of consecutive 1s is the 1 run length.
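The procedure above can be sketched in a few lines. The function name and the handling of the two incomplete runs are assumptions: the first run is skipped because no transition precedes it, and the last run is skipped because the data ends before its length is known.

```python
from collections import Counter

def run_length_histograms(bits, frame_len):
    """Return (hist0, hist1), each a Counter keyed by (run length x, bit position y)."""
    hist = (Counter(), Counter())
    n = len(bits)
    start = 0
    for i in range(1, n + 1):
        if i == n or bits[i] != bits[start]:
            # The run covers bits[start:i]; count it only if it begins at a
            # transition and ends before the data does.
            if start > 0 and i < n:
                hist[bits[start]][(i - start, start % frame_len)] += 1
            start = i
    return hist

hist0, hist1 = run_length_histograms([1, 0, 0, 1, 1, 1, 0, 1, 0, 0], frame_len=4)
```

Here `hist0[(2, 1)]` is 1 because a run of two 0s starts at bit position 1 of a frame.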

Having created the run-length histogram of the received data, the data processing unit 11 compares it with the reference histograms of all coding specifications picked as candidates at step S3 (step S5).

FIGS. 5 to 13 are distribution diagrams of the run-length histograms of encoded data for G.729 CS-ACELP (6.4 kbps), G.729 CS-ACELP (11.8 kbps), ACELP (6.7 kbps), LD-CELP (16 kbps), DoD MELP (54 kbps), and G.726 ADPCM (16 kbps, 24 kbps, 32 kbps, and 40 kbps), respectively. As FIGS. 4 to 13 show, the run-length histogram has characteristics that differ by coding specification. Within one coding specification, the bit positions in the speech frame at which the data value changes from 0 to 1 or from 1 to 0, and the run lengths at those positions, tend to be similar. Therefore, comparing the run-length histogram created at step S4 with the reference histograms prepared in advance for each public coding specification and evaluating their correlation allows the speech coding scheme of the received data to be estimated.

Let f₀(x, y) be the 0-run histogram created at step S4 and f₁(x, y) the 1-run histogram. Let g₀(x, y) be the 0-run reference histogram prepared in advance for each public coding specification and g₁(x, y) the 1-run reference histogram. Here x is the run length at a bit position within the speech frame; with M the maximum run length, 1 ≤ x ≤ M. y is the bit position within the speech frame; with N the specified frame length of the speech frame, 0 ≤ y ≤ N − 1.

The decoding device 1 is not necessarily synchronized with the encoding device (not shown), so it does not necessarily receive the encoded data from the start of a frame produced by the encoding device. The correlation between the run-length histogram created at step S4 and a reference histogram must therefore be evaluated with the bit-position offset within the speech frame taken into account.

The data processing unit 11 therefore computes the correlation value with the reference histogram while shifting the bit positions of the run-length histogram within the speech frame by an offset k. k lies within the range of bit positions in the speech frame and never reaches the specified frame length.

As equation (1) shows, the correlation value is calculated as the sum, over the bit positions in the speech frame, of the absolute differences between the run-length histogram and the reference histogram for the 0 run lengths and for the 1 run lengths.

Over the range 0 ≤ k ≤ N − 1, the k that minimizes the correlation value calculated by equation (1) can be taken as the bit-position offset within the speech frame.
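Equation (1) itself is not reproduced in this text. One form consistent with the description and with the definitions of f₀, f₁, g₀, g₁, M, N, and k is given below as a reconstruction; the symbol C(k), the direction of the shift, and the cyclic wrap of the bit position modulo N are assumptions.

```latex
C(k) = \sum_{y=0}^{N-1} \sum_{x=1}^{M}
       \Bigl( \bigl| f_0\bigl(x, (y+k) \bmod N\bigr) - g_0(x, y) \bigr|
            + \bigl| f_1\bigl(x, (y+k) \bmod N\bigr) - g_1(x, y) \bigr| \Bigr),
\qquad 0 \le k \le N-1
```

Because it is a sum of absolute differences, a smaller value means a closer match, which is why the minimum over k is sought.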

The data processing unit 11 calculates the correlation value against the reference histograms of all coding specifications picked as candidates at step S3. It estimates that the speech coding scheme of the reference histogram with the smallest correlation value is the speech coding scheme of the received data. If only one coding specification was picked as a candidate at step S3, the speech coding scheme of that specification is estimated to be that of the received data.
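The search over offsets and candidates can be sketched as below. Histograms are plain dictionaries keyed by (run length, bit position); the function names, the cyclic shift, and the shift direction are assumptions, since the exact form of equation (1) is not given in this text.

```python
def correlation(f0, f1, g0, g1, frame_len, k):
    """Sum of absolute differences with the received histograms shifted by k positions."""
    total = 0
    for f, g in ((f0, g0), (f1, g1)):
        # Received position y corresponds to reference position (y - k) mod N.
        shifted = {(x, (y - k) % frame_len): c for (x, y), c in f.items()}
        for key in set(shifted) | set(g):
            total += abs(shifted.get(key, 0) - g.get(key, 0))
    return total

def best_shift(f0, f1, g0, g1, frame_len):
    """Return (k, value) for the offset k that minimizes the correlation value."""
    scored = ((k, correlation(f0, f1, g0, g1, frame_len, k)) for k in range(frame_len))
    return min(scored, key=lambda kv: kv[1])

def classify(f0, f1, references, frame_len):
    """Return (name, k, value) of the best-matching reference, or None if there are none."""
    best = None
    for name, (g0, g1) in references.items():
        k, value = best_shift(f0, f1, g0, g1, frame_len)
        if best is None or value < best[2]:
            best = (name, k, value)
    return best

# Toy data: the received histograms equal reference "A" displaced by 2 bit positions.
refs = {"A": ({(2, 1): 5, (1, 3): 2}, {(3, 0): 4}),
        "B": ({(1, 0): 3}, {(1, 1): 3})}
result = classify({(2, 3): 5, (1, 1): 2}, {(3, 2): 4}, refs, frame_len=4)
```

The returned value can then be compared with a preset threshold, and the returned offset used to realign the speech frames.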

続いて、データ処理部１１は、ステップＳ５で算出した相関値が、あらかじめ設定された閾値以下であるか否かを判別する（ステップＳ６）。ステップＳ５で算出した相関値があらかじめ設定された閾値以下でない場合、データ処理部１１は、受信データの音声符号化方式が不明と判断し、ステップＳ９へ移行する。ステップＳ９へ移行すると、データ処理部１１は、利用者に音声符号化方式が不明であることを通知し（ステップＳ９）、処理を終了する。 Subsequently, the data processing unit 11 determines whether or not the correlation value calculated in step S5 is equal to or less than a preset threshold value (step S6). If the correlation value calculated in step S5 is not less than or equal to a preset threshold value, the data processing unit 11 determines that the speech encoding method of the received data is unknown, and proceeds to step S9. When the process proceeds to step S9, the data processing unit 11 notifies the user that the speech encoding method is unknown (step S9), and ends the process.

一方、ステップＳ５で算出した相関値があらかじめ設定された閾値以下の場合、データ処理部１１は、受信データがステップＳ５で推定した符号化仕様の音声符号化方式で符号化されたと判断し、ステップＳ７へ移行する。 On the other hand, if the correlation value calculated in step S5 is less than or equal to a preset threshold value, the data processing unit 11 determines that the received data has been encoded by the speech encoding method of the encoding specification estimated in step S5. The process proceeds to S7.

In step S7, the data processing unit 11 discards, from the head of the received data, the data corresponding to the bit-position shift within the speech frame estimated in step S6, thereby aligning speech frame synchronization (step S7).
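A minimal sketch of this alignment step follows. It assumes the shift is expressed as the number of leading bits that precede the first frame boundary and that the received data is held as a list of bits; both are assumptions, and `align_frames` and `split_frames` are hypothetical helper names.

```python
def align_frames(received_bits, bit_shift):
    """Drop the first bit_shift bits so the data starts on a frame boundary."""
    if bit_shift < 0 or bit_shift > len(received_bits):
        raise ValueError("bit_shift out of range")
    return received_bits[bit_shift:]


def split_frames(aligned_bits, frame_length):
    """Cut aligned data into whole frames, ignoring a trailing partial frame."""
    usable = len(aligned_bits) - len(aligned_bits) % frame_length
    return [
        aligned_bits[i:i + frame_length]
        for i in range(0, usable, frame_length)
    ]


# Two stray bits precede three 4-bit frames in this made-up stream.
stream = [0, 1] + [1, 0, 0, 0] * 3
frames = split_frames(align_frames(stream, bit_shift=2), frame_length=4)
```

After the leading bits are dropped, every frame handed to the decoder begins at bit position 0, so the decoder itself need not search for the frame boundary.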

Next, the input switch 15 and the output switch 16 of the decoder unit 12 are switched to select the decoder for the estimated speech coding method (step S8), and the process ends.

After estimating the frame length in step S2, the decoding device of the first embodiment proceeds to step S4 and narrows down the speech coding method based on the run-length histogram of the received data. If the speech coding method of the received data can already be narrowed down when the frame length is estimated in step S2, steps S4 to S7 may be omitted and the decoder for the speech coding method estimated from the frame length may be selected. In that case the data processing unit 11 cannot compensate for the synchronization shift, so the decoder unit 12 must do so; however, because steps S4 to S7 are omitted, the speech coding method of the received data can be estimated faster.

The decoding device of the first embodiment estimates the frame length of the speech frame by frequency analysis of the received data. If the estimated frame length matches the frame length defined by exactly one of the coding specifications whose standards are published, the speech coding method of that coding specification can be estimated to be the speech coding method of the received data. If the estimated frame length matches none of the frame lengths defined by the coding specifications whose standards are published, the received data can be estimated to have been encoded with an unknown speech coding method.
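The three outcomes of the frame-length comparison can be sketched as follows. The frequency analysis itself is not shown; the sketch starts from an already estimated frame length. The table contents, the specification names, and the function name `classify_by_frame_length` are hypothetical and do not correspond to real codecs.

```python
def classify_by_frame_length(estimated_length, published_frame_lengths):
    """Compare an estimated frame length with a table of published lengths.

    Returns (outcome, matches) where outcome is:
      "unique"    - exactly one specification uses this frame length
      "ambiguous" - several do, so further narrowing is needed
      "unknown"   - none do, so the coding method is treated as unknown
    """
    matches = sorted(
        name
        for name, length in published_frame_lengths.items()
        if length == estimated_length
    )
    if not matches:
        return "unknown", matches
    if len(matches) == 1:
        return "unique", matches
    return "ambiguous", matches


# Hypothetical table: frame length in bits per coding specification.
table = {"spec_A": 80, "spec_B": 160, "spec_C": 160}
unique_case = classify_by_frame_length(80, table)
ambiguous_case = classify_by_frame_length(160, table)
unknown_case = classify_by_frame_length(100, table)
```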

The decoding device of the first embodiment also creates a run-length histogram of the received data and compares it with reference histograms created in advance for each coding specification whose standard is published, thereby estimating the speech coding method of the received data. Even when the speech coding method cannot be narrowed down from the frame length of the speech frames of the received data, comparing the run-length histogram with the reference histograms can narrow it down to one. The comparison also increases the accuracy of the estimate.
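One possible way to build such a histogram is sketched below. The precise definition of the run-length histogram is given elsewhere in the specification and not in this passage, so the interpretation used here is an assumption: for each bit position, the values at that position are followed across consecutive frames, and the lengths of runs of identical values are counted. The function name `run_length_histograms` and the `max_run` cap are likewise hypothetical.

```python
from itertools import groupby


def run_length_histograms(frames, max_run):
    """Return one run-length histogram per bit position.

    hists[p][r - 1] counts runs of length r (capped at max_run) in the
    sequence formed by bit position p across consecutive frames.
    """
    frame_length = len(frames[0])
    hists = [[0] * max_run for _ in range(frame_length)]
    for pos in range(frame_length):
        column = [frame[pos] for frame in frames]
        for _, run in groupby(column):
            length = min(sum(1 for _ in run), max_run)
            hists[pos][length - 1] += 1
    return hists


# Four made-up 3-bit frames.
frames = [[1, 0, 1], [1, 1, 1], [1, 0, 0], [1, 0, 0]]
hists = run_length_histograms(frames, max_run=4)
```

Under this interpretation a bit position that rarely changes (such as a fixed header bit) produces long runs, while a position carrying rapidly varying parameters produces short ones, which is what makes the per-position pattern characteristic of a coding specification.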

Furthermore, the decoding device of the first embodiment takes the bit-position shift within the speech frame of the run-length histogram into account when determining its correlation with the reference histograms. Therefore, even when the decoding device and the encoding device are not synchronized, the decoding device can estimate the speech coding method of the received data. Because the decoding device supplies the decoder unit with received data from which the data corresponding to the bit-position shift within the speech frame has been discarded, the decoder unit need not synchronize the speech frames, which shortens the time until decoding begins.
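A shift-aware comparison of this kind can be sketched as a search over every cyclic rotation of the bit positions. This is an illustration under assumptions, not the method of the embodiment: the histograms are taken to be lists with one histogram per bit position, the comparison measure is a sum of absolute differences (smaller means closer), and `shifted_difference` and `estimate_bit_shift` are hypothetical names.

```python
def shifted_difference(received, reference, shift):
    """Compare histograms assuming true position 0 sits at received[shift]."""
    n = len(reference)
    total = 0
    for pos in range(n):
        rec = received[(pos + shift) % n]
        ref = reference[pos]
        total += sum(abs(a - b) for a, b in zip(rec, ref))
    return total


def estimate_bit_shift(received, reference):
    """Try every cyclic shift and return (best_shift, smallest_value)."""
    values = [
        shifted_difference(received, reference, shift)
        for shift in range(len(reference))
    ]
    best = min(range(len(values)), key=values.__getitem__)
    return best, values[best]


# Made-up per-position histograms; the received data is offset by one bit.
reference = [[3, 0], [0, 2], [1, 1]]
received = [[1, 1], [3, 0], [0, 2]]
shift, value = estimate_bit_shift(received, reference)
```

The shift that gives the smallest value serves two purposes at once: the value itself measures how well the coding specification fits, and the shift tells how many leading bits separate the start of the received data from the first frame boundary.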

Together, these allow the speech coding method of the received data to be estimated in a short time.

(Second Embodiment)
The processing of the data processing unit of the first embodiment may be implemented by a program.

FIG. 14 is a block diagram showing the configuration of the decoding device of the second embodiment.

As shown in FIG. 14, the decoding device 1 of the second embodiment includes a CPU 20, a main storage device 21, an auxiliary storage device 22, a recording medium interface device 23, a recording medium 24, a data input unit 10, a database 14, a decoder unit 12, and a DA converter 13, which are connected via an internal bus 30.

The recording medium 24 stores a program for implementing the functions of the data processing unit of the first embodiment. The CPU 20 reads the program recorded on the recording medium 24 into the main storage device 21 via the recording medium interface device 23, and executes processing according to the program read into the main storage device 21. The recording medium 24 may be a magnetic disk, a semiconductor memory, an optical disk, or another recording medium.

DESCRIPTION OF SYMBOLS
1 Decoding device
10 Data input unit
11 Data processing unit
12 Decoder unit
13 DA converter
14 Database
15 Input switch
16 Output switch
20 CPU
21 Main storage device
22 Auxiliary storage device
23 Recording medium interface unit
24 Recording medium

Claims

1. A decoding device comprising:
a data input unit that receives, from outside, encoded data encoded with an unknown speech coding method;
a decoder unit comprising a plurality of decoders that decode mutually different speech coding methods; and
a data processing unit that estimates a frame length of a speech frame by frequency analysis of the encoded data received from the data input unit, determines the speech coding method of the encoded data based on the estimated frame length, selects from the decoder unit the decoder corresponding to the determined speech coding method, and causes that decoder to decode the encoded data.

2. The decoding device according to claim 1, wherein the data processing unit creates a run-length histogram for each bit position of the speech frame of the encoded data, and determines the speech coding method of the encoded data by comparing it with run-length histograms for each bit position of the speech frame created in advance for each combination of speech coding method and bit rate whose standard is published.

3. The decoding device according to claim 2, wherein the data processing unit estimates a bit-position shift within the speech frame by comparing the run-length histogram for each bit position of the speech frame of the encoded data with the run-length histograms for each bit position of the speech frame created in advance for each combination of speech coding method and bit rate whose standard is published, and supplies to the decoder unit the encoded data from which data corresponding to the estimated bit-position shift has been discarded.

4. A speech coding method estimation method for a decoding device comprising a plurality of decoders that decode mutually different speech coding methods, the method comprising:
estimating a frame length of a speech frame by frequency analysis of received encoded data;
determining the speech coding method of the encoded data based on the estimated frame length; and
selecting the decoder corresponding to the determined speech coding method and causing that decoder to decode the encoded data.

5. The speech coding method estimation method according to claim 4, wherein a run-length histogram for each bit position of the speech frame of the encoded data is created, and the speech coding method of the encoded data is determined by comparing it with run-length histograms for each bit position of the speech frame created in advance for each combination of speech coding method and bit rate whose standard is published.

6. The speech coding method estimation method according to claim 5, wherein a bit-position shift within the speech frame is estimated by comparing the run-length histogram for each bit position of the speech frame of the encoded data with the run-length histograms for each bit position of the speech frame created in advance for each combination of speech coding method and bit rate whose standard is published, and
the encoded data from which data corresponding to the estimated bit-position shift has been discarded is supplied to the decoder.