JPWO2009038115A1 - Speech coding apparatus, speech coding method, and program

Info

Publication number: JPWO2009038115A1
Application number: JP2009533171A
Authority: JP
Inventors: 小澤 一範; 野村 俊之; 伊藤 博紀
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2007-09-21
Filing date: 2008-09-18
Publication date: 2011-01-06
Also published as: WO2009038115A1

Abstract

To reduce the degradation in sound quality that occurs when music or melodies are delivered to mobile terminals using a high-efficiency speech coding scheme. A speech coding apparatus includes an auditory masking shaping processing unit 120 that suppresses, among the signal components of an audio signal, those components that are perceptually unnecessary because of the auditory masking effect, and outputs the result; and a speech coding processing unit 130 that performs speech coding processing in which the output signal of the auditory masking shaping processing unit is compression-coded and output as a bitstream (FIG. 1).

Description

[Reference to related application]
The present invention is based upon and claims priority from Japanese Patent Application No. 2007-245547 (filed on September 21, 2007), the entire disclosure of which is incorporated herein by reference.
The present invention relates to a speech coding apparatus, a speech coding method, and a program for improving the sound quality of music signals, melody signals, and the like transmitted using a speech coding scheme.

In recent years, services that deliver music and melodies to mobile terminals have become widespread. Examples include a service in which, while a caller is waiting for the called party to answer, a speech processing apparatus on the network side plays a music signal to the caller's mobile phone as a ringback melody, and a service in which a speech processing apparatus delivers music content to a mobile phone.

To realize such services, the music signal or music content is compression-coded in advance with the same speech coding scheme as the one installed in the mobile terminal that plays it back (for example, the AMR coding scheme of Non-Patent Document 1), and is delivered as a bitstream.

Patent Document 1 is a document that attempts to improve sound quality, although it does not address the degradation that occurs when such music signals or music content are transmitted. Patent Document 1 discloses a speech decoding apparatus that receives and decodes the coded amplitudes and phases of a plurality of harmonics, and that includes amplitude partial suppression means for suppressing the amplitude of a decoded harmonic when that harmonic is perceptually masked by other harmonics. The document does not disclose a configuration that encodes the decoded speech.

Patent Document 2 discloses a speech coding apparatus and a speech decoding apparatus that include determination means for determining whether the input speech is a non-speech signal, and path selection means for selecting, according to the determination result, whether to pass the signal through an audibility correction filter. In that document, a non-speech signal means a data signal: when the input signal is non-speech (a data signal), it bypasses the audibility correction filter, while all other speech is output through the filter (see paragraphs 0032 and 0099). This document likewise does not disclose a configuration that encodes the decoded speech.

Patent Document 1: JP-A-6-332496
Patent Document 2: JP-A-9-50298
Non-Patent Document 1: 3GPP TS 26.090 v3.1.0, "AMR Speech Codec Transcoding Functions", 1999
Non-Patent Document 2: "Digital Coding of Waveforms: Principles and Applications to Speech and Video", Prentice-Hall, 1990
Non-Patent Document 3: "Multirate Systems and Filter Banks", Prentice-Hall, 1993
Non-Patent Document 4: "Psychoacoustics", Springer, 1999
Non-Patent Document 5: IEEE International Conference on Acoustics, Speech, and Signal Processing, 25.1.1, March 1985, pp. 937-940

The entire disclosures of Patent Documents 1 and 2 and Non-Patent Documents 1 to 5 above are incorporated herein by reference. The following analysis of the related art is given from the standpoint of the present invention.

CELP (Code-Excited Linear Prediction) speech coding schemes such as the AMR coding scheme mentioned above are, in principle, optimized for conversational speech. Compression-coding a speech signal with them causes only slight degradation in sound quality, but compression-coding a music signal degrades the sound quality considerably. Consequently, when melodies or music content are delivered using these speech coding schemes, the sound quality is considerably degraded when they are played back on a mobile terminal.

This is thought to be because a speech coding scheme optimized for speech signals cannot model some components of a music signal; compression coding turns those components into noise that is superimposed on the reproduced signal, and this noise is audible.

The present invention has been made in view of the problem described above. Its object is to provide a speech coding apparatus, a speech coding method, and a program that can reduce the degradation in sound quality when music or melodies are delivered to a mobile terminal that must receive a bitstream compression-coded with a speech coding scheme.

According to a first aspect of the present invention, there is provided a speech coding apparatus comprising: an auditory masking shaping processing unit that suppresses, among the signal components of an audio signal, those that are perceptually unnecessary because of the auditory masking effect, and outputs the result; and a speech coding unit that performs speech coding processing in which the output signal of the auditory masking shaping processing unit is compression-coded and output as a bitstream.

According to a second aspect of the present invention, there is provided a speech coding method in which a speech coding apparatus suppresses, among the signal components of an input audio signal, those that are perceptually unnecessary because of the auditory masking effect, and outputs the result; and in which the speech coding apparatus compression-codes the shaped signal, in which the perceptually unnecessary signal components have been suppressed, and outputs a bitstream.

According to a third aspect of the present invention, there is provided a program to be executed by a computer constituting a speech coding apparatus, the program causing the computer to execute: auditory masking shaping processing that suppresses, among the signal components of an input audio signal, those that are perceptually unnecessary because of the auditory masking effect, and outputs the result; and speech coding processing that compression-codes the shaped signal produced by the auditory masking shaping processing and outputs a bitstream.

According to the present invention, the degradation in sound quality can be reduced when music or melodies are delivered to a mobile terminal that must receive a bitstream compression-coded with a speech coding scheme. This is because the invention adopts a configuration that removes in advance the components that are perceptually unnecessary and the components that cause degradation.

FIG. 1 is a diagram showing the configuration of a speech coding apparatus according to a first embodiment of the present invention.
FIG. 2 is a block diagram showing an example configuration of the auditory masking shaping processing unit of FIG. 1.
FIG. 3 is a diagram showing the configuration of a speech coding apparatus according to a second embodiment of the present invention.

Explanation of symbols

100, 140: terminal
120: auditory masking shaping processing unit
122: frequency transform unit
124: smoothing unit
126: shaping unit
128: inverse frequency transform unit
130: speech coding processing unit
250_1, 250_2: switching unit

A speech coding apparatus comprising means for suppressing, among the signal components of an audio signal, those that are perceptually unnecessary because of the auditory masking effect and outputting the result, and means for compression-coding the output signal in which those components have been suppressed and outputting a bitstream, can be developed into the following forms.

The processing that suppresses the perceptually unnecessary signal components can be realized by, for each predetermined time interval of the decoded signal, removing the frequency components (maskees) that become perceptually unnecessary because of the presence of high-level signal components (maskers) on the frequency axis, and then returning the signal to the time axis for output.

The processing means that suppresses the perceptually unnecessary signal components can be composed of, for example, a frequency transform unit that frequency-transforms a block formed from the input audio signal; a smoothing unit that smooths the output signal of the frequency transform unit; a shaping unit that removes unnecessary frequency components from the output signal of the frequency transform unit, using the output signal of the smoothing unit as a masking threshold; and an inverse frequency transform unit that inverse-transforms the output signal of the shaping unit and outputs the shaped signal.

Instead of, or in combination with, the method of removing unnecessary frequency components from the output signal of the frequency transform unit using the masking threshold, a method can be used that removes low-level frequency components so that a predetermined number of frequency components remain on the frequency axis.

Frequency components in a predetermined band can also be made the target of removal.

The speech coding apparatus may further include a switching unit that analyzes the characteristics of the input audio signal and switches whether the signal is output via the auditory masking shaping processing unit. The switching unit may be configured so that, when the input audio signal has the characteristics of a music signal, the perceptually unnecessary signal components are suppressed before output.

Next, the best mode for carrying out the present invention is described in detail with reference to the drawings.

[First Embodiment]
FIG. 1 is a diagram showing the configuration of a speech coding apparatus according to the first embodiment of the present invention. Referring to FIG. 1, the speech coding apparatus comprises an auditory masking shaping processing unit 120 and a speech coding processing unit 130. The auditory masking shaping processing unit 120 and the speech coding processing unit 130 can be implemented as circuits, and can equally be realized by a program that causes a computer to function as each of these processing units.

The auditory masking shaping processing unit 120 applies processing based on psychoacoustic analysis, on the frequency axis, to the audio signal input from terminal 100. It suppresses the components judged to have no effect on perception and then returns the signal to the time axis for output.

The speech coding processing unit 130 receives the output signal of the auditory masking shaping processing unit 120, divides it at predetermined time intervals, applies speech coding processing, and outputs the compression-coded bitstream through terminal 140. For the speech coding, the AMR speech coding described in Non-Patent Document 1 can be used, for example, in which case the interval at which the output signal is divided is 20 ms. The entire contents of Non-Patent Document 1 are incorporated herein by reference.
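The 20 ms framing described above can be illustrated with a minimal sketch. The function name `split_into_frames`, the 8 kHz sampling rate (which makes 20 ms equal to 160 samples), and the zero-padding of the last frame are assumptions made for illustration; the text specifies only the 20 ms interval.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a sample sequence into consecutive fixed-length frames.

    Assumption: 8 kHz sampling, so a 20 ms frame holds 160 samples.
    A trailing partial frame is zero-padded to the full frame length.
    """
    frame_len = sample_rate_hz * frame_ms // 1000
    frames = []
    for start in range(0, len(samples), frame_len):
        frame = list(samples[start:start + frame_len])
        frame.extend([0.0] * (frame_len - len(frame)))  # pad the last frame
        frames.append(frame)
    return frames
```

Each returned frame would then be handed to the speech coder one at a time.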

Next, the detailed configuration of the auditory masking shaping processing unit 120 of FIG. 1 is described with reference to FIG. 2.

Referring to FIG. 2, the auditory masking shaping processing unit 120 according to this embodiment is composed of a frequency transform unit 122, a smoothing unit 124, a shaping unit 126, and an inverse frequency transform unit 128.

The frequency transform unit 122 converts the audio signal input from terminal 100 of FIG. 1 into components on the frequency axis to generate a transformed signal, and outputs it to the smoothing unit 124 and the shaping unit 126.

To generate the transformed signal, the frequency transform unit 122 groups a plurality of input signal samples into one block and applies a frequency transform to that block. Examples of the frequency transform include the Fourier transform, the cosine transform, and the Karhunen-Loève (KL) transform. Techniques related to the specific computation of these transforms are disclosed in Non-Patent Document 2, the entire contents of which are incorporated herein by reference.

The frequency transform unit 122 may also weight each block of input signal samples with a window function when generating the transformed signal. Known window functions include the Hamming, Hanning (Hann), Kaiser, and Blackman windows. More complex window functions can also be used. Techniques related to these window functions are disclosed in Non-Patent Document 3, the entire contents of which are incorporated herein by reference.

When the frequency transform unit 122 forms blocks from the input signal samples, the blocks may also be made to overlap. For example, with an overlap of 50% of the block length, the last 50% (second half) of the signal samples belonging to one block is also the first 50% (first half) of the signal samples belonging to the next block, so those samples are used in more than one block. Techniques related to overlapped blocking and transforms are disclosed in Non-Patent Document 3.
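The windowing and 50% overlap described above can be sketched as follows. The choice of a periodic Hann window and the names `hann_window` and `overlapped_blocks` are assumptions for illustration; the text allows other windows and overlap ratios.

```python
import math


def hann_window(length):
    """Periodic Hann window: w[n] = 0.5 - 0.5 * cos(2*pi*n / length)."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * n / length)
            for n in range(length)]


def overlapped_blocks(samples, block_len, overlap=0.5):
    """Return windowed blocks of block_len samples.

    With overlap=0.5, the second half of each block covers the same
    input samples as the first half of the next block.
    """
    hop = max(1, int(block_len * (1.0 - overlap)))
    window = hann_window(block_len)
    blocks = []
    for start in range(0, len(samples) - block_len + 1, hop):
        block = samples[start:start + block_len]
        blocks.append([x * w for x, w in zip(block, window)])
    return blocks
```

With this particular window and a 50% overlap, the weights applied to any sample shared by two adjacent blocks add up to one, which is convenient when the blocks are later recombined.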

Furthermore, the frequency transform unit 122 may be implemented as a band-splitting filter bank, which consists of a plurality of band-pass filters and splits the received input signal into a plurality of frequency bands. The bands of the filter bank may be equally or unequally spaced. With unequal spacing, the low range can be split into narrow bands, giving lower time resolution, and the high range into wide bands, giving higher time resolution. Typical examples of unequal splitting are octave splitting, in which the bandwidth is successively halved toward the low range, and critical-band splitting, which corresponds to human auditory characteristics. Techniques related to band-splitting filter banks and their design are disclosed in Non-Patent Document 3.
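As a small illustration of octave splitting, the sketch below computes only the band edges, not the filters themselves. The function name `octave_band_edges` and the convention that the lowest band extends down to 0 Hz are assumptions.

```python
def octave_band_edges(nyquist_hz, num_bands):
    """Return (low, high) band edges, lowest band first.

    Each band is half as wide as the band above it; the lowest band
    absorbs everything down to 0 Hz (an assumed convention).
    """
    edges = []
    high = float(nyquist_hz)
    for _ in range(num_bands - 1):
        low = high / 2.0
        edges.append((low, high))
        high = low
    edges.append((0.0, high))
    return list(reversed(edges))
```

For example, with a 4000 Hz upper limit and four bands, the edges fall at 500, 1000, and 2000 Hz.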

The smoothing unit 124 smooths the transformed signal input from the frequency transform unit 122 and outputs the smoothed transformed signal to the shaping unit 126. One smoothing method is the method that exploits the auditory masking effect disclosed in Non-Patent Document 4. For example, the smoothed transformed signal can be generated by convolving the transformed signal, along the frequency axis, with a function that describes how a given frequency component masks neighboring frequency components. The entire contents of Non-Patent Document 4 are incorporated herein by reference.
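The convolution along the frequency axis can be sketched as below. The two-sided exponential spreading function and its slope values are hypothetical placeholders, not the function of Non-Patent Document 4; the slopes are chosen only so that masking extends further toward higher frequencies than toward lower ones.

```python
def spread_energy(energy, lower_slope=0.1, upper_slope=0.3):
    """Convolve a per-bin energy spectrum with a spreading function.

    Hypothetical spreading function: a masker at bin k contributes
    energy[k] * slope ** |j - k| to bin j, using upper_slope for
    j >= k and lower_slope for j < k.
    """
    n_bins = len(energy)
    smoothed = [0.0] * n_bins
    for k in range(n_bins):        # masker bin
        for j in range(n_bins):    # bin receiving the masking
            distance = j - k
            slope = upper_slope if distance >= 0 else lower_slope
            smoothed[j] += energy[k] * slope ** abs(distance)
    return smoothed
```

A practical implementation would normally work on a perceptual frequency scale and in decibels; this sketch stays on linear bins to keep the convolution itself visible.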

As a simpler smoothing method, S2(n) may be computed by Equation 1 below, and a signal obtained by lowering the energy level of S2(n) may be used as the smoothed signal. Here, max(x, y) denotes the larger of x and y, E(n) is the energy of the transformed signal, and N is the block size.

[Equation 1]
S1(0) = E(0)
S1(n) = max(E(n), a × S1(n-1))  (n = 1, ..., N-1)
S2(N-1) = S1(N-1)
S2(n) = max(S1(n), b × S2(n+1))  (n = N-2, ..., 0)

The smoothed transformed signal computed in this way is a smoothed version of the energy level of the original transformed signal, and can be used as a masking threshold. That is, frequency components whose energy level is below this masking threshold are regarded as perceptually inaudible and become candidates for removal.
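The two-pass recursion of Equation 1 can be sketched as follows. The text does not give values for a and b, nor the amount by which the level of S2(n) is lowered, so the defaults below and the `attenuation` parameter are assumptions for illustration only.

```python
def smooth_energy(energy, a=0.5, b=0.5):
    """Compute S2(n) from E(n) following Equation 1.

    a and b are decay factors for the upward and downward passes;
    the default values are placeholders, not taken from the text.
    """
    n = len(energy)
    if n == 0:
        return []
    s1 = [0.0] * n
    s1[0] = energy[0]
    for i in range(1, n):              # pass toward higher bins
        s1[i] = max(energy[i], a * s1[i - 1])
    s2 = [0.0] * n
    s2[n - 1] = s1[n - 1]
    for i in range(n - 2, -1, -1):     # pass toward lower bins
        s2[i] = max(s1[i], b * s2[i + 1])
    return s2


def masking_threshold(energy, a=0.5, b=0.5, attenuation=0.1):
    """Lower the level of S2(n) by an assumed constant factor."""
    return [attenuation * value for value in smooth_energy(energy, a, b)]
```

Because S2(n) is never smaller than E(n), some lowering of its level is needed before it is useful as a threshold; otherwise only bins whose energy equals S2(n) would survive the comparison.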

The shaping unit 126 shapes the transformed signal using the smoothed transformed signal input from the smoothing unit 124. More specifically, the shaping unit 126 shapes the transformed signal by removing the frequency components whose energy level is lower than that of the smoothed transformed signal.

In doing so, the shaping unit 126 may shape the transformed signal by keeping only a predetermined number of frequency components, taken in descending order of the ratio of the transformed signal's energy level to that of the smoothed transformed signal, and removing the others. The shaping unit 126 may additionally apply band limiting by removing only the low range, only the high range, or both.
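The three shaping options above can be combined in one sketch. The function name `shape_spectrum`, its parameters, and the decision to apply the count limit and band limit on top of the threshold test are assumptions; the text also permits using the count limit in place of the threshold test.

```python
def shape_spectrum(coeffs, threshold, keep_count=None, band=None):
    """Zero the transform coefficients judged unnecessary.

    coeffs:     real or complex transform coefficients, one per bin
    threshold:  masking threshold (energy) per bin
    keep_count: optional; keep at most this many bins, chosen by the
                highest energy-to-threshold ratio
    band:       optional (low_bin, high_bin), inclusive; bins outside
                this range are removed
    """
    energy = [abs(c) ** 2 for c in coeffs]
    keep = [e >= t for e, t in zip(energy, threshold)]

    if keep_count is not None:
        def ratio(k):
            if threshold[k] > 0.0:
                return energy[k] / threshold[k]
            return float("inf") if energy[k] > 0.0 else 0.0

        ranked = sorted(range(len(coeffs)), key=ratio, reverse=True)
        strongest = set(ranked[:keep_count])
        keep = [flag and k in strongest for k, flag in enumerate(keep)]

    if band is not None:
        low_bin, high_bin = band
        keep = [flag and low_bin <= k <= high_bin
                for k, flag in enumerate(keep)]

    return [c if flag else 0.0 for c, flag in zip(coeffs, keep)]
```

For a real-valued input signal transformed with a Fourier transform, a bin and its mirror bin would need to be kept or removed together so that the inverse transform stays real; the sketch leaves that to the caller.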

The inverse frequency transform unit 128 inverse-transforms the shaped transformed signal to generate a shaped signal, and outputs it as the output signal of the auditory masking shaping processing unit 120. The inverse transform executed in the inverse frequency transform unit 128 is preferably the inverse of the transform applied by the frequency transform unit 122. For example, when the frequency transform unit 122 groups a plurality of input signal samples into one block and applies a frequency transform to that block, the inverse frequency transform unit 128 applies the corresponding inverse transform to the same number of samples. When the frequency transform unit 122 allows the blocks to overlap, the inverse frequency transform unit 128 correspondingly applies the same overlap to the inverse-transformed signal. Furthermore, when the frequency transform unit 122 is implemented as a band-splitting filter bank, the inverse frequency transform unit 128 is implemented as a band-synthesis filter bank. Techniques related to band-synthesis filter banks and their design are disclosed in Non-Patent Document 3.
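The matching of forward transform, inverse transform, and overlap can be sketched with a block-wise overlap-add loop. The naive DFT, the periodic Hann window applied once at analysis, the 50% overlap, and the name `process_overlap_add` are all assumptions for illustration; `modify` stands in for the shaping step.

```python
import cmath
import math


def dft(block):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(block)
    return [sum(block[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]


def inverse_dft(spectrum):
    """Naive inverse DFT; returns the real part of each sample."""
    n = len(spectrum)
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                 for k in range(n)) / n).real for t in range(n)]


def process_overlap_add(samples, block_len, modify):
    """Window, transform, modify, inverse-transform, and overlap-add.

    Blocks overlap by 50%. With a periodic Hann window the weights of
    overlapping blocks sum to one, so an identity `modify` reproduces
    every input sample that is covered by two blocks.
    """
    hop = block_len // 2
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / block_len)
              for n in range(block_len)]
    output = [0.0] * len(samples)
    for start in range(0, len(samples) - block_len + 1, hop):
        block = [samples[start + n] * window[n] for n in range(block_len)]
        shaped = inverse_dft(modify(dft(block)))
        for n in range(block_len):
            output[start + n] += shaped[n]   # overlap-add
    return output
```

Calling `process_overlap_add(samples, 8, lambda spectrum: spectrum)` leaves the interior of the signal unchanged, which is a quick way to check that the forward and inverse stages are consistent before a real shaping function is plugged in.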

The shaped signal generated in this way is a signal from which, as described above, the smoothing unit 124 and the shaping unit 126 have removed the signal components that are not perceived (the perceptually unnecessary components) by exploiting auditory characteristics such as the masking effect, and which has then been returned to the time axis.

Therefore, when the auditory masking shaping processing of the auditory masking shaping processing unit 120 is used as preprocessing for CELP analysis-by-synthesis coding, typified by the AMR coding scheme (details are disclosed in Non-Patent Document 5, the entire contents of which are incorporated herein by reference), the coder analyzes a shaped signal from which perceptually unnecessary components have been removed. As a result, parameters such as the linear prediction coefficients and the pitch period become stable, and the sound quality of the decoded signal improves.

[Second Embodiment]
Next, a second embodiment of the present invention, which modifies the first embodiment described above, is described.

FIG. 3 is a block diagram showing the configuration of a speech coding apparatus according to the second embodiment of the present invention. In FIG. 3, components bearing the same numbers as in FIG. 1 and FIG. 2 operate in the same way as in those figures, so their description is omitted.

In FIG. 3, the switching unit 250_1 divides the audio signal input from terminal 100 at predetermined time intervals, extracts various feature parameters, and determines from the obtained feature parameters whether auditory masking shaping processing should be applied. For example, when the switching unit 250_1 judges from a combination of the feature parameter values that the signal is strongly music-like (has the characteristics of a music signal), it outputs the audio signal input from terminal 100 to the auditory masking shaping processing unit 120.

On the other hand, when it judges from a combination of the feature parameter values that the signal is strongly speech-like (weakly music-like), the switching unit 250_1 outputs the audio signal input from terminal 100 to the switching unit 250_2. The switching unit 250_2 switches in synchronization with the switching unit 250_1.
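The routing performed by the two synchronized switching units can be sketched as a single conditional. The text does not specify which feature parameters are used, so the classifier is passed in as a hypothetical callable `looks_like_music`; `shape` and `encode` are likewise placeholders for the shaping unit and the speech coder.

```python
def encode_frame(frame, looks_like_music, shape, encode):
    """Route one frame as the two synchronized switches would.

    looks_like_music, shape, and encode are caller-supplied callables
    (hypothetical names): a music/speech decision, the auditory masking
    shaping step, and the speech coder, respectively.
    """
    if looks_like_music(frame):
        frame = shape(frame)   # music-like: shape before coding
    return encode(frame)       # speech-like: bypass the shaping step
```

Making the decision once per frame and using it for both the input side and the output side is what keeps the two switches in step.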

As described above, this embodiment makes it possible to identify music-like signals accurately and feed the audio signal input from terminal 100 to the auditory masking shaping processing unit 120 in those cases, which further reduces the degradation in sound quality at the mobile terminal. In addition, because there is no longer any need to allow for strongly speech-like signals being input to the auditory masking shaping processing unit 120, the processing in that unit can be made more efficient.

Preferred embodiments of the present invention have been described above. However, the present invention is not limited to these embodiments, and further modifications, substitutions, and adjustments can be made without departing from its basic technical concept.

以上、本発明の好適な実施形態を説明したが、本発明は、上記した各実施形態に限定されるものではなく、本発明の基本的技術的思想を逸脱しない範囲で、更なる変形・置換・調整を加えることができる。
［付記１−国際出願時請求項１１］
入力音声信号より構成したブロックを周波数変換し、
前記周波数変換した信号を平滑化し、
前記平滑化した信号をマスキング閾値として用いて、前記周波数変換した信号から不要な周波数成分を除去し、
前記不要な周波数成分を除去した信号を逆変換することにより、前記聴覚的に不要となる信号成分を抑圧する請求項９又は１０に記載の音声符号化方法。
［付記２−国際出願時請求項１２］
周波数軸上の予め定める個数の周波数成分が残るよう周波数成分を除去することにより前記聴覚的に不要となる信号成分を抑圧する請求項９乃至１１いずれか一に記載の音声符号化方法。
［付記３−国際出願時請求項１３］
予め定める帯域の周波数成分を除去することにより前記聴覚的に不要となる信号成分を抑圧する請求項９乃至１２いずれか一に記載の音声符号化方法。
［付記４−国際出願時請求項１４］
入力音声信号の特徴を分析し、前記聴覚的に不要となる信号成分を抑圧するか否かを判定してから、前記聴覚的に不要となる信号成分を抑圧する請求項９乃至１３いずれか一に記載の音声符号化方法。
［付記５−国際出願時請求項１５］
前記入力音声信号が音楽信号の特徴を有する場合に、前記聴覚的に不要となる信号成分を抑圧する請求項１４に記載の音声符号化方法。
［付記６−国際出願時請求項１６］
音声符号化装置を構成するコンピュータに実行させるプログラムであって、
入力音声信号の信号成分のうち、聴覚マスキング効果により聴覚的に不要となる信号成分を抑圧して出力する聴覚マスキング整形処理と、
前記聴覚マスキング整形処理がなされた整形信号を音声圧縮符号化してビットストリームを出力する音声符号化処理と、を前記コンピュータに実行させるプログラム。

The preferred embodiments of the present invention have been described above. However, the present invention is not limited to the above-described embodiments, and further modifications and replacements are possible without departing from the basic technical idea of the present invention. -Adjustments can be made.
[Appendix 1-Claim 11 at the time of international application]
The block composed of the input audio signal is frequency converted,
Smoothing the frequency converted signal;
Using the smoothed signal as a masking threshold, remove unnecessary frequency components from the frequency converted signal,
The speech encoding method according to claim 9 or 10, wherein the auditory unnecessary signal component is suppressed by inversely transforming the signal from which the unnecessary frequency component is removed.
[Appendix 2-Claim 12 at the time of international application]
The speech coding method according to any one of claims 9 to 11, wherein the audibly unnecessary signal component is suppressed by removing the frequency component so that a predetermined number of frequency components remain on the frequency axis.
[Appendix 3-Claim 13 at the time of international application]
The speech coding method according to any one of claims 9 to 12, wherein a signal component that is audibly unnecessary is suppressed by removing a frequency component of a predetermined band.
[Appendix 4-International application claim 14]
14. The auditory unnecessary signal component is suppressed after analyzing characteristics of an input audio signal and determining whether or not to suppress the auditory unnecessary signal component. The speech encoding method described in 1.
[Appendix 5-Claim 15 at the time of international application]
The speech encoding method according to claim 14, wherein the audibly unnecessary signal component is suppressed when the input audio signal has characteristics of a music signal.
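The switching described in Appendices 4 and 5 amounts to routing a block through the shaping step only when it is judged to be music. The sketch below shows only that control flow. The music detector, the shaper, and the encoder are passed in as caller-supplied functions because their internals are not given here; the names `encode_with_switch`, `is_music`, `shape`, and `encode` are hypothetical.

```python
from typing import Callable, Sequence

Block = Sequence[float]


def encode_with_switch(block: Block,
                       is_music: Callable[[Block], bool],
                       shape: Callable[[Block], Block],
                       encode: Callable[[Block], bytes]) -> bytes:
    """Encode one block, applying masking shaping only to music-like input."""
    # Music-like blocks go through the shaper; other blocks bypass it.
    signal = shape(block) if is_music(block) else block
    return encode(signal)
```

Keeping the detector as a parameter lets the same routing be tried with any classifier without changing the encoder path.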
[Appendix 6-Claim 16 at the time of international application]
A program to be executed by a computer constituting a speech encoding apparatus, the program causing the computer to execute:
auditory masking shaping processing that suppresses, among the signal components of an input audio signal, signal components that are audibly unnecessary due to the auditory masking effect, and outputs the result; and
speech encoding processing that compresses and encodes the shaped signal produced by the auditory masking shaping processing and outputs a bitstream.

Claims

A speech encoding apparatus comprising:
an auditory masking shaping processing unit that suppresses, among the signal components of an audio signal, signal components that are audibly unnecessary due to the auditory masking effect, and outputs the result; and
a speech encoding processing unit that performs speech encoding processing for compressing and encoding the output signal of the auditory masking shaping processing unit and outputting a bitstream.

The speech coding apparatus according to claim 1, wherein the auditory masking shaping processing unit, for each predetermined time interval, removes audibly unnecessary frequency components on the frequency axis and outputs the result after converting it back to the time axis.

The speech coding apparatus according to claim 1, wherein the auditory masking shaping processing unit comprises:
a frequency transform unit that frequency-transforms a block composed of the input audio signal;
a smoothing unit that smooths the output signal of the frequency transform unit;
a shaping unit that removes unnecessary frequency components from the output signal of the frequency transform unit, using the output signal of the smoothing unit as a masking threshold; and
a frequency inverse transform unit that inversely transforms the output signal of the shaping unit and outputs a shaped signal.

The speech coding apparatus according to claim 1, wherein the auditory masking shaping processing unit removes frequency components such that a predetermined number of frequency components on the frequency axis remain.

The speech coding apparatus according to claim 1, wherein the auditory masking shaping processing unit removes a frequency component in a predetermined band.

The speech coding apparatus according to any one of claims 1 to 5, further comprising a switching unit that analyzes characteristics of the input speech signal and switches whether or not the signal is output via the auditory masking shaping processing unit.

The speech coding apparatus according to claim 6, wherein the switching unit selects an output to the auditory masking shaping processing unit when the input speech signal has a characteristic of a music signal.

The speech coding apparatus according to any one of claims 1 to 6, wherein the speech coding apparatus functions as a speech processing apparatus that distributes a music signal to a mobile phone terminal.

A speech encoding method comprising:
suppressing, among the signal components of an input audio signal, signal components that are audibly unnecessary due to the auditory masking effect, and outputting a shaped signal; and
compressing and encoding the shaped signal in which the audibly unnecessary signal components are suppressed, and outputting a bitstream.

The speech encoding method according to claim 9, wherein the audibly unnecessary signal component is suppressed by, for each predetermined time interval, removing audibly unnecessary frequency components on the frequency axis and outputting the result after converting it back to the time axis.

The speech encoding method according to claim 9 or 10, wherein the audibly unnecessary signal component is suppressed by:
frequency-transforming a block composed of the input audio signal;
smoothing the frequency-transformed signal;
removing unnecessary frequency components from the frequency-transformed signal, using the smoothed signal as a masking threshold; and
inversely transforming the signal from which the unnecessary frequency components have been removed.

The speech coding method according to any one of claims 9 to 11, wherein the audibly unnecessary signal component is suppressed by removing the frequency component so that a predetermined number of frequency components remain on the frequency axis.

The speech coding method according to any one of claims 9 to 12, wherein a signal component that is audibly unnecessary is suppressed by removing a frequency component of a predetermined band.

The speech encoding method according to any one of claims 9 to 13, wherein characteristics of the input audio signal are analyzed to determine whether or not to suppress the audibly unnecessary signal component, and the audibly unnecessary signal component is then suppressed.

The speech encoding method according to claim 14, wherein the audibly unnecessary signal component is suppressed when the input audio signal has characteristics of a music signal.

A program to be executed by a computer constituting a speech encoding apparatus, the program causing the computer to execute:
auditory masking shaping processing that suppresses, among the signal components of an input audio signal, signal components that are audibly unnecessary due to the auditory masking effect, and outputs the result; and
speech encoding processing that compresses and encodes the shaped signal produced by the auditory masking shaping processing and outputs a bitstream.