KR101352608B1

KR101352608B1 - A method for extending bandwidth of vocal signal and an apparatus using it

Info

Publication number: KR101352608B1
Application number: KR1020120036878A
Authority: KR
Inventors: 김홍국; 박남인
Original assignee: 광주과학기술원
Priority date: 2011-12-07
Filing date: 2012-04-09
Publication date: 2014-01-17
Also published as: KR20130063990A

Abstract

A method of extending the bandwidth of a speech signal according to an embodiment of the present invention comprises: decoding a received speech signal and converting it into the frequency domain; normalizing the converted speech signal; determining voiced or unvoiced sections from the received speech signal; extracting, from the normalized speech signal, a first section containing the harmonic components of the voiced section, based on the section determined to be voiced; extracting a second section from the normalized speech signal based on the cross-correlation between the section determined to be unvoiced and the normalized speech signal; generating a high-band speech signal based on the first section and the second section; and synthesizing the generated high-band speech signal with the converted speech signal and outputting the result as a wideband speech signal.

Description

A method for extending bandwidth of vocal signal and an apparatus using it

The present invention relates to a method and an apparatus for extending the bandwidth of a speech signal, and more particularly to such a method and apparatus capable of improving performance.

In most voice communication systems, the speech bandwidth is limited to 0.3-3.4 kHz. This band contains both voiced and unvoiced sounds, but because it is narrow, the sound quality is lower than that of the original sound. Wideband speech receivers have been proposed to suppress this degradation. Wideband speech, with a bandwidth of 50 Hz to 7 kHz, can represent the entire speech band, including voiced and unvoiced sounds, and offers greater naturalness and intelligibility than narrowband speech. However, voice communication over the public switched telephone network (PSTN), Internet telephony (VoIP, VoWiFi), and voice applications on smartphones is still served by narrowband speech codecs, so replacing these codecs with wideband codecs imposes a heavy burden in time and cost.

In response to this problem, methods have been proposed that receive narrowband speech and convert it into a wideband signal at the decoding stage, and several approaches to extending the speech bandwidth have accordingly been put forward.

First, there is a method that allocates additional bits for the high band. This approach uses side information: the speech band is extended using encoding information transmitted from the encoder. The encoder analyzes the high-band information of the input signal to generate and transmit additional information, and the decoder generates a high-band signal from the transmitted additional information and the low-band signal. For example, G.729.1, a wideband speech codec, can code at 12 different layers from 8 kbit/s to 32 kbit/s. Because the base layer of G.729.1 is G.729, a representative narrowband codec, narrowband sound quality is guaranteed in the 8 kbit/s mode. From the 14 kbit/s mode, called Layer 3, the encoder generates wideband speech using the bandwidth extension technique described above. To use the bandwidth extension technique of Layer 3 of G.729.1, the encoder allocates additional bits so that a high-band signal can be produced at decoding. However, this kind of bandwidth extension not only overloads the network by allocating additional bits but also requires the encoder to be modified extensively.

Methods that generate a high-band signal from the low-band signal at the decoder, without allocating additional bits, have also been proposed. For example, estimation-based approaches using pattern recognition techniques such as the hidden Markov model (HMM) and the Gaussian mixture model (GMM) have been proposed. However, pattern recognition requires a training process, and its performance may vary with the language. Furthermore, prediction- or estimation-based methods involve additional bits and greatly increase the amount of computation, which makes it difficult to process speech received in real time quickly and effectively. In addition, the bandwidth extension methods proposed so far without additional bit allocation suffer from poor output sound quality.

An object of the present invention is to provide a method and an apparatus for extending the bandwidth of a speech signal that can extend the bandwidth of a narrowband speech signal quickly and effectively.

Another object is to provide an effective method and apparatus for extending the bandwidth of a speech signal that can improve the sound quality of the bandwidth-extended wideband speech signal without additional bits, thereby reducing cost while improving performance.

To achieve the above objects, a method of extending the bandwidth of a speech signal according to an embodiment of the present invention, which receives a speech signal and extends its bandwidth, comprises: decoding the received speech signal and converting it into the frequency domain; normalizing the converted speech signal; determining voiced or unvoiced sections from the received speech signal; extracting, from the normalized speech signal, a first section containing the harmonic components of the voiced section, based on the section determined to be voiced; extracting a second section from the normalized speech signal based on the correlation between the section determined to be unvoiced and the normalized speech signal; generating a high-band speech signal based on the first section and the second section; and synthesizing the generated high-band speech signal with the converted speech signal and outputting the result as a wideband speech signal.

Likewise, an apparatus for extending the bandwidth of a speech signal according to an embodiment of the present invention comprises: a receiver that receives a speech signal; a decoder that decodes the received speech signal; a domain converter that converts the decoded speech signal into the frequency domain; a normalizer that normalizes the converted speech signal; a determination unit that determines voiced or unvoiced sections from the received speech signal; a voiced sound processor that extracts, from the normalized speech signal, a first section containing the harmonic components of the voiced section, based on the section determined to be voiced; an unvoiced sound processor that extracts a second section from the normalized speech signal based on the correlation between the section determined to be unvoiced and the normalized speech signal; a high-band generator that generates a high-band speech signal based on the first and second sections; and an output unit that synthesizes the generated high-band speech signal with the converted speech signal and outputs the result as a wideband speech signal.

According to an embodiment of the present invention, a high-quality wideband speech signal can be output from a narrowband speech signal without additional bits.

In particular, by distinguishing voiced from unvoiced sound and applying a different operation to each, the sound quality can be improved while the amount of computation is reduced.

Moreover, according to an embodiment of the present invention, a conventional narrowband speech system can be upgraded to a wideband system without changing the configuration of its decoder, which reduces the cost of providing wideband speech services.

FIG. 1 is a diagram schematically illustrating an apparatus for extending the bandwidth of a speech signal according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the apparatus for extending the bandwidth of a speech signal according to an embodiment of the present invention in more detail.
FIG. 3 is a flowchart illustrating a method of extending the bandwidth of a speech signal according to an embodiment of the present invention.
FIG. 4 is a diagram comparing experimental results obtained with the method of extending the bandwidth of a speech signal according to an embodiment of the present invention.
FIGS. 5 to 8 are graphs showing the individual waveforms of the experimental results illustrated in FIG. 4.

The following merely illustrates the principles of the invention. Those skilled in the art will therefore be able to devise various apparatuses that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. Furthermore, all conditional terms and embodiments listed herein are, in principle, expressly intended only to aid understanding of the concept of the invention, and should be understood as not limiting it to the specifically listed embodiments and conditions.

It should also be understood that every detailed description reciting the principles, aspects, and embodiments of the invention, as well as specific embodiments, is intended to encompass their structural and functional equivalents. Such equivalents should be understood to include not only currently known equivalents but also equivalents developed in the future, that is, all elements invented to perform the same function regardless of structure.

Thus, for example, the block diagrams herein should be understood as conceptual views of illustrative circuitry embodying the principles of the invention. Similarly, all flowcharts, state transition diagrams, pseudocode, and the like should be understood to represent various processes that can be substantially represented on a computer-readable medium and executed by a computer or processor, whether or not such a computer or processor is explicitly shown.

The functions of the various elements shown in the drawings, including functional blocks labeled as processors or similar concepts, may be provided through dedicated hardware as well as hardware capable of executing software in association with appropriate software. When provided by a processor, the functions may be provided by a single dedicated processor, a single shared processor, or a plurality of individual processors, some of which may be shared.

Moreover, the explicit use of terms such as processor, controller, or similar concepts should not be construed as referring exclusively to hardware capable of executing software, and should be understood to implicitly include, without limitation, digital signal processor (DSP) hardware as well as read-only memory (ROM), random access memory (RAM), and non-volatile memory for storing software. Other well-known, conventional hardware may also be included.

In the claims hereof, an element expressed as a means for performing a function described in the detailed description is intended to encompass any way of performing that function, including, for example, a combination of circuit elements that performs the function, or software in any form, including firmware, microcode, and the like, combined with appropriate circuitry for executing that software to perform the function. Because the invention defined by such claims combines the functions provided by the variously recited means in the manner the claims require, any means capable of providing those functions should be understood as equivalent to those identified herein.

The above objects, features, and advantages will become more apparent from the following detailed description taken in conjunction with the accompanying drawings, so that a person of ordinary skill in the art to which the invention pertains can readily practice its technical idea. In describing the invention, detailed descriptions of related known art are omitted where they are judged likely to obscure the gist of the invention unnecessarily.

Hereinafter, a preferred embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically illustrating an apparatus for extending the bandwidth of a speech signal according to an embodiment of the present invention.

Referring to FIG. 1, the bandwidth extension apparatus 100 according to an embodiment of the present invention receives a narrowband speech signal and outputs a wideband speech signal with improved sound quality. The bandwidth extension apparatus 100 may be used at the decoding stage of a narrowband speech receiver, and may generate and output a wideband speech signal in which the harmonic components of the narrow band are preserved. The bandwidth extension apparatus 100 may distinguish voiced from unvoiced sound using information obtained while decoding the narrowband speech signal. For voiced sound, the apparatus uses pitch information to obtain a wideband speech signal in which the harmonic components are preserved; for unvoiced sound, it uses the signal with the highest correlation to obtain the wideband speech signal. By then adjusting the energy of the obtained wideband speech signal, it can output a wideband speech signal with improved sound quality without adding bits.

FIG. 2 is a block diagram illustrating the apparatus for extending the bandwidth of a speech signal according to an embodiment of the present invention in more detail.

As shown in FIG. 2, the bandwidth extension apparatus 100 according to an embodiment of the present invention comprises: a decoder 110 that receives a speech signal and decodes it into processable data; a domain converter 120 that converts the decoded speech signal into the frequency domain; a normalizer 130 that normalizes the domain-converted speech signal; a low-band inverse transformer 140 that inversely transforms the normalized speech signal into a low-band speech signal in the time domain; a determination unit 150 that determines the voiced or unvoiced sections of the domain-converted speech signal; a voiced sound processor 151 that obtains a first section, containing the harmonic section, from the section determined to be voiced; an unvoiced sound processor 152 that obtains a second section, having the highest correlation, from the section determined to be unvoiced; an energy adjustment unit 160 that performs energy scaling on the first or second section obtained by the voiced or unvoiced processing; a high-band inverse transformer 170 that inversely transforms the energy-adjusted section and outputs it as a high-band speech signal in the time domain; and a speech signal synthesizer 180 that combines the low-band and high-band speech signal outputs and outputs a wideband speech signal.

The decoder 110 receives a speech signal and decodes it into processable data. The speech signal can be decoded in various ways. For example, the decoder 110 may use G.729, a well-known narrowband coding scheme [ITU-T Recommendation G.729, Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)]. The decoder 110 may also include a speech decoder of the CELP (Code Excited Linear Prediction) type that uses spectrum analysis.

In one embodiment, the decoder 110 may extract pitch information or the frequency gradient of the speech signal during decoding and send it to the determination unit 150. For example, the decoder 110 may obtain the frequency gradient from the first-order reflection coefficient used to decode the received speech signal with G.729, and send it to the determination unit 150.

In one embodiment, the decoder 110 may also decode the bitstream of the speech signal into a narrowband speech signal. For example, for a G.729 speech signal processed by the decoder 110, the frame size N, that is, the number of samples in one frame, may be 80.

The domain converter 120 converts the decoded speech signal into the frequency domain. That is, it may produce frequency-domain data from the decoded speech signal.

For example, the domain converter 120 may convert the speech signal into the frequency domain using the modified discrete cosine transform (MDCT). The domain converter 120 receives the decoded speech signal as a time-domain input signal, converts it into a frequency-domain signal, and performs an inter-block overlap operation. A particular advantage of the MDCT is that the bit rate does not increase even though the overlap operation is performed. As noted above, when the number of samples per frame N of the G.729 speech signal is 80, the domain converter 120 may be a 2N-point MDCT that outputs 2N = 160 frequency band points, and the coefficients for them, from one decoded speech frame.
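The transform step can be illustrated with a minimal sketch. The text does not give the MDCT formula or the window, so the code below assumes the textbook unwindowed MDCT, which maps a block of 2N time samples to N coefficients, and assumes that each block is built from the previous and the current decoded frame (N = 80). The function names `mdct` and `overlapped_blocks` are hypothetical and are not taken from the patent.

```python
import math


def mdct(block):
    """Textbook direct-form MDCT (no window): 2N samples in, N coefficients out."""
    two_n = len(block)
    if two_n % 2:
        raise ValueError("block length must be even")
    n_half = two_n // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n, x in enumerate(block):
            angle = math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
            acc += x * math.cos(angle)
        coeffs.append(acc)
    return coeffs


def overlapped_blocks(frames):
    """Yield 2N-sample blocks made of the previous frame followed by the current one.

    The first block uses silence as its "previous" frame (an assumption).
    """
    previous = [0.0] * len(frames[0])
    for frame in frames:
        current = list(frame)
        yield previous + current
        previous = current
```

Because consecutive blocks overlap by half, each new 80-sample frame still produces 80 coefficients in this sketch, so the overlap does not inflate the amount of data.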

The normalizer 130 normalizes the domain-converted speech signal. It may divide the domain-converted speech data into a plurality of subbands and normalize the frequency band coefficients of each subband by the energy of that subband. For example, when 80 frequency band points are divided into 8 subbands, each subband contains 10 MDCT coefficients. In this case, the normalization process may be expressed by Equations 1 and 2.

Here, E(b) in Equation 1 denotes the energy of the b-th subband over the frequency band points of the MDCT-converted speech signal. In this embodiment the number of subbands is 16, so b may be an integer from 0 to 15.

Equation 2 shows how the coefficient at each MDCT-converted frequency is normalized using the subband energies obtained in this way. The result of Equation 2 denotes the k-th normalized MDCT coefficient.
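The per-subband normalization can be sketched as follows. Equations 1 and 2 are not reproduced in this text, so the sketch assumes that E(b) is the sum of squared MDCT coefficients in subband b and that each coefficient is divided by the square root of its subband energy; both choices, the zero-energy guard, and the function names are assumptions made for illustration. The subband size of 10 coefficients follows the example given above.

```python
import math


def subband_energies(coeffs, band_size=10):
    """Assumed E(b): sum of squared coefficients in each band of `band_size` bins."""
    return [
        sum(c * c for c in coeffs[start:start + band_size])
        for start in range(0, len(coeffs), band_size)
    ]


def normalize_by_subband(coeffs, band_size=10, eps=1e-12):
    """Divide every coefficient by sqrt(E(b)) of the subband it belongs to.

    Subbands whose energy is (almost) zero are mapped to zeros to avoid
    division by zero; this guard is not described in the source.
    """
    energies = subband_energies(coeffs, band_size)
    normalized = []
    for k, c in enumerate(coeffs):
        energy = energies[k // band_size]
        normalized.append(c / math.sqrt(energy) if energy > eps else 0.0)
    return normalized, energies
```

Keeping the energies alongside the normalized coefficients is a design choice of this sketch: a later energy-scaling stage would need them to restore realistic levels.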

The determination unit 150 determines voiced or unvoiced sections based on the normalized speech signal. It may receive the frequency gradient obtained by the decoder 110 during decoding, and determine that a section is voiced when the frequency gradient is greater than or equal to a predetermined value.

For example, when the decoder 110 is a CELP-type speech decoder, the frequency gradient information may be extracted from the information output by the decoder 110 and used to determine the voiced sections.

When the decoder 110 uses G.729, the frequency gradient, which is the first-order reflection coefficient, may be obtained during decoding through Equation 3 and passed to the determination unit 150.

Here, the sample term in Equation 3 denotes the value of the n-th sample of one frame of the received speech signal in the time domain.

The determination unit 150 receives the frequency gradient obtained in this way from the decoder 110, or calculates it directly, and compares it with a predefined threshold. When the frequency gradient is greater than or equal to the threshold, the section may be determined to be a voiced section. The threshold may be preset by the user; according to experimental results, a value of 0.25 is preferable. The determination unit 150 may determine any section not determined to be voiced to be an unvoiced section. By receiving the frequency gradient information produced while the speech signal is decoded, rather than calculating the frequency gradient itself, the determination unit 150 can reduce the amount of computation.
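The voiced/unvoiced decision can be sketched as below. Equation 3 is not reproduced in this text, so the sketch assumes the common definition of the first-order reflection coefficient as the lag-1 autocorrelation of the frame divided by its energy; the sign convention, the handling of silent frames, and the function names are assumptions. Only the threshold of 0.25 and the "greater than or equal" comparison come from the description above.

```python
def first_reflection_coefficient(frame):
    """Assumed frequency gradient: lag-1 autocorrelation divided by frame energy."""
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return 0.0  # silent frame: treated as having no tilt (assumption)
    lag_one = sum(frame[n] * frame[n - 1] for n in range(1, len(frame)))
    return lag_one / energy


def is_voiced(frame, threshold=0.25):
    """Voiced when the gradient reaches the threshold, otherwise unvoiced."""
    return first_reflection_coefficient(frame) >= threshold
```

A slowly varying frame has strongly correlated neighbouring samples and therefore a gradient near 1, while a rapidly alternating, noise-like frame gives a value near or below 0.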

The voiced sound processor 151 obtains a first section, containing the harmonic section, from the section determined to be voiced. Using the pitch information obtained by the decoder 110, it may extract the harmonic section from the part of the normalized speech signal determined to be voiced. The first section, being a harmonic section, may comprise a plurality of sections, and there may be more than one first section.

The pitch information may refer to the pitch period of the speech, and may include the position and spacing of the harmonics in the frequency domain. In voiced sound, the speech signal forms harmonics at the period given by the pitch information, so the voiced sound processor 151 can use this pitch information to extract the harmonic section of the voiced section. In particular, the pitch information can also be extracted while the decoder 110 decodes the speech signal; reusing it reduces the amount of computation, and because the calculation is performed only for voiced sections, the harmonic section can be extracted quickly.

Here, the decoder 110 or the voiced sound processor 151 may extract the pitch information T using Equations 4 and 5.

Here, T denotes the lag value that maximizes the quantity defined in Equation 4. T is the pitch value, and the lower and upper bounds of its search range are preferably 20 and 147, respectively.
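The pitch search can be sketched as below. Equations 4 and 5 are not reproduced in this text, so the sketch assumes a plain (unnormalized) autocorrelation maximized over the lag range; only the bounds 20 and 147 come from the description. The function name `estimate_pitch` and the requirement that the buffer be longer than the largest lag tested are assumptions.

```python
import math


def estimate_pitch(samples, t_min=20, t_max=147):
    """Return the lag in [t_min, t_max] that maximizes the autocorrelation.

    Lags that are not shorter than the buffer are skipped, so the buffer
    should span more than one frame when large lags matter.
    """
    if len(samples) <= t_min:
        raise ValueError("buffer too short for the requested lag range")
    best_lag, best_score = t_min, -math.inf
    upper = min(t_max, len(samples) - 1)
    for lag in range(t_min, upper + 1):
        score = sum(samples[n] * samples[n - lag] for n in range(lag, len(samples)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

With an unnormalized score, multiples of the true period score lower than the period itself because fewer sample pairs contribute, which keeps this simple search from locking onto a doubled pitch in the periodic test case.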

Based on the pitch information T obtained in this way, the voiced sound processor 151 may extract the harmonic section from the voiced section. Given the pitch information T, the harmonic section in the 2N-point MDCT frequency domain may be calculated by Equations 6 and 7.

Here, T denotes the pitch information, N denotes the number of samples per frame, and the input term denotes the MDCT coefficients of the speech signal normalized by the normalization unit 130 through Equation 2. Mod(x, y) denotes the modulo operation x % y, and ⌊x⌋ denotes the largest integer not exceeding x. The index k may take values from 0 to N/2-1, depending on the number of samples. The output of Equations 6 and 7 may contain the MDCT coefficients of the harmonic section extracted from the voiced section identified by the determination unit 150. By outputting these coefficients, the voiced sound processor 151 extracts and outputs the harmonic-section data of the voiced section.
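
How a pitch lag maps to harmonic positions in the MDCT domain can be illustrated with a short sketch. It is not a reconstruction of Equations 6 and 7. It relies only on the standard relations that a lag of T samples corresponds to a fundamental of fs/T, and that an MDCT with N coefficients has a bin width of fs/(2N), so consecutive harmonics are 2N/T bins apart. The function name `harmonic_bins` and the values N = 80, T = 40 are assumptions chosen for illustration.

```python
import math

def harmonic_bins(pitch_lag, n_coeffs):
    """Indices of the MDCT bins that contain the harmonics of the pitch.

    Bin k covers frequencies [k, k + 1) * fs / (2N), so harmonic h of the
    fundamental fs / T falls into bin floor(h * 2N / T).
    """
    spacing = 2.0 * n_coeffs / pitch_lag  # harmonic spacing measured in bins
    bins = []
    h = 1
    while math.floor(h * spacing) < n_coeffs:
        bins.append(math.floor(h * spacing))
        h += 1
    return bins

# Hypothetical example: 80 coefficients per frame, pitch lag of 40 samples.
bins = harmonic_bins(pitch_lag=40, n_coeffs=80)
```

With these values the spacing is 4 bins, that is, 200 Hz at an 8 kHz sampling rate.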

Meanwhile, the unvoiced sound processor 152 obtains, from the section determined to be unvoiced, the second section having the highest correlation. The unvoiced sound processor 152 may evaluate the cross-correlation of each frequency section within the section of the normalized speech signal determined to be unvoiced, and extract the section with the highest cross-correlation as the second section. The obtained second section may lie in the 3 kHz to 4 kHz frequency band. The second section may then be amplified and shifted to the high band and used as the unvoiced section of the high-band speech signal. This can be computed with the following equations.

As in Equation 8, the result of the search may denote the value of m that maximizes the correlation over the frequency index k within the unvoiced section of the normalized speech signal. Accordingly, m may be any integer from 0 to N/4-1. The correlation calculation of Equation 8 is shown in more detail in Equation 9 below.

Accordingly, the MDCT coefficients corresponding to the frequency section with the highest correlation may be calculated as in Equation 10.

Here, k may be any integer from 0 to N/2-1. The unvoiced sound processor 152 outputs the coefficients computed in this way from the correlation as the unvoiced section of the high-frequency band. The second section output as the unvoiced section may, like the first section, include a plurality of sections, and there may be a plurality of second sections.
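
The offset search described above can be sketched as follows. Equations 8 to 10 are not reproduced in this text, so the sketch rests on stated assumptions: the correlation is a normalized cross-correlation, the reference template is the upper half of the low-band coefficients, and the extracted window has N/2 coefficients starting at the best offset m in 0 to N/4-1. The names `best_offset` and `extract_second_section` are hypothetical.

```python
import math

def best_offset(coeffs, reference, max_offset):
    """Offset m in [0, max_offset) whose window correlates best with `reference`."""
    width = len(reference)
    ref_energy = sum(b * b for b in reference)
    best_m, best_score = 0, float("-inf")
    for m in range(max_offset):
        window = coeffs[m:m + width]
        num = sum(a * b for a, b in zip(window, reference))
        den = math.sqrt(sum(a * a for a in window) * ref_energy)
        score = num / den if den > 0.0 else 0.0
        if score > best_score:
            best_m, best_score = m, score
    return best_m

def extract_second_section(coeffs):
    """Return the N/2-coefficient window with the highest correlation."""
    n = len(coeffs)
    reference = coeffs[n // 2:]  # assumption: upper half of the low band as template
    m = best_offset(coeffs, reference, n // 4)
    return coeffs[m:m + n // 2]

# Toy frame of 16 coefficients in which the window at offset 2 equals the template.
toy = [0.3, -0.7, 1, -2, 3, 0.5, -1, 2, 1, -2, 3, 0.5, -1, 2, 1, -2]
section = extract_second_section(toy)
```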

Meanwhile, bandwidth amplification is performed on the first-section or second-section output obtained by the voiced sound processor 151 or the unvoiced sound processor 152. The voiced sound processor 151 or the unvoiced sound processor 152 outputs the coefficients produced by Equation 7 or Equation 10 as the first or second section, but this procedure halves the bandwidth of the frequency section. For example, if the target bandwidth is 4 kHz, the output has a bandwidth of only 2 kHz. The voiced sound processor 151 or the unvoiced sound processor 152 may therefore amplify the bandwidth by performing the operation of Equation 11 below.

Here, the term in Equation 11 may denote the k-th normalized MDCT coefficient in the frequency domain.

Meanwhile, the energy adjusting unit 160 performs energy scaling on the MDCT-domain speech signal of the first or second section obtained by the voiced or unvoiced processing described above.

The energy adjusting unit 160 adjusts each coefficient of the MDCT speech signal of the first or second section so as to reduce the abrupt change in energy that arises when the section is converted into a high-band signal.

Accordingly, the energy adjusting unit 160 matches the energy at the boundary between the low-band speech signal and the speech signal obtained by shifting the first or second section to the high band, using scale adjustment to smooth the abrupt change in energy. For example, the energy adjusting unit 160 may adjust the energy scale according to Equations 12 to 14 below.

Here, E_h(b) may denote the energy of the b-th frequency band of the high-band section, where b may be an integer from 0 to 7, and E(b) may denote the energy of the b-th frequency band of the low-band section, as defined in Equation 1.

The scale factor for energy scaling at the boundary between the low-band section and the high-band section may be determined as in Equation 13 below.

E(15) denotes the energy of the highest of the 16 subband frequency bands (indices 0 to 15) of the low-band section, and E_h(0) denotes the energy of the first subband frequency band of the high-band section. The energy adjusting unit 160 may thus obtain the energy scaling factor from the ratio of the energies of these two frequency bands.

The scale-adjusted energy value of the high band is given by Equation 14 below.
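
The boundary energy matching described above can be sketched as follows. The exact forms of Equations 12 to 14 are not reproduced in this text, so the sketch assumes that subband energy is the mean squared coefficient and that the gain is the square root of the ratio E(15) / E_h(0), which makes the first high-band subband match the last low-band subband in energy. The function names and the toy coefficient values are hypothetical.

```python
import math

def subband_energies(coeffs, n_subbands):
    """Mean squared value of the coefficients in each equal-width subband (assumed definition)."""
    width = len(coeffs) // n_subbands
    return [sum(c * c for c in coeffs[b * width:(b + 1) * width]) / width
            for b in range(n_subbands)]

def match_boundary_energy(low_band, high_band, n_low=16, n_high=8):
    """Scale the high band so its first subband has the energy of the last low-band subband."""
    e_low = subband_energies(low_band, n_low)     # E(0) .. E(15)
    e_high = subband_energies(high_band, n_high)  # E_h(0) .. E_h(7)
    if e_high[0] == 0.0:
        return list(high_band)  # nothing to scale against
    gain = math.sqrt(e_low[-1] / e_high[0])  # assumed amplitude gain from the energy ratio
    return [gain * c for c in high_band]

# Toy data: the last low-band subband has amplitude 2, the shifted section amplitude 4.
low = [1.0] * 75 + [2.0] * 5
high = [4.0] * 40
scaled = match_boundary_energy(low, high)
```

Taking the square root converts the energy ratio into an amplitude gain that can be applied directly to the coefficients.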

Since the high-band speech signal data obtained from Equation 14 requires the bandwidth extension described for Equation 11, the energy adjusting unit 160 may perform the operation of Equation 15 to increase the bandwidth of the high-band speech signal.

Using Equations 11 and 15, the energy adjusting unit 160 may then output the high-band speech signal as in Equation 16.

As described above, the energy adjusting unit 160 performs energy adjustment on the first or second section to be converted into the high-band speech signal, based on the energy values of the normalized speech signal, and may output the energy-adjusted coefficients.

The speech signal synthesis unit 180 may generate a wideband speech signal by combining the energy-adjusted high-band speech signal with the speech signal output from the normalization unit 130, and convert it from the frequency domain to the time domain for output. To this end, the speech signal synthesis unit 180 may perform the operation of Equation 17 below and convert the resulting data to the time domain, thereby outputting a wideband speech signal.

Meanwhile, according to another embodiment of the present invention, the apparatus 100 for extending the bandwidth of a speech signal may further include a low-band inverse transform unit 140 and a high-band inverse transform unit 170, as shown in FIG. 2.

The low-band inverse transform unit 140 may inverse-transform the normalized speech signal into a low-band speech signal in the time domain and output the resulting time-domain signal.

The high-band inverse transform unit 170 may inverse-transform the energy-adjusted speech signal into a high-band speech signal in the time domain and output the resulting time-domain signal.

The speech signal synthesis unit 180 may then combine the low-band and high-band speech signals output in the time domain and output the result as a filtered speech signal. To this end, the speech signal synthesis unit 180 may perform the synthesis using quadrature mirror filterbanks (QMF). A 64-band complex QMF may be used.

FIG. 3 is a diagram illustrating a method of extending the bandwidth of a speech signal according to an embodiment of the present invention.

Referring to FIG. 3, the decoder 110 first receives a narrowband speech signal (S100). As described above, the speech signal may be decoded with the narrowband codec G.729 [ITU-T Recommendation G.729, Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP)]. The decoder 110 may also perform decoding using a CELP (Code Excited Linear Prediction) type speech decoder based on spectrum analysis.

The domain converter 120 then converts the decoded speech signal into the frequency domain (S110). As described above, the domain converter 120 may use the modified discrete cosine transform (MDCT) for this conversion.

As described above, the domain converter 120 may receive the decoded speech signal as a time-domain input, convert it into a frequency-domain signal, and perform an overlap operation between blocks. Using the MDCT also has the advantage that the bit rate does not increase.
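
The block transform with overlap described above can be illustrated with a direct (unoptimized) MDCT. This is a generic textbook sketch, not the patent's implementation: it assumes a sine window and 50% overlap, with each block of 2N samples producing N coefficients, which is why the overlap does not increase the number of values per frame. The function names are hypothetical.

```python
import math

def mdct(block):
    """Direct MDCT: 2N windowed samples -> N coefficients."""
    n = len(block) // 2
    return [sum(block[t] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                for t in range(2 * n))
            for k in range(n)]

def imdct(coeffs):
    """Inverse MDCT: N coefficients -> 2N time-aliased samples (2/N scaling for windowed use)."""
    n = len(coeffs)
    return [(2.0 / n) * sum(coeffs[k] * math.cos(math.pi / n * (t + 0.5 + n / 2) * (k + 0.5))
                            for k in range(n))
            for t in range(2 * n)]

def sine_window(n):
    """Sine window of length 2N; satisfies w[t]^2 + w[t + N]^2 = 1."""
    return [math.sin(math.pi * (t + 0.5) / (2 * n)) for t in range(2 * n)]

def analysis_synthesis(signal, n):
    """Window, transform, inverse-transform, window again and overlap-add with hop N."""
    w = sine_window(n)
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = [w[t] * signal[start + t] for t in range(2 * n)]
        rec = imdct(mdct(block))
        for t in range(2 * n):
            out[start + t] += w[t] * rec[t]
    return out

# Two overlapping blocks of 16 samples; samples 8..15 are covered by both blocks.
x = [math.sin(0.3 * t) + 0.1 * t for t in range(24)]
y = analysis_synthesis(x, 8)
```

The time-domain aliasing of adjacent blocks cancels in the overlap-add, so the samples covered by two blocks are reconstructed up to rounding error.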

The normalization unit 130 then normalizes the converted speech signal (S120). As described above, the normalization unit 130 may divide the domain-converted speech signal data into a plurality of subbands, compute the energy of each subband from its frequency-band coefficients, and normalize accordingly. For example, when 80 frequency points are divided into 16 subbands, each subband contains 5 MDCT coefficients.
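
The subband split in this example can be sketched as follows. Equations 1 and 2 are not reproduced in this text, so the sketch assumes that the subband energy is the mean squared coefficient and that normalization divides each coefficient by the square root of its subband energy. The function name `normalize_by_subband` and the toy spectrum are hypothetical.

```python
import math

def normalize_by_subband(coeffs, n_subbands=16):
    """Split coefficients into equal subbands and normalize each by its RMS value.

    Returns the normalized coefficients and the per-subband energies.
    """
    width = len(coeffs) // n_subbands  # 80 coefficients / 16 subbands = 5
    energies, normalized = [], []
    for b in range(n_subbands):
        band = coeffs[b * width:(b + 1) * width]
        energy = sum(c * c for c in band) / width  # assumed energy definition
        energies.append(energy)
        scale = math.sqrt(energy) if energy > 0.0 else 1.0
        normalized.extend(c / scale for c in band)
    return normalized, energies

# Toy spectrum: subband b holds five coefficients of amplitude b + 1.
spectrum = [float(b + 1) for b in range(16) for _ in range(5)]
flat, energies = normalize_by_subband(spectrum)
```

Keeping the per-subband energies alongside the normalized coefficients allows later stages to reuse them, for example when matching energy at the band boundary.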

Next, the determination unit 150 determines voiced or unvoiced sections from the normalized speech signal (S130). As described above, the determination unit 150 may receive the frequency gradient obtained by the decoder 110 during decoding and determine that a section is voiced when the frequency gradient is equal to or greater than a predetermined value. For example, when the decoder 110 is a CELP type speech decoder, the determination unit 150 may extract the frequency gradient information from the information output by the decoder 110 to identify voiced sections. When the decoder 110 uses G.729, the frequency gradient, which is the first-order reflection coefficient, may be obtained during decoding through Equation 3 described above and used for the determination.
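
The threshold decision described above can be sketched as follows. Equation 3 is not reproduced in this text, so the sketch assumes that the frequency gradient is the ratio of the lag-1 to the lag-0 autocorrelation of the frame (sign conventions for the first reflection coefficient vary between codecs). The threshold of 0.5 and the function names are hypothetical.

```python
import math

def spectral_tilt(frame):
    """Assumed tilt measure: r(1) / r(0) of the time-domain frame."""
    r0 = sum(s * s for s in frame)
    r1 = sum(frame[n] * frame[n - 1] for n in range(1, len(frame)))
    return r1 / r0 if r0 > 0.0 else 0.0

def is_voiced(frame, threshold=0.5):
    """Classify a frame as voiced when the tilt reaches the (hypothetical) threshold."""
    return spectral_tilt(frame) >= threshold

# A slowly varying sinusoid has a tilt near +1; a sign-alternating sequence has a tilt near -1.
voiced_like = [math.sin(2 * math.pi * n / 50) for n in range(200)]
unvoiced_like = [(-1.0) ** n for n in range(200)]
decisions = (is_voiced(voiced_like), is_voiced(unvoiced_like))
```

Low-frequency-dominated (voiced) frames have strongly correlated neighboring samples, which is why a large positive tilt indicates voicing under this convention.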

The voiced sound processor 151 extracts, for a voiced section, a first section including the harmonic section calculated from the pitch information described above (S140), and the unvoiced sound processor 152 extracts, for an unvoiced section, the section most highly correlated with the normalized speech signal as a second section (S135). Each processor 151, 152 then amplifies the bandwidth of its extracted section and shifts it to the high band (S150).

As described above, the voiced sound processor 151 may extract the harmonic section from the section of the normalized speech signal determined to be voiced, using the pitch information obtained by the decoder 110. This first section is a harmonic section that may include a plurality of sections, and there may be a plurality of first sections. Likewise, using the equations described above, the unvoiced sound processor 152 may evaluate the cross-correlation of each frequency section within the section of the normalized speech signal determined to be unvoiced and extract the section with the highest cross-correlation as the second section. Because the bandwidth of the obtained section is only half of the target bandwidth, each processor 151, 152 amplifies the bandwidth and then shifts the section to the high band.

Next, the energy adjusting unit 160 adjusts the energy scale of the output first or second section (S160). As described above, the energy adjusting unit 160 may adjust each coefficient of the MDCT speech signal of the first or second section so as to reduce the abrupt change in energy that arises when the section is converted into a high-band signal. It does so by matching the energy at the boundary between the low-band speech signal and the speech signal obtained by shifting the first or second section to the high band.

The speech signal synthesis unit 180 then combines the scale-adjusted high-band speech signal with the normalized speech signal, that is, the low-band speech signal, to obtain a wideband signal (S170), converts it into a wideband speech signal, and outputs it (S180). For synthesis and conversion, the speech signal synthesis unit 180 may perform an inverse MDCT, and for wideband synthesis it may use the QMF method described above.

FIG. 4 is a graph of the results of an experiment on the performance of the apparatus 100 for extending the bandwidth of a speech signal according to an embodiment of the present invention.

For the experiment, a MUSHRA test (ITU-R BS.1534, Method for the Subjective Assessment of Intermediate Quality Level of Coding Systems, 2001) was conducted, and sound quality was further assessed by comparing spectra. For the MUSHRA test, three male and three female speech files from the Sound Quality Assessment Material (SQAM) database (EBU, Sound Quality Assessment Material Recordings for Subjective Tests, 1988) were used.

Because the SQAM speech files are stereo recordings sampled at 44.1 kHz, they were downsampled to 8 kHz and 16 kHz and converted to mono for the performance measurement. The 8 kHz material was used to generate the signals processed by the prior art (G.729) and by the embodiment of the present invention, and the 16 kHz material was used to obtain the signal processed by a general wideband transmission technique (G.729.1). Seven listeners with no hearing impairment took part in the test and rated the sound quality of each of the six files described above on a scale from 0 to 100.

In the MUSHRA test, as shown in FIG. 4, the present invention scored about 75.5 against a reference of 100 for the original sound. As expected, this is higher than the 66 points of G.729, a conventional narrowband technique, and somewhat lower than the 87 points of the wideband transmission technique G.729.1 (Layer 3), which allocates additional bits for wideband signal generation. Nevertheless, the results indicate a sound quality improvement of about 43% over the general narrowband transmission technique G.729 without any additional bits, and only a modest loss in quality relative to the wideband transmission scheme that uses additional bits.
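
One way to put the reported scores into relative terms is sketched below. It uses only the three mean scores quoted above. The two ratios computed (the gain relative to the G.729 score, and the share of the gap between G.729 and G.729.1 that is recovered) are illustrative choices of normalization and are not taken from the source, which does not state how its percentage was derived.

```python
# Mean MUSHRA scores as quoted in the text (original sound = 100).
scores = {"G.729": 66.0, "proposed": 75.5, "G.729.1": 87.0}

# Absolute gain of the proposed method over G.729, in MUSHRA points.
gain = scores["proposed"] - scores["G.729"]

# Gain relative to the G.729 score itself.
relative_gain = gain / scores["G.729"]

# Share of the G.729 -> G.729.1 gap that is recovered without extra bits.
gap_closed = gain / (scores["G.729.1"] - scores["G.729"])

print(f"gain: {gain:.1f} points, relative: {relative_gain:.1%}, gap closed: {gap_closed:.1%}")
```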

FIGS. 5 to 8 are graphs comparing the spectra obtained with the bandwidth extension method according to the embodiment of the present invention evaluated in FIG. 4 with those of the other techniques.

FIG. 5 shows the spectrum of the original sound before transmission. The high-band portion of FIG. 5 is not transmitted in narrowband transmission.

FIG. 6 shows the spectrum of the signal reconstructed by the conventional narrowband technique (G.729). As shown in FIG. 6, the high-band portion is not restored in the prior art, which degrades the sound quality.

FIG. 7 shows the spectrum of the signal reconstructed by the general wideband transmission technique (G.729.1), which uses additional bits. As shown in FIG. 7, the high-band data is not completely restored even with the wideband transmission technique. Moreover, this scheme increases the amount of computation because of the additional bits and requires equipment to be replaced.

FIG. 8 shows the spectrum of a signal restored to wideband from a received narrowband signal (for example, a signal coded with G.729) according to the embodiment of the present invention. As shown in FIG. 8, the high-band portion differs somewhat from the original sound but is improved compared with the conventional method of FIG. 6. It also does not differ greatly from the result of FIG. 7, the wideband transmission using additional bits.

Therefore, according to the embodiment of the present invention, sound quality can be improved by post-processing at the decoder without any additional bit allocation. The embodiment also makes it possible to secure communication bandwidth between terminals while maintaining high sound quality, and because the existing network need not be replaced or upgraded, the time and cost of installing wideband equipment can be saved.

The method of extending the bandwidth of a speech signal according to the present invention described above may be implemented as a program to be executed on a computer and stored in a computer-readable recording medium. Examples of the computer-readable recording medium include ROM, RAM, CD-ROM, magnetic tape, floppy disks, and optical data storage devices, and also include implementations in the form of carrier waves (for example, transmission over the Internet).

The computer-readable recording medium may be distributed over networked computer systems so that computer-readable code is stored and executed in a distributed manner. Functional programs, code, and code segments for implementing the method can be readily inferred by programmers skilled in the art to which the present invention pertains.

Although preferred embodiments of the present invention have been shown and described above, the present invention is not limited to the specific embodiments described. Various modifications can be made by those of ordinary skill in the art to which the invention pertains without departing from the gist of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or scope of the present invention.

Claims

A method of extending the bandwidth of a received speech signal, the method comprising:
decoding the received speech signal and converting it into a frequency domain;
performing normalization on the converted speech signal;
determining a voiced or unvoiced section from the received speech signal;
extracting, from the normalized speech signal, a first section including a harmonic component of the voiced section, based on the section determined to be voiced;
extracting a second section from the normalized speech signal based on a cross-correlation between the section determined to be unvoiced and the normalized speech signal;
generating a high-band speech signal based on the first section and the second section; and
synthesizing the generated high-band speech signal and the converted speech signal and outputting a wideband speech signal.

The method of claim 1,
wherein determining the voiced or unvoiced section comprises:
extracting a frequency gradient from the received speech signal; and
determining that a section is voiced when the extracted frequency gradient is greater than a preset value.

The method of claim 1,
wherein extracting the first section comprises:
extracting pitch information from the received speech signal;
obtaining a harmonic section of the section determined to be voiced, based on the extracted pitch information; and
extracting the harmonic section as the first section.

The method of claim 1,
wherein extracting the second section comprises:
extracting, as the second section, a section having the greatest correlation with the normalized speech signal from the section determined to be unvoiced.

The method of claim 1,
wherein generating the high-band speech signal comprises:
shifting the bandwidth of at least one of the first section and the second section to a high-band frequency band; and
generating the high-band speech signal by performing energy compensation of the shifted section.

The method of claim 5,
wherein performing the energy compensation comprises:
dividing the normalized speech signal into a plurality of first subbands according to frequency band;
dividing the speech signal of the shifted section into a plurality of second subbands;
obtaining a scaling factor based on the first subbands and the second subbands; and
performing energy compensation of the shifted section using the scaling factor.

An apparatus for extending the bandwidth of a speech signal, the apparatus comprising:
a receiving unit configured to receive a speech signal;
a decoder configured to decode the received speech signal;
a domain converter configured to convert the decoded speech signal into a frequency domain;
a normalizer configured to normalize the converted speech signal;
a discriminating unit configured to determine a voiced or unvoiced section from the received speech signal;
a voiced sound processor configured to extract, from the normalized speech signal, a first section including a harmonic component of the voiced section, based on the section determined to be voiced;
an unvoiced sound processor configured to extract a second section from the normalized speech signal based on a correlation between the section determined to be unvoiced and the normalized speech signal;
a high-band generator configured to generate a high-band speech signal based on the first section and the second section; and
an output unit configured to synthesize the generated high-band speech signal and the converted speech signal and to output a wideband speech signal.

The apparatus of claim 7,
wherein the discriminating unit
extracts a frequency gradient from the received speech signal and determines that a section is voiced when the extracted frequency gradient is greater than a preset value.

The apparatus of claim 7,
wherein the voiced sound processor
extracts pitch information from the received speech signal, obtains a harmonic section of the section determined to be voiced based on the extracted pitch information, and outputs the harmonic section as the first section.

The apparatus of claim 7,
wherein the unvoiced sound processor
extracts, as the second section, a section having the greatest correlation with the normalized speech signal from the section determined to be unvoiced.

The apparatus of claim 7,
wherein the high-band generator
shifts the bandwidth of the first section or the second section to a high-band frequency band and performs energy compensation of the shifted section to generate the high-band speech signal.

The apparatus of claim 11,
wherein the high-band generator
performs energy compensation of the shifted section using a scaling factor obtained from the normalized speech signal divided into a plurality of first subbands according to frequency band and from the speech signal of the shifted section divided into a plurality of second subbands.