KR19990079718A

KR19990079718A - Speech Signal Playback Method Using a Multi-Band Excitation Algorithm

Info

Publication number: KR19990079718A
Application number: KR1019980012456A
Authority: KR
Inventors: 장계용
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1998-04-08
Filing date: 1998-04-08
Publication date: 1999-11-05

Abstract

The present invention relates to a speech signal playback method using an MBE (Multi-Band Excitation) speech signal parameter extraction algorithm. To play back a high-quality speech signal identical to the original, the method uses the MBE model to adjust the playback speed by changing the playback time through the repetition or deletion of frames, without changing the pitch and spectral envelope parameters of the speech signal. This solves the problem of having to adjust the playback speed without changing the pitch and spectral envelope parameters in the same frequency range as the original sound, and improves the function of a vocoder equipped with the method.

Description

SPEECH PLAYBACK METHOD USING MULTI BAND EXCITATION SPEECH PRODUCTION MODEL

The present invention relates to a speech signal playback method and, more specifically, to a speech signal playback method that uses Multi-Band Excitation (MBE) in a speech signal parameter extraction algorithm.

A speech signal sometimes needs to be played back faster or slower than it was recorded. With a telephone answering machine, for example, a user who wants to listen closely to a recorded message can run the machine slowly. Conversely, a user who already knows the subject of the recording can play it back quickly and avoid wasting time. The playback speed of such an answering machine is set by a playback algorithm.

Conventional playback methods work by scaling the playback time or the frequency of the synthesized speech signal. Such techniques, however, have the problem that the pitch and spectral envelope must be adjusted within the same frequency range as the original sound. It is also not easy to improve speech signal processors that use these techniques, for example the many vocoders such as IMBE and LPC.

The LPC model is the classic model for speech signal analysis. It is used in several speech codecs, for example LPC-10 and CELP, to extract speech signal parameters and speech signal features.

More recently the MBE speech model has come into use; it extracts speech parameters in a more rational way. In practice, many high-quality vocoders are based on the MBE model.

The playback technique leaves the pitch of the original speech signal unchanged: slow playback lengthens the playback time of the original sound, and fast playback shortens it.

According to the speech signal recording/playback algorithm, any speech signal is recorded and played back in terms of two parameters. One is the pitch and the other is the spectral envelope. To be played back, a speech signal must be analyzable and must be reproducible as speech of clear quality, and the spectral envelope parameter must be preserved. Speech playback can therefore produce speech data close in quality to the original sound by changing the time scale while the pitch and spectral envelope parameters remain unchanged.

Typical playback methods thus work by scaling time or frequency. Such techniques, however, have the problem that the pitch and spectral envelope must be adjusted in the frequency range. It is also not easy to improve speech signal processors that use these techniques, for example the many vocoders such as IMBE.

The object of the present invention is to solve the problems described above by implementing a speech signal playback algorithm using the MBE parameter extraction algorithm.

FIG. 1 is a flowchart showing the procedure of a speech signal playback method according to the present invention;

FIG. 2 is a flowchart showing the procedure of the speech signal parameter extraction step shown in FIG. 1;

FIG. 3A shows an analog speech signal according to an embodiment of the present invention;

FIG. 3B shows the speech signal waveforms obtained by playing the analog speech signal of FIG. 3A at a slow playback speed; and

FIG. 3C shows the speech signal waveforms obtained by playing the analog speech signal of FIG. 3A at a fast playback speed.

* Description of reference numerals for the main parts of the drawings *

1 : analog speech signal

2 : first synthesized signal

3 : second synthesized signal

4 : third synthesized signal

5 : fourth synthesized signal

6 : fifth synthesized signal

7 : sixth synthesized signal

According to a feature of the present invention for achieving the above object, a speech playback method that plays back a speech signal using the MBE algorithm comprises: receiving the analog signal and extracting speech signal parameters in units of a specific frame; synthesizing a speech signal from the extracted parameters; determining a first or a second playback speed for playing back the speech signal; a first playback step of, when the first playback speed is used, repeatedly inserting at least one frame and playing back with the inserted frame joined continuously to the frames before and after it; and a second playback step of, when the second playback speed is used, deleting at least one frame and playing back with the phases of the frames before and after the deleted frame joined.

The extraction step comprises: extracting an initial pitch parameter; refining the pitch parameter to reduce the error of the extracted pitch parameter; classifying V/UV information by means of the refined pitch parameter; and improving the pitch using the extracted and refined pitch parameter and the V/UV information.

Thus, according to the present invention, the original analog speech signal is received and speech signal parameters are extracted frame by frame. The error is reduced using the extracted parameters and information, and the signal is synthesized. Then, when the speech signal is played back, specific frames are inserted or deleted according to the playback speed, and the frames before and after each changed frame are joined with the same phase.

Embodiments of the present invention are described in detail below with reference to the accompanying drawings.

FIG. 1 shows the procedure of a speech playback method according to an embodiment of the present invention.

Referring to the figure, in step S10 speech signal parameters are extracted from the speech signal using the MBE speech extraction algorithm model. Specifically, as shown in FIG. 2, an initial pitch parameter is extracted in step S30. In step S32 the extracted pitch parameter is refined in order to reduce the error of the speech signal. In step S34, V (Voiced) / UV (Unvoiced) information is classified from the refined pitch parameter. In step S36 the error of the speech signal is minimized using the pitch parameter and the V/UV information.

Continuing with FIG. 1, in step S12 a speech signal of the same quality as the original speech signal is synthesized from the parameters extracted in step S10. In step S14 the speed at which the synthesized speech signal is to be played back is determined. For slow playback, the method proceeds to step S16 and repeatedly inserts at least one current frame; this must not greatly distort the original speech signal. In step S18 the inserted frame is joined to the frames before and after it with the same phase. In step S24 the speech signal is played back at the selected speed. For fast playback, the method proceeds to step S20 and deletes at least one frame while keeping the result close to the original speech signal. In step S22 the frames before and after the deleted frame are joined with the same phase. The speech signal is then played back at the fast speed.

An example speech waveform is described with reference to FIGS. 3A to 3C. FIG. 3A shows the original analog speech signal (1). At a slow playback speed, the speech signal (1) is generated through repetition as the synthesized signals (2, 3, 4) shown in FIG. 3B. The first synthesized signal (2) is the speech signal synthesized from the first harmonic, which is the pitch frequency. The second synthesized signal (3) is synthesized at three times the frequency of the first harmonic. The third synthesized signal (4) is the speech signal synthesized from all harmonics.

At a fast playback speed, specific frames of the speech signal (1) are deleted to generate the synthesized signals (5, 6, 7) shown in FIG. 3C.

Specifically, the MBE speech parameters include the pitch, the V/UV information for every subband, and the spectral amplitudes.

Pitch extraction consists of initial pitch estimation and pitch refinement. In initial pitch estimation, an approximate pitch value is obtained from the preceding and following pitch tracks and the autocorrelation of the current frame. The frame length processed is always 160 samples, or 20 ms.
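As an illustration only, the autocorrelation part of the initial pitch estimate might be sketched in Python as follows. The 8 kHz sampling rate follows from 160 samples per 20 ms frame. The function name `initial_pitch_period`, the lag search range, and the test tone are assumptions made for this sketch, and the pitch tracking across neighboring frames is omitted.

```python
import math

FS = 8000          # samples per second, implied by 160 samples per 20 ms
FRAME_LEN = 160    # samples per frame, as stated in the description


def initial_pitch_period(frame, min_lag=20, max_lag=120):
    """Return the lag (in samples) with the highest normalized autocorrelation.

    The lag range is an assumed search window (roughly 67 Hz to 400 Hz at 8 kHz).
    """
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        idx = range(lag, len(frame))
        num = sum(frame[n] * frame[n - lag] for n in idx)
        e1 = sum(frame[n] ** 2 for n in idx)
        e2 = sum(frame[n - lag] ** 2 for n in idx)
        denom = math.sqrt(e1 * e2)
        score = num / denom if denom > 0 else 0.0
        # The small margin keeps the shortest lag when multiples of the
        # true period score the same up to rounding error.
        if score > best_score + 1e-9:
            best_lag, best_score = lag, score
    return best_lag


# Hypothetical test input: a 200 Hz tone, whose period is 40 samples at 8 kHz.
frame = [math.sin(2 * math.pi * 200 * n / FS) for n in range(FRAME_LEN)]
period = initial_pitch_period(frame)
pitch_hz = FS / period
```

A coarse integer lag of this kind would only be a starting point; the refinement step that follows works on the spectrum.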

Pitch refinement is carried out by finding the refined pitch that minimizes the error between the speech spectrum and the synthesized speech spectrum.

According to this error in each subband, the subband is classified as voiced or unvoiced. The number of V/UV decisions obtained therefore equals the number of subbands or harmonics.
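A minimal sketch of such a per-band decision is given below. It assumes a normalized squared error between the measured and synthesized magnitudes of a band and a fixed threshold. The function names, the error normalization, the threshold value of 0.2, and the sample magnitudes are all hypothetical and are not taken from the description.

```python
def band_error(measured, synthesized):
    """Normalized squared error between two lists of spectral magnitudes."""
    energy = sum(m * m for m in measured)
    if energy == 0:
        return 1.0  # treat an empty band as unvoiced
    diff = sum((m - s) ** 2 for m, s in zip(measured, synthesized))
    return diff / energy


def classify_bands(band_errors, threshold=0.2):
    """Return True (voiced) for each band whose error is below the threshold."""
    return [err < threshold for err in band_errors]


# Hypothetical magnitudes for three bands: (measured, synthesized).
bands = [
    ([1.0, 2.0, 1.0], [1.0, 1.9, 1.0]),   # close match
    ([0.5, 0.8, 0.5], [0.5, 0.8, 0.4]),   # close match
    ([0.6, 0.2, 0.7], [0.1, 0.9, 0.1]),   # poor match
]
errors = [band_error(m, s) for m, s in bands]
decisions = classify_bands(errors)
```

One decision is produced per band, which matches the statement that the number of V/UV values equals the number of subbands.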

The spectral amplitude in every subband can be calculated by the Fast Fourier Transform (FFT).
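The following sketch shows one simple way to read harmonic amplitudes from an FFT. It assumes a rectangular window, zero-padding of the 160-sample frame to 256 points, and a single bin per harmonic, and it uses a plain recursive radix-2 FFT written for the example. The function names and the test signal are assumptions; a practical MBE analyzer would typically use a tapered window and combine several bins per band.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def harmonic_amplitudes(frame, period, n_harmonics, n_fft=256):
    """Estimate A_l for l = 1..n_harmonics from the bin nearest each harmonic."""
    padded = list(frame) + [0.0] * (n_fft - len(frame))
    spectrum = fft(padded)
    amps = []
    for l in range(1, n_harmonics + 1):
        k = round(l * n_fft / period)          # bin nearest l times the pitch
        amps.append(2.0 * abs(spectrum[k]) / len(frame))
    return amps


# Hypothetical test frame: pitch period of 32 samples (250 Hz at 8 kHz),
# with harmonic amplitudes 1.0 and 0.5.
w0 = 2 * math.pi / 32
frame = [1.0 * math.cos(w0 * n) + 0.5 * math.cos(2 * w0 * n) for n in range(160)]
amps = harmonic_amplitudes(frame, period=32, n_harmonics=2)
```

The test frame is chosen so that each harmonic completes a whole number of cycles in 160 samples, which is why single-bin readings are exact here; for arbitrary pitch values they would only be approximate.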

That is, after parameter extraction the parameters are the speech pitch ω₀, the V/UV information, and the spectral amplitudes A_l (where l is the subband index). If these parameters have been extracted accurately enough to determine the characteristics of the speech, a very high-quality speech signal can be synthesized by the following equation.

[Equation 1]

In [Equation 1], S(t) denotes the synthesized speech signal, A_l is the magnitude of the l-th harmonic, ω₀ is the pitch frequency of the speech signal, and φ is the starting phase of the current frame, which equals the ending phase of the previous frame. t is the sampling time at which the parameters are evaluated. To maintain continuity between the previous frame and the current frame, the parameters of these two frames are interpolated.
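The body of [Equation 1] is not reproduced in this text. A form consistent with the symbol definitions, and with the usual sinusoidal synthesis of harmonics, would be S(t) = Σ_l A_l cos(l·ω₀·t + φ_l); this is an assumption, not the equation as filed. Under that assumption, the phase continuity between frames can be sketched as follows. The function name `synthesize_frame`, the per-harmonic phase list, and the sample parameter values are hypothetical, and unvoiced bands and parameter interpolation are left out.

```python
import math

FRAME_LEN = 160  # samples per frame (20 ms at 8 kHz)


def synthesize_frame(amps, w0, start_phases, n_samples=FRAME_LEN):
    """Synthesize one frame as a sum of harmonics of w0 (radians per sample).

    Returns the samples and the phase of each harmonic at the end of the
    frame, which serves as the starting phase of the next frame.
    """
    samples = []
    for t in range(n_samples):
        samples.append(sum(
            a * math.cos((l + 1) * w0 * t + start_phases[l])
            for l, a in enumerate(amps)
        ))
    end_phases = [
        (start_phases[l] + (l + 1) * w0 * n_samples) % (2 * math.pi)
        for l in range(len(amps))
    ]
    return samples, end_phases


# Hypothetical parameters: 200 Hz pitch at 8 kHz, two harmonics.
w0 = 2 * math.pi * 200 / 8000
amps = [1.0, 0.4]
first, phases = synthesize_frame(amps, w0, [0.3, 1.1])
second, _ = synthesize_frame(amps, w0, phases)

# Reference: the same signal generated in one run of two frame lengths.
long_run, _ = synthesize_frame(amps, w0, [0.3, 1.1], n_samples=2 * FRAME_LEN)
```

Because the second frame starts from the ending phases of the first, the two frames join without a jump and match the single long run sample for sample, up to rounding.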

In the speech parameter extraction method using the MBE model, slow playback can be achieved by synthesizing the current frame repeatedly. When a frame is synthesized again, the parameters of the current frame are reused, and the phase of the frame just synthesized is taken as the phase of the previous frame. With this method the speech pitch and spectral envelope are the same as in the original sound, so slow playback is achieved. Let the slow-playback rate be r = p/q, where p and q are positive integers with p > q; in other words, slow playback lasts r times as long as the original sound. Then, for every q frames, the q frames are synthesized and p - q of them are synthesized again. The repeated frames are placed sequentially or at random.
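One way to decide which frames to repeat for a rate r = p/q > 1 is sketched below. It spreads the p - q repeats evenly over each group of q frames, which is only one of the orderings the description allows (it also permits random placement). The function name `slow_schedule` and the example values are assumptions.

```python
def slow_schedule(n_frames, p, q):
    """Return the frame indices to synthesize so that q input frames
    yield p output frames (p > q), with repeats spread evenly."""
    if not (q > 0 and p > q):
        raise ValueError("slow playback needs integers p > q > 0")
    schedule = []
    for i in range(n_frames):
        # Output frames owed after frame i, minus those owed after frame i - 1.
        copies = ((i + 1) * p) // q - (i * p) // q
        schedule.extend([i] * copies)
    return schedule


# Example: r = 3/2, so 4 input frames (80 ms) become 6 output frames (120 ms).
order = slow_schedule(4, p=3, q=2)   # [0, 1, 1, 2, 3, 3]
duration_ms = len(order) * 20
```

Each repeated index would be synthesized with the same pitch and amplitudes as before, starting from the ending phase of the frame just played, so the pitch and spectral envelope stay unchanged.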

Likewise, in the speech parameter extraction method using the MBE model, fast playback with r = p/q < 1 can be achieved by deleting some frames: for example, q - p frames are deleted from every q frames. When a frame is deleted its parameters are not used, and the synthesizer waits for the next frame. The current frame is thus dropped, and the phase between the previous frame and the next frame is then made continuous.
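The frame selection for a rate r = p/q < 1 can be sketched in the same spirit. The version below keeps p of every q frames and spreads the deletions evenly; the even spacing, the function name `fast_schedule`, and the example values are assumptions rather than details from the description.

```python
def fast_schedule(n_frames, p, q):
    """Return the indices of the frames to keep so that q input frames
    yield p output frames (0 < p < q), with deletions spread evenly."""
    if not 0 < p < q:
        raise ValueError("fast playback needs integers 0 < p < q")
    kept = []
    for i in range(n_frames):
        # Since p < q this difference is 0 (delete) or 1 (keep).
        if ((i + 1) * p) // q - (i * p) // q == 1:
            kept.append(i)
    return kept


# Example: r = 2/3, so 6 input frames (120 ms) become 4 output frames (80 ms).
kept = fast_schedule(6, p=2, q=3)    # [1, 2, 4, 5]
duration_ms = len(kept) * 20
```

The parameters of the skipped frames are simply never passed to the synthesizer; each kept frame would start from the ending phase of the previously played frame so that the join stays continuous.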

Assuming a playback algorithm with a frame length of 20 ms, deleting a few frames or repeating a few frames does not change the perceived quality of the speech signal. Speech synthesized by slow or fast playback is in general usable, so good sound quality can be confirmed.

As described above, the present invention uses the MBE model to play back a speech signal by repeating or deleting specific frames, which makes a vocoder equipped with it easy to implement.

Claims

A method of playing back a speech signal using the MBE algorithm, comprising:

receiving a speech signal and extracting speech signal parameters in units of frames (S10);

synthesizing a speech signal from the extracted parameters (S12);

determining a first or a second playback speed for playing back the synthesized speech signal (S14);

a first playback step (S16, S18, S24) of, when the first playback speed is used, repeatedly inserting at least one frame and playing back with the phases of the inserted frame and of the frames before and after it joined continuously; and

a second playback step (S20, S22, S24) of, when the second playback speed is used, deleting at least one frame and playing back with the phases of the frames before and after the deleted frame joined continuously.

The method of claim 1,

wherein the extraction step (S10) comprises:

extracting an initial pitch parameter (S30);

refining the pitch parameter to reduce the error of the extracted pitch parameter (S32);

classifying voiced (V) and unvoiced (UV) information by means of the refined pitch parameter (S34); and

improving the pitch using the extracted pitch parameter and the voiced/unvoiced information (S36).