KR100445342B1

KR100445342B1 - Time scale modification method and system using Dual-SOLA algorithm

Info

Publication number: KR100445342B1
Application number: KR10-2001-0076957A
Authority: KR
Inventors: 박규식; 김선영; 박재현; 온승엽
Original assignee: 박규식; 박재현; 온승엽; 김선영
Priority date: 2001-12-06
Filing date: 2001-12-06
Publication date: 2004-08-25
Also published as: KR20030046726A

Abstract

The present invention relates to a speech rate conversion method and system using a dual SOLA algorithm, which converts speech rate so that less information is discarded in compression mode, reducing the "clicking" artifact, and less information is repeated in expansion mode, reducing the "reverberation" artifact.

The speech rate conversion method using the dual SOLA algorithm according to the present invention outputs a speech signal at the speech rate desired by the user, and comprises the steps of: (a) multiplying the speech signal by a window to divide it into frames of a predetermined length; (b) selecting, according to the rate of speed change, a synthesis shift value of the output signal that is fixed to a predetermined value, and varying the analysis shift value of the input signal to obtain the next frame; (c) repositioning the frames by shifting them by the synchronization length obtained by cross-correlation in the overlap interval between frames; and (d) weighting the overlapping frame samples in the overlap interval and outputting the rate-converted speech signal.

Description

Time scale modification method and system using Dual-SOLA algorithm

The present invention relates to a speech processing method and system, and more particularly to a method and system for converting speech rate using a dual SOLA algorithm.

Time Scale Modification (hereinafter TSM) changes the playback speed of a signal by compressing or expanding the input signal along the time axis. It can be applied in many fields, such as music tempo conversion in karaoke machines, playback speed conversion for foreign-language learning, and data compression and decompression. Recently it has also been used together with specific speech decoders to implement additional functions, such as foreign-language learning, in portable audio devices such as MP3 players.

TSM algorithms can be broadly divided, by how they transform the time axis, into time-domain methods and frequency-domain methods.

Representative time-domain methods include the OLA (Overlap-Add) algorithm, which segments the input signal into windows and compresses or expands it by overlap-and-add operations between neighboring windows; the SOLA algorithm, which performs the overlap-and-add using pitch synchronization between neighboring windows and thereby overcomes OLA's "clicking" (in compression) and "reverberation" (in expansion) artifacts to produce more natural-sounding output; and various SOLA variants.

A representative frequency-domain method is the Griffin and Lim algorithm, which uses the STFT.

Of these TSM algorithms, the one in general use is SOLA.

SOLA is a representative time-domain tempo conversion method. It improves on the shortcomings of the earlier OLA method by performing the overlap-add operation using pitch information between neighboring windows.

FIG. 1 shows the parameters defined and used in the SOLA algorithm. In FIG. 1, winlen is the length of the fixed-size frames obtained by multiplying the original input signal by a window; Sa is the analysis shift, the analysis segment unit of the input signal; Ss is the synthesis shift, the synthesis segment unit of the output signal; and Kmax, which serves to pitch-synchronize two consecutive frames, defines the maximum range of the pitch search.

The rate of speed change α is defined as Ss / Sa. If α is less than 1 (α < 1), the speech is compressed and plays faster than the original; if α is greater than 1 (α > 1), the speech is expanded and plays slower than the original. In general, α takes values between 0.5 (twice as fast) and 2.0 (twice as slow).
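The relation α = Ss / Sa can be made concrete with a short Python sketch. The function names and the choice Ss = 50 are illustrative assumptions, not part of the patent text; rounding Sa to whole samples is likewise an assumption.

```python
def analysis_shift(ss: int, alpha: float) -> int:
    """Analysis shift Sa implied by alpha = Ss / Sa, rounded to whole samples."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return max(1, round(ss / alpha))


def playback_mode(alpha: float) -> str:
    """Name the mode the text associates with a given alpha."""
    if alpha < 1:
        return "compression (faster)"
    if alpha > 1:
        return "expansion (slower)"
    return "unchanged"


# Ss = 50 is only an illustrative value here (hypothetical choice).
for alpha in (0.5, 1.0, 2.0):
    print(alpha, playback_mode(alpha), "Sa =", analysis_shift(50, alpha))
```

With Ss held fixed, a smaller α means a larger step through the input (more input skipped per output step), which is why α < 1 plays faster.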

SOLA first multiplies the original signal by a window to cut it into frames of fixed size (winlen) and copies the first frame of the input signal x(n) unchanged into the first frame of the output signal y(n). Each subsequent frame is obtained by advancing the window in fixed steps of Sa, the analysis unit of the input signal, so that successive windows overlap.
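The framing step can be sketched as follows. This is a minimal illustration that assumes a rectangular window (plain slicing) and drops any incomplete trailing frame; the function name `split_frames` is hypothetical, and the text does not say which window shape is used.

```python
def split_frames(x, winlen, sa):
    """Cut x into frames of winlen samples, advancing by sa samples each time.

    A rectangular window is assumed, so each frame is a plain slice of x.
    Trailing samples that do not fill a whole frame are dropped.
    """
    if winlen <= 0 or sa <= 0:
        raise ValueError("winlen and sa must be positive")
    last_start = len(x) - winlen
    return [x[start:start + winlen] for start in range(0, last_start + 1, sa)]


frames = split_frames(list(range(10)), winlen=4, sa=3)
print(frames)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Because sa is smaller than winlen in this example, consecutive frames share samples, which is the overlap the text refers to.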

Once two consecutive frames are defined, the pitch synchronization length k between the neighboring frames is found, the frames are repositioned, and the overlapping frame samples are weighted to obtain the final rate-converted output signal.

The synchronization length k between two consecutive frames is chosen as the value that maximizes the cross-correlation of Equation 1, which is evaluated over the overlap length between the current input frame of x(n) and the output signal y(n) synthesized so far.

Once k is found, Equation 2 weights the input frame and the existing output over the overlap interval and repositions the frames, yielding a signal whose length differs from that of the original input, that is, a rate-converted signal.
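The search for k and the weighted overlap can be sketched in Python. Since the bodies of Equations 1 and 2 are not reproduced in this text, the normalized form of the cross-correlation and the linear weighting ramp below are assumptions taken from common SOLA practice, and the names `best_sync_offset` and `crossfade` are hypothetical.

```python
import math


def best_sync_offset(y_tail, frame, kmax):
    """Return the offset k in [0, kmax] with the highest normalized cross-correlation.

    y_tail: output samples already synthesized, starting at the nominal
            insertion point of the new frame.
    frame:  the next input frame.
    """
    best_k, best_r = 0, -math.inf
    for k in range(kmax + 1):
        overlap = min(len(y_tail) - k, len(frame))
        if overlap <= 0:
            break
        a = y_tail[k:k + overlap]
        b = frame[:overlap]
        num = sum(p * q for p, q in zip(a, b))
        den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
        r = num / den if den > 0 else 0.0
        if r > best_r:
            best_k, best_r = k, r
    return best_k


def crossfade(old, new):
    """Blend two equal-length overlaps with a linear ramp from old to new."""
    n = len(old)
    return [(1 - j / n) * old[j] + (j / n) * new[j] for j in range(n)]
```

For a periodic test signal, the offset found this way is the shift that brings the two waveforms back into phase, which is what "pitch synchronization" means here.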

FIG. 2 shows how the SOLA algorithm operates in compression mode (α < 1, faster) and expansion mode (α > 1, slower), according to the rate of speed change α = Ss / Sa.

In FIGS. 2A and 2B, (i) is the original signal, (ii) is the output signal repositioned according to α but not yet synchronized, and (iii) is the final output signal after synchronization by the synchronization length k.

To pitch-synchronize two consecutive frames, this conventional SOLA algorithm shifts the current window toward the previous window by the synchronization length k before overlapping them. This increases the amount of speech information that overlaps, which is one cause of the "clicking" artifact in compression mode and the "reverberation" artifact in expansion mode.

The technical problem addressed by the present invention is to provide a speech rate conversion method and system using a dual SOLA algorithm that converts speech rate so that less information is discarded in compression mode, reducing "clicking", and less information is repeated in expansion mode, reducing "reverberation".

FIG. 1 shows the parameters defined and used in the SOLA algorithm.

FIG. 2 shows the operating principle of the compression mode (α < 1, faster) and expansion mode (α > 1, slower) of the SOLA algorithm according to the rate of speed change (α = Ss / Sa).

FIG. 3 shows the relationship between pitch and window length in the input speech signal.

FIG. 4 shows the relationship among the synthesis shift (Ss), the analysis shift (Sa), and the window length.

FIG. 5 is a flowchart of the speech rate conversion method using the dual SOLA algorithm according to the present invention.

FIG. 6 is a block diagram of a speech rate conversion system according to the present invention that integrates a speech decoder with the dual SOLA algorithm.

FIG. 7 shows each of the test sounds used in the simulations.

FIG. 8 shows the buffering performed before the dual SOLA algorithm is applied to ITU G.729.

FIG. 9 shows the output speech waveforms for an English male voice in compression mode (α = 0.500, 0.750, 0.875) of the speech rate conversion system using the dual SOLA algorithm.

FIG. 10 shows the output speech waveforms for an English male voice in expansion mode (α = 1.250, 1.500, 1.750, 2.000) of the speech rate conversion system using the dual SOLA algorithm.

FIG. 11 shows the output speech waveforms for English speech in compression mode (α = 0.500, 0.750, 0.875) of the speech rate conversion system using the dual SOLA algorithm with the interpolation preprocessing step.

FIG. 12 compares, for the same English speech at α = 0.5, the output without and with the interpolation step.

FIG. 13 shows the output speech waveforms for English speech in expansion mode (α = 1.250, 1.500, 1.750, 2.000) of the speech rate conversion system using the dual SOLA algorithm with the interpolation preprocessing step.

FIG. 14 compares, for the same English speech at α = 2.0, the output without and with the interpolation step.

To solve the above technical problem, the speech rate conversion method using the dual SOLA algorithm according to the present invention outputs a speech signal at the speech rate desired by the user, and comprises the steps of: (a) multiplying the speech signal by a window to divide it into frames of a predetermined length; (b) selecting, according to the rate of speed change, a synthesis shift value of the output signal that is fixed to a predetermined value, and varying the analysis shift value of the input signal to obtain the next frame; (c) repositioning the frames by shifting them by the synchronization length obtained by cross-correlation in the overlap interval between frames; and (d) weighting the overlapping frame samples in the overlap interval and outputting the rate-converted speech signal.

The method further comprises, before step (a), a preprocessing step of adjusting the sampling rate of the speech signal.

In step (a), the window is long enough to contain two pitch periods of the input speech signal and, in consideration of synchronization between neighboring frames, has a length of up to 3 to 4 pitch periods.

Step (b) uses different synthesis shift lengths in the expansion mode (slower playback) and the compression mode (faster playback) of the speech.

In addition, the synchronization length is determined as the value that maximizes a cross-correlation value computed over the overlap between the input frame and the output signal, expressed in terms of the analysis shift Sa (the analysis segment unit of the input signal) and the synthesis shift Ss (the synthesis segment unit of the output signal).

The synchronization length is a value that allows the pitch information to be aligned; it is larger than one pitch period and smaller than the window length.

To solve the other technical problem above, the speech rate conversion system using the dual SOLA algorithm according to the present invention converts the rate of a speech signal output through a speech decoder, and comprises: a frame division unit that divides the speech signal output from the speech decoder into frames having a predetermined window length; a frame calculation unit that selects, according to the rate of speed change, a synthesis shift value of the output signal fixed to a predetermined value, and varies the analysis shift value of the input signal to obtain the next frame; a synchronization length calculation unit that obtains the synchronization length by cross-correlation in the overlap interval between frames; a frame repositioning unit that repositions the frames by shifting them by the synchronization length; and a weighting unit that weights the overlapping frame samples in the overlap interval and outputs the rate-converted speech signal.

The system further comprises a preprocessing unit that adjusts the sampling rate of the speech signal output from the speech decoder before that signal is input to the frame division unit.

In addition, the synchronization length calculation unit determines the synchronization length as the value that maximizes a cross-correlation value computed over the overlap between the input frame and the output signal, expressed in terms of the analysis shift Sa (the analysis segment unit of the input signal) and the synthesis shift Ss (the synthesis segment unit of the output signal).

Hereinafter, the present invention is described in detail by way of embodiments and with reference to the accompanying drawings to aid understanding.

First, the optimization of the parameters used in the dual SOLA algorithm according to the present invention is described.

That is, optimized values of the parameters are determined and then applied to the algorithm according to the present invention. The parameters include the window length, the synchronization length, the analysis shift, and the synthesis shift.

First, the window length.

The window that segments the input signal should generally be large enough to contain two pitch periods of the input speech signal and, in consideration of synchronization between neighboring windows, may span at most 3 to 4 pitch periods.

Pitch generally varies from speaker to speaker: the fundamental frequency lies mainly in the 50-250 Hz range for men and the 120-500 Hz range for women. Therefore, when the input signal is sampled at 8 kHz and one pitch period is assumed to average 100 samples, the window length should be at least 200 samples and should not exceed 400 samples.
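The arithmetic behind the 200 to 400 sample range can be checked with a few lines of Python. The function name is hypothetical, and the 80 Hz pitch is simply the frequency whose period is 100 samples at 8 kHz; the text itself states only the 100-sample average period.

```python
def window_length_bounds(sample_rate_hz, pitch_hz, min_periods=2, max_periods=4):
    """Window length limits, in samples, for a given pitch frequency.

    The window must hold at least min_periods pitch periods and at most
    max_periods, following the rule stated in the text.
    """
    period = round(sample_rate_hz / pitch_hz)  # samples per pitch period
    return min_periods * period, max_periods * period


print(window_length_bounds(8000, 80))  # (200, 400)
```

A speaker with a higher pitch has a shorter period in samples, so the same window holds more periods of that voice.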

FIG. 3 shows the relationship between pitch and window length in the input speech signal.

FIG. 3A shows the case where the window is shorter than two pitch periods (one pitch period is assumed). The window is so short that the region where neighboring windows overlap shrinks and the probability that they share common pitch information drops to nearly 0%, so pitch alignment in the overlap interval becomes impossible.

Conversely, if the window is too long, as in FIG. 3B (four pitch periods are assumed), the overlap region grows and neighboring windows share more common pitch information, so "clicking" and "reverberation" occur in this region and can degrade sound quality.

In simulations with a range of window lengths, a window length of 300 samples gave the best output sound quality for both male and female voices, whereas at window lengths of 320 samples or more the "reverberation" artifact was severe in expansion mode for female voices.

This happens because female voices have a shorter pitch period than male voices. In the present invention, on the assumption that a common algorithm is applied to both male and female voices, the window length is set to 300 samples.

Second, the synchronization length (Kmax).

Kmax, which serves to pitch-synchronize two consecutive frames, defines the maximum range of the pitch search. It is the minimum value for which the pitch information can be aligned by the cross-correlation operation, so it must be larger than one pitch period, and it must lie in the range [0, window length]. If Kmax is smaller than one pitch period, the probability of finding the pitch drops and sound quality degrades markedly. Therefore, for an input speech signal sampled at 8 kHz, Kmax is bounded below at about 100 samples. For the upper bound, simulations over the overlap region of neighboring windows showed frequent jitter when Kmax was 200 samples or more, so Kmax was limited to two pitch periods or less.

In simulations with a variety of male and female voices over the range 100 ≤ Kmax ≤ 200 samples, Kmax = 100 produced little "clicking" or "reverberation" and gave output speech that was comfortable to listen to. The present invention therefore takes 100 samples as the optimal value of Kmax.
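The limits on Kmax described above can be expressed as a small check. This is a sketch under the stated 100-sample pitch period; the function name is hypothetical, and the lower bound is treated as inclusive because the text selects Kmax = 100 for a 100-sample period.

```python
def is_valid_kmax(kmax, pitch_period, winlen, max_periods=2):
    """True if kmax lies between one and max_periods pitch periods
    and is also shorter than the window."""
    return pitch_period <= kmax <= max_periods * pitch_period and kmax < winlen


for kmax in (50, 100, 200, 250):
    print(kmax, is_valid_kmax(kmax, pitch_period=100, winlen=300))
```

Within the allowed range the text still prefers the smallest value, since a wider search increased jitter in the reported listening tests.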

Third, the analysis shift (Sa) and the synthesis shift (Ss).

In contrast to SOLA, the dual SOLA algorithm according to the present invention reverses the direction in which the window is moved when searching for the synchronization length k used for frame synchronization, which reduces the speech information that overlaps between windows. It also fixes the value of Ss and runs the algorithm with the output signal as the reference.

That is, as the rate of speed change α = Ss / Sa varies, the output signal is produced by changing the Sa value of the input signal. Both Ss and Sa must be larger than one sample, and because α = Ss / Sa, once the maximum of either Ss or Sa is known the maximum of the other follows.

FIG. 4 shows the relationship among Ss, Sa, and the window length.

FIG. 4 illustrates the case where the analysis shift Sa of the input signal is larger than the window length. In compression mode, the cross-correlation interval is shortened by the null region, so the sound becomes less smooth; in expansion mode, null values are included and the sound is interrupted. Sa must therefore be smaller than the window length, and from α = Ss / Sa, considering the case α = 0.5 (compression mode), the synthesis shift Ss must be smaller than half the window length.
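These two constraints can be checked mechanically. The sketch below assumes Sa is derived as Ss / α and is not rounded; the function name and the message strings are hypothetical.

```python
def check_shifts(ss, alpha, winlen):
    """Return a list of violated constraints (empty if the shifts are usable)."""
    sa = ss / alpha  # analysis shift implied by alpha = Ss / Sa
    problems = []
    if ss < 1 or sa < 1:
        problems.append("Ss and Sa must be at least one sample")
    if sa >= winlen:
        problems.append("Sa must be smaller than the window length")
    if ss >= winlen / 2:
        problems.append("Ss must be smaller than half the window length")
    return problems


print(check_shifts(75, 2.0, 300))   # no violations
print(check_shifts(150, 0.5, 300))  # Sa equals the window length, Ss equals half of it
```

The half-window limit on Ss is simply the Sa limit evaluated at the most demanding rate, α = 0.5, where Sa is twice Ss.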

In simulations under these conditions, good results were obtained at Ss = 75 and 50 samples.

Table 1 lists the simulation results used to find the optimal value of each SOLA parameter.

Table 1. Listening-test grades (A, B, C) by synthesis shift Ss, rate of speed change α, and window length in samples (300, 250, 200). A dash marks a combination with no grade in the source.

Ss  | α = 2.0       | α = 1.5       | α = 0.875     | α = 0.5
    | 300  250  200 | 300  250  200 | 300  250  200 | 300  250  200
150 | C    -    -   | B    -    -   | B    -    -   | C    -    -
125 | B    C    -   | B    B    -   | B    B    -   | C    C    -
100 | B    B    C   | B    B    B   | B    B    B   | B    B    C
75  | B    B    C   | A    A    B   | B    B    C   | B    B    C
50  | C A A B B B C (seven grades in source order; their column positions are not recoverable)

The simulations for each parameter assumed input speech from ITU G.729 sampled at 8 kHz, with window lengths of 200, 250, and 300 samples; Kmax of 100 and 200 samples; and synthesis shifts Ss of 150, 125, 100, and 75 samples for a 300-sample window, 125, 100, 75, and 50 samples for a 250-sample window, and 100, 75, and 50 samples for a 200-sample window. They were run at rates of speed change α = 2.0 (2 times slower), α = 1.500 (1.5 times slower), α = 0.875 (1/0.875 times faster), and α = 0.5 (2 times faster).

Three voices were used as input for the simulations: a Korean male, a Korean female, and a non-Korean male. For the subjective evaluation of output sound quality, three male and two female listeners were selected. Because slight, localized noise in the output does not prevent a listener from following the overall content, each listener heard the rate-converted output and rated how noticeable artifacts such as "clicking" or "reverberation" were, on three grades: A (none at all), B (slight), and C (pronounced).

As the listening-test results in Table 1 show, SOLA performs well overall as the window length approaches 300 samples and at synthesis shifts of Ss = 75 and 50 samples. For the synthesis shift in particular, Ss = 75 samples gave good sound quality in expansion mode and Ss = 50 samples in compression mode. The exceptions were the extreme rates α = 2.0 (2 times slower) and α = 0.5 (2 times faster), where partial "clicking" and "reverberation" noise was perceptible; in expansion mode the noise was most severe on plosive sounds such as "ㅋ, ㅌ, ㅎ".

Based on the simulation results in Table 1, the present invention therefore sets the optimal SOLA parameters to a window length of 300 samples, Kmax = 100 samples, and a synthesis shift of Ss = 75 samples in expansion mode and Ss = 50 samples in compression mode, and uses a Dual SOLA algorithm that applies the optimal synthesis shift for each mode in the actual implementation.
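The mode-dependent parameter choice can be sketched as a small lookup. The `SolaParams` container and `select_params` function are hypothetical names; treating α = 1.0 like compression mode and rounding Sa to whole samples are assumptions, since the text does not cover those cases.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SolaParams:
    winlen: int  # window length in samples
    kmax: int    # maximum synchronization search range
    ss: int      # synthesis shift (fixed per mode)
    sa: int      # analysis shift (derived from alpha)


def select_params(alpha: float) -> SolaParams:
    """Pick the 8 kHz parameter set reported in the text for a given alpha."""
    if not 0.5 <= alpha <= 2.0:
        raise ValueError("alpha is expected to lie in [0.5, 2.0]")
    ss = 75 if alpha > 1.0 else 50  # expansion uses 75, compression uses 50
    return SolaParams(winlen=300, kmax=100, ss=ss, sa=round(ss / alpha))


print(select_params(1.5))
print(select_params(0.5))
```

Keeping Ss fixed per mode means only Sa changes when the user changes the playback rate within that mode.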

Because the dual SOLA algorithm used in the present invention applies different synthesis shift lengths in the expansion mode (slower playback) and the compression mode (faster playback), a high-performance speech rate conversion system can be built.

The speech signal output from the speech decoder is multiplied by a window and divided into frames of a predetermined length (S510).

Here, the window preferably has a length that contains two pitch periods of the input speech signal and, in consideration of synchronization between neighboring frames, a length of up to 3 to 4 pitch periods.

Before step S510, a preprocessing step adjusts the sampling rate of the speech signal (S505).

The preprocessing step is as follows.

The dual SOLA algorithm was designed with optimal settings for each parameter, but partial "clicking" and "reverberation" noise was still perceptible at the extreme rates α = 2.0 (2 times slower) and α = 0.5 (2 times faster). Performance at these extreme rates can be improved by interpolating the 8 kHz input speech signal to a 16 kHz sampling rate and using the result as the input to dual SOLA.

As the preprocessing step for dual SOLA, the present invention uses linear interpolation, a first-order interpolation whose simple structure makes it easy to implement. When the interpolation step is applied, the sampling rate of the input signal changes from 8 kHz to 16 kHz, so the optimal SOLA parameters (the window length, Kmax, and the synthesis shift Ss) are doubled in length.
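A 2x linear interpolation and the matching parameter scaling might look like the sketch below. The function names are hypothetical, and holding the final sample so that the output has exactly twice as many samples is an assumption; the text does not describe how the signal edge is handled.

```python
def upsample2_linear(x):
    """Double the sampling rate by inserting the midpoint between neighbours."""
    if not x:
        return []
    out = []
    for a, b in zip(x, x[1:]):
        out.extend((a, (a + b) / 2))
    out.extend((x[-1], x[-1]))  # hold the last sample: output has 2 * len(x) samples
    return out


def scale_params(winlen, kmax, ss, factor=2):
    """Scale sample-count parameters to match the new sampling rate."""
    return winlen * factor, kmax * factor, ss * factor


print(upsample2_linear([0, 2, 4]))  # [0, 1.0, 2, 3.0, 4, 4]
print(scale_params(300, 100, 75))   # (600, 200, 150)
```

Doubling the parameters keeps each one covering the same duration in milliseconds after the sampling rate changes.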

According to the rate of speed change, a synthesis shift value of the output signal fixed to a predetermined value is selected, and the analysis shift value of the input signal is varied to obtain the next frame (S520).

The frames are repositioned by shifting them by the synchronization length obtained by cross-correlation in the overlap interval between frames (S530).

The synchronization length is chosen as the value that maximizes the cross-correlation given by Equation 1. It is a value that allows the pitch information to be aligned; it is larger than one pitch period and smaller than the window length.

Weights are applied to the overlapping frame samples in the overlap region, and the speed-converted voice signal is output (S540).
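
The weighting of the overlapping samples can be illustrated with a simple cross-fade. The weighting function is not specified in this section, so the linear ramp below is an assumption, as is the function name `crossfade`.

```python
def crossfade(prev_tail, next_head):
    """Blend the overlapping samples of two frames with complementary weights.

    Assumption: a linear ramp; the weight of the new frame rises
    from 1/(n+1) to n/(n+1) across the n overlapping samples.
    """
    n = min(len(prev_tail), len(next_head))
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight given to the incoming frame
        blended.append((1 - w) * prev_tail[i] + w * next_head[i])
    return blended
```

Because the two weights always sum to one, a region where both frames carry the same waveform passes through unchanged, which is what the preceding synchronization step tries to arrange.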

The following describes a voice speed conversion system that takes as input the 8 kHz, 80 samples/frame voice signal of an ITU G.729 voice decoder and, through the dual SOLA algorithm, plays the output voice back up to twice as fast or twice as slow, at optimized voice quality, according to the speed the user wants.

The voice speed conversion system according to the present invention can be used for various purposes, for example to add functions such as foreign-language learning to portable audio devices such as MP3 players, or in portable voice recorders.

The voice decoder (600) is an 8 kbps speech compression coder using the CS-ACELP (Conjugate-Structure Algebraic-Code-Excited Linear Prediction) scheme. CS-ACELP determines the pitch and codebook parameters through an analysis-by-synthesis structure, and the excitation signal is vector-quantized and stored in the codebook. Compared with earlier CELP schemes, it improves sound quality, reduces computation, and can encode a voice signal at a low bit rate without degrading voice quality. ITU G.729 groups the 8 kHz-sampled input voice signal into 10 ms frames of 80 samples, extracts 10th-order LPC (Linear Predictive Coding) coefficients, converts them to LSP (Line Spectrum Pair) coefficients, applies two-stage vector quantization, and quantizes and transmits the pitch value; the decoder reconstructs the voice in 10 ms units of 80 samples/frame. To integrate the dual SOLA algorithm with the ITU G.729 voice decoder, the 10 ms, 80-sample frames are accumulated in a buffer up to the window length of the dual SOLA before the algorithm is applied.

The frame dividing unit (610) divides the voice signal output from the voice decoder into frames having a predetermined window length.

The frame calculation unit (620) selects a synthesis shift value of the output signal, fixed to a predetermined value according to the speed change rate, and obtains the next frame by varying the analysis shift value of the input signal.

The synchronization length calculation unit (630) obtains the synchronization length by cross-correlation over the overlap region between frames. The synchronization length is the value that maximizes the cross-correlation value given by Equation 1. It must also allow the pitch information to be aligned, so it is larger than one pitch period and smaller than the window length.

The frame repositioning unit (640) rearranges the frames by shifting them by the synchronization length.

The weighting unit (650) applies weights to the overlapping frame samples in the overlap region and outputs the speed-converted voice signal.

The preprocessing unit (660) adjusts the sampling rate of the voice signal output from the voice decoder (600) before that signal is input to the frame dividing unit (610). This preprocessing is performed by interpolation.

The voice speed conversion system may be implemented either with or without the preprocessing (interpolation) step.

Accordingly, in an implementation without the interpolation step, the 10 ms, 80 samples/frame output of ITU G.729 is stored in a buffer up to a window length of 300 samples, corresponding to 37.5 msec, and the dual SOLA is applied. An interpolated input signal, by contrast, is accumulated in a window buffer of twice the length, in 75 msec units, before the algorithm is applied.

A simulation and performance evaluation of the voice speed conversion system were carried out. As described above, the optimal dual SOLA parameters used in the simulation were a window length of 300 samples, a second parameter (symbol not reproduced here) of 100 samples, and a synthesis shift S_s of 75 samples in expansion mode and 50 samples in compression mode.

Three test utterances were used as input in the simulation: a Korean male voice (3.79 sec), a Korean female voice (4.01 sec), and a foreign male voice (3.05 sec). Each test utterance is sampled at 8 kHz, 16 bit, mono.

Male voice: "계절이 지나가는 하늘에는 가을로 가득 차 있습니다." ("The sky through which the seasons pass is filled with autumn.")

Female voice: "계절이 지나가는 하늘에는 가을로 가득 차 있습니다." ("The sky through which the seasons pass is filled with autumn.")

English voice: "We have just dock on the beautiful shores of the same problem."

Fig. 7 shows each of the test utterances used in the simulation.

ITU G.729 packs the 8 kHz-sampled digital input signal into 10 msec frames of 80 samples, so applying the dual SOLA, whose window length is 300 samples, requires appropriate buffer management.

In Fig. 8, because the dual SOLA must accumulate one window length (300 samples) of input data, 320 samples, corresponding to four ITU G.729 frames, are accumulated in a buffer, and the 20 samples that exceed the window length are stored in a separate buffer for use in the next window operation.

A drawback of the dual SOLA is that index management is tricky, because the buffer index also varies with the speed change rate α. For example, in Fig. 8, when S_s is 75 and α is 1.75, S_a is 43. To fetch the next 300 samples after one frame has been processed, 238 samples, that is, 300 minus the remaining 42 samples and the 20 samples held in the leftover buffer, must be read from the G.729 decoder. Because the G.729 decoder is fixed at 80 samples/frame, however, the next 240 samples are read; 238 of them are used as input buffer samples, and the remaining 2 samples are kept in the leftover buffer for use when the following frame is processed. Since the user can change α, the buffer indices change each time, and the buffers must be adjusted to match the speed change rate actually in effect.
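
The buffer arithmetic in this example can be checked with a short sketch. The relation S_a = S_s / α rounded to the nearest integer is an assumption inferred from the quoted values (75 / 1.75 ≈ 42.86, given as 43); the function names and the constant `G729_FRAME` are hypothetical.

```python
import math

G729_FRAME = 80  # samples per G.729 frame (10 ms at 8 kHz)


def analysis_shift(s_s, alpha):
    """Analysis shift S_a for a given synthesis shift S_s and rate alpha.

    Assumption: S_a = S_s / alpha, rounded to the nearest integer.
    """
    return int(s_s / alpha + 0.5)


def decoder_frames_for(samples_needed):
    """Whole G.729 frames to read, and the samples left over afterwards."""
    frames = math.ceil(samples_needed / G729_FRAME)
    leftover = frames * G729_FRAME - samples_needed
    return frames, leftover


# First window: 300 samples need 4 frames (320 samples), leaving 20.
first = decoder_frames_for(300)

# Next window in the example: 300 - 42 (remaining) - 20 (leftover buffer) = 238.
following = decoder_frames_for(300 - 42 - 20)
```

With these definitions, `first` is `(4, 20)` and `following` is `(3, 2)`, matching the 320/20 and 240/2 figures in the text.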

Figs. 9 and 10 show the output voice waveforms for the English male voice produced by the voice speed conversion system using the dual SOLA, in compression mode (α = 0.500, 0.750, 0.875) and in expansion mode (α = 1.250, 1.500, 1.750, 2.000).

Because the results for the Korean male and female voices lead to similar conclusions, their output waveform figures are omitted and the discussion is based on the English male voice.

In the compression-mode simulation of Fig. 9, good-quality output voice was heard at most speed change rates. The exception was α = 0.5 (twice as fast), where clicking was visible, particularly in the "dock" portion marked with a dotted line, and was also perceptible in an actual listening test. In the expansion mode of Fig. 10, reverberation appeared throughout the utterance at α = 2.0 (twice as slow).

The clicking and reverberation at the algorithm's extreme speed change rates, α = 2.0 and α = 0.5, are consistent with the simulation results and can be overcome by interpolating the input signal as a preprocessing step.

Apart from the algorithm's extreme speed change rates, α = 2.0 and α = 0.5, the voice speed conversion system using the dual SOLA algorithm described above performs well overall at the remaining rates.

Performance at the algorithm's extreme speed change rates can be improved by converting the 8 kHz voice signal of the ITU G.729 decoder to a 16 kHz sampling rate and using it as the input of the dual SOLA. The interpolation used is linear interpolation, a first-order method whose simple structure makes it easy to implement algorithmically, applied as the preprocessing step of the dual SOLA. When interpolation is applied, the sampling rate of the input signal changes from 8 kHz to 16 kHz, so the optimal dual SOLA parameters double: the window length goes from 300 to 600 samples, the second parameter (symbol not reproduced here) from 100 to 200 samples, and the synthesis shift S_s from 75 to 150 samples in expansion mode and from 50 to 100 samples in compression mode. Because the window length doubles, the cross-correlation computation of the frame synchronization search, the most computationally expensive part of the dual SOLA, increases.

To reduce the computation of the algorithm, simulations were therefore run with the dual SOLA parameter set scaled by 1.25, 1.5, 1.59, 1.67, and 1.75 times. They confirmed that a parameter set scaled by 1.75 gives output sound quality nearly equal to that of the doubled set, which reduces the total computation by 1/8. That parameter set is a window length of 525 samples, a second parameter (symbol not reproduced here) of 175 samples, and a synthesis shift S_s of 131 samples in expansion mode and 88 samples in compression mode.
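
The scaled parameter sets quoted above follow from multiplying the 8 kHz base values and rounding to the nearest sample, as the sketch below shows. The dictionary keys and the function name `scale_params` are hypothetical; in particular, `second_param` stands for the 100-sample parameter whose symbol is not legible in this text.

```python
# Base (8 kHz, no interpolation) parameter set from the simulation, in samples.
BASE_PARAMS = {
    "window": 300,
    "second_param": 100,      # symbol not reproduced in the source text
    "shift_expansion": 75,    # synthesis shift S_s, expansion mode
    "shift_compression": 50,  # synthesis shift S_s, compression mode
}


def scale_params(factor):
    """Scale every parameter and round to the nearest whole sample."""
    return {name: int(value * factor + 0.5) for name, value in BASE_PARAMS.items()}


doubled = scale_params(2.0)   # full 16 kHz parameter set
reduced = scale_params(1.75)  # reduced set reported as nearly equivalent
```

The ratio 1.75 / 2 = 0.875 agrees with the reported 1/8 saving if the cost is taken to grow roughly in proportion to the scale factor, which is an assumption here rather than a statement from the source.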

Fig. 12 compares, for the same English voice at α = 0.5, the case without interpolation and the case with it. Fig. 12a enlarges the "dock" portion where clicking occurred in Fig. 9a, and Fig. 12b enlarges the "dock" portion of the output voice signal obtained with the interpolation preprocessing. As the figures show, in the 0.44 to 0.46 sec interval of the output signal obtained with interpolation preprocessing, clicking is greatly reduced and the breaks between sounds disappear.

Fig. 14 compares, for the same English voice at α = 2.0, the case without interpolation and the case with it. Fig. 14a enlarges the "dock" portion where reverberation occurred in Fig. 10d, and Fig. 14b enlarges the "dock" portion of the output voice signal obtained with the interpolation preprocessing. As the figures show, Fig. 14a contains more repeated signal than Fig. 14b, and reverberation is greatly reduced in the output signal obtained with interpolation preprocessing.

The drawings and specification are merely illustrative of the present invention. They are used only to describe the invention, not to limit its meaning or to restrict the scope of the invention set out in the claims. Those of ordinary skill in the art will therefore understand that various modifications and equivalent embodiments are possible. Accordingly, the true technical scope of protection of the present invention should be determined by the technical spirit of the appended claims.

According to the present invention, an optimized dual SOLA algorithm for the ITU G.729 decoder voice signal is proposed on the basis of simulations and theoretical analysis of various SOLA parameters, and the "clicking" and "reverberation" that degrade sound quality at extreme speed change rates are reduced, improving overall performance, by adding a simple linear interpolation step at the voice input stage.

In addition, the system according to the present invention can be applied to a variety of digital language-learning systems that improve the user's language-learning efficiency.

Furthermore, beyond the voice speed conversion algorithm itself, the present invention makes it possible, when implementing a digital language-learning function in an MP3 player, to build a system that interconnects and integrates the ITU G.729 decoder in the MP3 player with the dual SOLA algorithm.

Claims

A method for outputting a voice signal at a voice speed desired by a user, the method comprising:

(a) multiplying the voice signal by a window to divide it into frames having a predetermined length;

(b) selecting a synthesis shift value of the output signal, fixed to a predetermined value according to the speed change rate, and varying the analysis shift value of the input signal to obtain the next frame;

(c) rearranging the frames by shifting them by the synchronization length obtained by cross-correlation over the overlap region between the frames; and

(d) applying weights to the overlapping frame samples in the overlap region and outputting the speed-converted voice signal;

the method further comprising a preprocessing step of adjusting the sampling rate of the voice signal before step (a),

wherein, in step (a), the window length covers two pitch periods of the input voice signal and, to allow for synchronization between neighboring frames, is 3 to 4 pitch periods long, and

wherein, in step (b), different synthesis shift lengths are used in the expansion mode (slow speed) and the compression mode (fast speed) of the voice, the method being a voice speed conversion method using a dual SOLA algorithm.

(Deleted)

The method of claim 1, wherein the synchronization length is determined by obtaining the cross-correlation value using

[Equation: cross-correlation formula, not reproduced in this text]

(where the symbols, not reproduced here, denote the overlap length of the two overlapping signals; the analysis shift, which is the analysis segment unit of the input signal; and the synthesis shift, which is the synthesis segment unit of the output signal)

and taking the value that maximizes the cross-correlation value, the method being a voice speed conversion method using a dual SOLA algorithm.

The method of claim 5, wherein the synchronization length is a value at which the pitch information can be aligned, being greater than one pitch period and less than the window length, the method being a voice speed conversion method using a dual SOLA algorithm.

A voice speed conversion system for converting the speed of a voice signal output through a voice decoder, the system comprising:

a frame dividing unit that divides the voice signal output from the voice decoder into frames having a predetermined window length;

a frame calculation unit that selects a synthesis shift value of the output signal, fixed to a predetermined value according to the speed change rate, and varies the analysis shift value of the input signal to obtain the next frame;

a synchronization length calculation unit that obtains a synchronization length by cross-correlation over the overlap region between the frames;

a frame repositioning unit that rearranges the frames by shifting them by the synchronization length; and

a weighting unit that applies weights to the overlapping frame samples in the overlap region and outputs the speed-converted voice signal,

the system further comprising a preprocessing unit that adjusts the sampling rate of the voice signal before the voice signal output from the voice decoder is input to the frame dividing unit.

(Deleted)

The system of claim 7, wherein the synchronization length calculation unit obtains the cross-correlation value using

[Equation: cross-correlation formula, not reproduced in this text]

(where the symbols, not reproduced here, denote the overlap length of the two overlapping signals; the analysis shift, which is the analysis segment unit of the input signal; and the synthesis shift, which is the synthesis segment unit of the output signal)

and determines the synchronization length as the value that maximizes the cross-correlation value, the system being a voice speed conversion system using a dual SOLA algorithm.