KR100309873B1

KR100309873B1 - A method for encoding by unvoice detection in the CELP Vocoder

Info

Publication number: KR100309873B1
Application number: KR1019980059740A
Authority: KR
Inventors: 배명진; 나덕수; 정찬중
Original assignee: 강상훈; 정보통신연구진흥원; 진영돈; (주)토미스정보통신
Priority date: 1998-12-29
Filing date: 1998-12-29
Publication date: 2001-12-17
Also published as: KR19990024266A

Abstract

PURPOSE: An encoding method using unvoiced-sound detection in a code-excited linear prediction (CELP) vocoder is provided to reduce the amount of computation and the transmission rate by detecting and encoding part of the unvoiced sound in speech. CONSTITUTION: The degree of spectral change is tracked using the LSP. The first frame of a portion that begins with unvoiced sound and the last frame of a portion that ends with unvoiced sound are detected using the LSP interval information and the variance of the LSP intervals. The first frame of the beginning unvoiced portion is encoded as 1, and the last frame of the ending unvoiced portion is encoded as 0. In decoding, a frame encoded as 1 is decoded using the LSP information of the next frame, and a frame encoded as 0 is decoded using the LSP information of the previous frame.

Description

A method for encoding by unvoice detection in the CELP Vocoder

Earlier methods of transmitting speech were applied only to a limited set of users and specific fields, so the transmission rate received little consideration; it was enough to deliver high-quality speech to the receiver. With the growth of multimedia and mobile communication, however, services once offered to specific groups or individuals are now offered to the general public, and the number of users has grown exponentially. As a result, the transmission rates used so far can no longer accommodate the user population, and the degradation in sound quality that occurs when the transmission rate is lowered to fit more users on the same channel has also become a problem. Against this background, speech coders, that is, vocoders (Vocoder: coder/decoder), have been developed.

Speech coding methods for storing or transmitting speech signals fall into three broad classes: waveform coding, source coding, and hybrid coding. Waveform coding does not separate the speech signal into components; it removes only the redundancy in the waveform itself, then encodes, transmits, and resynthesizes it. Waveform coding preserves high sound quality and the speaker's individuality, but the large amount of data needed to preserve the waveform results in a high transmission rate and large memory requirements. Source coding, by contrast, analyzes the excitation and filter components of the speech signal on the basis of a speech production model and encodes them separately and independently, so it needs little transmission bandwidth and little memory. However, because analysis and synthesis errors accumulate, the naturalness and intelligibility of the synthesized speech drop considerably. Hybrid coding combines the memory efficiency and low transmission rate of source coding with the high sound quality of waveform coding. It uses linear predictive coding for the formant information, and methods such as RELP, VELP, MELP, and CELP have been proposed, differing in how they handle the remaining residual signal.

G.723.1 is currently adopted as the ITU-T standard vocoder for multimedia and has a dual-rate structure of 5.3/6.3 kbps. At the 5.3 kbps rate it uses ACELP (Algebraic CELP), one of the CELP (Code Excited Linear Prediction) family of speech coding methods.

G.723.1, a CELP-type vocoder with a dual rate of 5.3/6.3 kbps, is used in commercially available Internet phones and as a vocoder for other mobile communication, and it provides good sound quality for its low transmission rate. However, G.723.1 divides the speech signal only into silence and speech for encoding, so unvoiced sound receives no separate treatment. In other words, unvoiced sound is analyzed and transmitted in the same way as voiced sound, which requires more computation and is also wasteful in terms of transmission rate.

The present invention addresses these problems: it uses the LSP obtained at the analysis stage to detect and encode part of the unvoiced sound in speech, thereby reducing the amount of computation and, consequently, the transmission rate.

FIG. 1 is a block diagram of a conventional CELP speech coder.

FIG. 2 is a set of graphs showing the LSP characteristics used for component separation in the present invention: FIG. 2a shows the spectrum of an unvoiced sound, FIG. 2b shows the spectrum of a voiced sound, FIG. 2c shows the LPC envelope and LSP of an unvoiced sound, and FIG. 2d shows the LPC envelope and LSP of a voiced sound.

FIG. 3 shows the LSP extracted by the CELP encoder: FIG. 3a is a graph of the speech waveform for 'She', and FIG. 3b is a graph of the LSP extracted by CELP.

FIG. 4 shows the parameters used for component separation in the present invention: FIG. 4a shows the speech waveform for 'She', FIG. 4b shows the spectral change, FIG. 4c shows the number of formants, and FIG. 4d shows the variance of the LSP intervals.

FIG. 5 shows an unvoiced portion detected by the component separation algorithm of the present invention.

FIG. 6 shows an unvoiced portion detected by the component separation algorithm of the present invention.

FIG. 7 shows an unvoiced portion detected by the component separation algorithm of the present invention.

To solve the above problem, the present invention provides a method of encoding by performing component separation using LSP parameters in a CELP vocoder. Spectral change is tracked with the LSP to find portions where the change is large, and the variance and interval information of the LSP are used together to detect the first frame of a portion where speech begins with an unvoiced sound, or the last frame of a portion where speech ends with an unvoiced sound. One bit is used to encode the detected frame: a frame detected as the first frame beginning with an unvoiced sound is encoded as 1, and a frame detected as the last frame ending with an unvoiced sound is encoded as 0. On decoding, a frame encoded as 1 is decoded using the LSP information of the frame that follows it, and a frame encoded as 0 is decoded using the LSP information of the preceding frame.

Hereinafter, the present invention is described in detail with reference to the accompanying drawings.

1. Principle of the CELP-Type Speech Coder

FIG. 1 shows the structure of G.723.1, a conventional CELP vocoder.

The CELP-type encoder (100) passes the excitation sequences stored in a codebook through two time-varying linear recursive filters and selects, from the resulting signals, the one that optimizes a given fidelity criterion.

The CELP encoder (100) analyzes the input speech signal to extract the necessary parameters, uses them to synthesize a speech signal, and compares the result with the input speech signal. This analysis-by-synthesis approach yields very good sound quality, but because speech must be synthesized and compared every time, the structure is very complex.
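The analysis-by-synthesis loop can be illustrated with a toy codebook search. This is only a minimal sketch: it uses a single all-pole filter, omits the perceptual weighting and the second filter of a real CELP coder, and the names `synthesize` and `search_codebook`, the random codebook, and the first-order filter coefficients are all hypothetical.

```python
import random


def synthesize(excitation, a):
    """Filter excitation through 1/A(z), A(z) = 1 + sum_k a[k] z^-k, zero initial state."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * out[n - k]
        out.append(acc)
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, squared_error) of the entry whose synthesis best matches target."""
    best = (-1, 0.0, float("inf"))
    for idx, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * v for t, v in zip(target, y)) / energy
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if err < best[2]:
            best = (idx, gain, err)
    return best


# Toy data (hypothetical): 8 random excitation vectors of one 60-sample subframe.
rng = random.Random(1)
codebook = [[rng.gauss(0.0, 1.0) for _ in range(60)] for _ in range(8)]
a = [1.0, -0.9]  # hypothetical first-order filter, for illustration only
target = [0.5 * v for v in synthesize(codebook[2], a)]
best_index, best_gain, best_err = search_codebook(target, codebook, a)
```

Every candidate has to be synthesized before it can be compared with the target, which is the source of the computational cost noted above.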

2. LSP Extraction in G.723.1

G.723.1, a CELP vocoder, uses a 10th-order linear predictive analyzer (10). For each subframe, a 180-sample window is centered on the subframe, and a Hamming window is applied.

The linear prediction coefficients (LPC) are computed using the Levinson-Durbin recursion. These LPC coefficients are used to construct the short-term perceptual weighting filter (20). The LPC synthesis filter is defined as follows.

Here i is the subframe index and ranges from 0 to 3.
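The defining equation is not reproduced in this text. For orientation only, a tenth-order all-pole synthesis filter consistent with the description above can be written as below; the filter symbol $H_i(z)$, the coefficient symbol $\alpha_{ij}$, and the sign convention are assumptions and should be checked against the G.723.1 Recommendation.

```latex
H_i(z) = \frac{1}{1 - \sum_{j=1}^{10} \alpha_{ij}\, z^{-j}}, \qquad 0 \le i \le 3
```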

The LPC coefficients $\{\alpha_j\}_{j=1..10}$ are converted to LSP coefficients $\{p'_j\}_{j=1..10}$ by searching along the unit circle and interpolating for zero crossings (LSP quantizer (30)).
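The chain from windowed samples to LSP frequencies can be sketched as follows. This is a minimal illustration, not the G.723.1 reference code: the function names, the grid size of 1024, the sign convention A(z) = 1 + Σ a[k] z^-k, and the use of radians are assumptions, and the random test frame is hypothetical.

```python
import math
import random

ORDER = 10  # 10th-order analysis, as in the text


def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(frame, order=ORDER):
    """Autocorrelation of the Hamming-windowed frame for lags 0..order."""
    w = [x * h for x, h in zip(frame, hamming(len(frame)))]
    return [sum(w[n] * w[n + k] for n in range(len(w) - k)) for k in range(order + 1)]


def levinson_durbin(r, order=ORDER):
    """Levinson-Durbin recursion; returns a[0..order] with A(z) = 1 + sum_k a[k] z^-k."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[j] * r[m - j] for j in range(1, m))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, m):
            a[j] = prev[j] + k * prev[m - j]
        a[m] = k
        err *= 1.0 - k * k
    return a


def lpc_to_lsp(a, grid=1024):
    """LSP frequencies in radians: grid search on the unit circle with
    linear interpolation at each sign change (zero crossing)."""
    ext = a + [0.0]
    rev = ext[::-1]
    p = [x + y for x, y in zip(ext, rev)]  # symmetric polynomial
    q = [x - y for x, y in zip(ext, rev)]  # antisymmetric polynomial
    half = (len(ext) - 1) / 2.0

    def f_p(w):
        return sum(c * math.cos((k - half) * w) for k, c in enumerate(p))

    def f_q(w):
        return sum(c * math.sin((k - half) * w) for k, c in enumerate(q))

    lsp = []
    for f in (f_p, f_q):
        w_prev = math.pi / grid
        v_prev = f(w_prev)
        for i in range(2, grid):
            w = math.pi * i / grid
            v = f(w)
            if v_prev * v < 0.0:
                lsp.append(w_prev + (w - w_prev) * v_prev / (v_prev - v))
            w_prev, v_prev = w, v
    return sorted(lsp)


# Hypothetical input: one 180-sample window of white noise.
rng = random.Random(0)
frame = [rng.gauss(0.0, 1.0) for _ in range(180)]
lsp = lpc_to_lsp(levinson_durbin(autocorrelation(frame)))
```

The grid endpoints 0 and π are skipped because the two polynomials always have trivial roots there, which are not LSP frequencies.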

3. Detection of Unvoiced Sound Using the LSP in CELP

The LSP has uniform spectral sensitivity, low spectral distortion in low-bit-rate coding, and good linear interpolation properties as a parameter.

FIG. 2 shows the LPC and LSP spectral characteristics of unvoiced and voiced sounds.

FIGS. 2a and 2b show the spectra of an unvoiced and a voiced sound, respectively, and FIG. 2c shows the LPC envelope and LSP spectrum of the unvoiced sound. In FIG. 2c, no LSPs indicating the bandwidth of a resonance peak appear at low frequencies. FIG. 2d shows the LPC envelope and LSP spectrum of the voiced sound, where closely spaced LSPs mark three resonance peaks. In the LSP spectrum of an unvoiced sound, by contrast, closely spaced lines appear only at high frequencies, and there are at most two such groups. That is, an unvoiced sound has its formants at high frequencies and usually has two or fewer, whereas a voiced sound has three or more formants distributed from low to high frequencies.

FIG. 3 shows the LSPs obtained by CELP analysis of the speech waveform 'She'. Each frame has ten unquantized coefficients, which can take a maximum value of 256 (2π) with a normalized frequency resolution of π/256. In FIG. 3, the frames up to frame 11 are unvoiced and those after it are voiced. Before frame 11 the first formant does not appear and the coefficient intervals show no large variation; where formants do appear, there are two or fewer. After frame 11, however, not only the first formant but three or more formants appear at very narrow intervals. These characteristics suggest that unvoiced and voiced sounds can be distinguished using the number of formants and the spacing between the ten LSP coefficients.

The present invention detects the starting frame of an unvoiced sound in a transition from silence to unvoiced sound, or the ending frame of an unvoiced sound in a transition from unvoiced sound to silence, and applies this to reducing the transmission rate. The criteria are the resonance peaks described above, that is, the number of formants, and the variance of the intervals between LSP coefficients; in addition, the spectral change is measured first.

$P_i^T$ is the LSP coefficient vector of the i-th frame, and $P_j^T$ is that of the j-th frame, which is the frame preceding the i-th frame. The spectral change is measured from the coefficients of these two frames as follows.

Distance(i) takes a large value where an unvoiced sound begins after silence and where an unvoiced sound gives way to silence. It also takes a large value in other sections where the formants change.
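The exact expression for Distance(i) is not reproduced in this text. As a minimal sketch, assuming a Euclidean distance between the LSP vectors of consecutive frames (an assumption, not necessarily the patent's formula), the spectral-change track could be computed as below; the function names are hypothetical.

```python
import math


def spectral_distance(lsp_cur, lsp_prev):
    """Euclidean distance between the LSP vectors of frame i and the frame before it
    (assumed metric; the patent's own equation is not available here)."""
    return math.sqrt(sum((c - p) ** 2 for c, p in zip(lsp_cur, lsp_prev)))


def distance_track(lsp_frames):
    """Distance(i) for every frame; frame 0 has no predecessor and is set to 0."""
    track = [0.0]
    for i in range(1, len(lsp_frames)):
        track.append(spectral_distance(lsp_frames[i], lsp_frames[i - 1]))
    return track
```

With any metric of this kind, a frame whose LSP vector differs strongly from its predecessor, such as an onset after silence, produces a peak in the track.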

FIG. 4a shows the speech waveform for 'She', and FIG. 4b shows the spectral change obtained using [Equation 3]. In FIG. 4b, frame 4 shows a large change where the unvoiced sound begins after silence. Because the sections the present invention aims to detect are unvoiced sections such as frame 4, the large values that appear in the other frames are not significant. FIG. 4c is a graph of the number of formants; frame 4 has two or fewer formants. FIG. 4d is a graph of the variance of the LSP intervals; because this variance is smaller for unvoiced sound than for voiced sound, a threshold is applied to it so that sections such as frame 4 are detected.

4. Experiments and Results

The experiments were run on an IBM-PC/Pentium (150) system interfaced with a commercial 16-bit AD/DA converter for speech input and output, and data were captured at sampling rates of 11 kHz and 8 kHz. Each frame was 240 samples long and was processed in subframes of 60 samples. To measure the performance of the processing, the following representative sentences, spoken by five male and female speakers of various ages, were used as samples.

Utterance 1: /Insu's little boy likes the genius boy./

Utterance 2: /Jesus taught the lesson of the creation of heaven and earth./

Utterance 3: /This is the speech communication research team of the Department of Information and Telecommunication, Soongsil University./

Utterance 4: /The human challenge of making a way through the sky is endless./

Utterance 5: Weather forecast

The unvoiced-sound detection algorithm was implemented in C within the 5.3 kbps G.723.1 speech coder. To compare transmission rates, two main steps were carried out. First, the LSP obtained after the LPC analysis of FIG. 1 is used to check whether a frame belongs to the unvoiced section to be detected. In this step, frames were detected that had a spectral change of 20 or more, a formant count of 2 or fewer, and an LSP interval variance of 20 or less. A Matlab program implementing the same algorithm was then written to confirm whether each frame detected by this decision logic was indeed the intended beginning of an unvoiced sound in a transition from silence to unvoiced sound, or the end of an unvoiced sound in a transition from unvoiced sound to silence.
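The decision logic can be sketched as follows. The three thresholds come from the text; everything else is an assumption made for illustration: LSP values on the 0 to 256 scale, a Euclidean spectral-change measure, and a formant count taken as the number of adjacent LSP pairs closer than a hypothetical `NARROW_GAP`.

```python
import math

SPECTRAL_CHANGE_MIN = 20.0    # from the text: spectral change of 20 or more
MAX_FORMANTS = 2              # from the text: formant count of 2 or fewer
INTERVAL_VARIANCE_MAX = 20.0  # from the text: LSP interval variance of 20 or less
NARROW_GAP = 10.0             # hypothetical: gap below which an LSP pair marks a formant


def lsp_intervals(lsp):
    """Differences between adjacent LSP coefficients."""
    return [b - a for a, b in zip(lsp, lsp[1:])]


def count_formants(lsp, narrow_gap=NARROW_GAP):
    """Count closely spaced adjacent LSP pairs (assumed stand-in for the formant count)."""
    return sum(1 for gap in lsp_intervals(lsp) if gap < narrow_gap)


def interval_variance(lsp):
    """Population variance of the LSP intervals."""
    gaps = lsp_intervals(lsp)
    mean = sum(gaps) / len(gaps)
    return sum((g - mean) ** 2 for g in gaps) / len(gaps)


def is_unvoiced_edge(lsp_cur, lsp_prev):
    """True when the frame satisfies all three conditions given in the text."""
    change = math.sqrt(sum((c - p) ** 2 for c, p in zip(lsp_cur, lsp_prev)))
    return (change >= SPECTRAL_CHANGE_MIN
            and count_formants(lsp_cur) <= MAX_FORMANTS
            and interval_variance(lsp_cur) <= INTERVAL_VARIANCE_MAX)
```

Whether a detected frame is an onset or an offset of unvoiced sound would still have to be decided from its neighbours, for example from the existing silence/speech classification; that step is not shown.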

FIGS. 5 to 7 show the unvoiced frames detected by the proposed decision logic. FIGS. 5 and 6 show detections where speech begins with an unvoiced sound, and FIG. 7 shows a detection where speech ends with an unvoiced sound.

For a frame detected in this way, the analysis and encoding steps are all skipped, and the frame is encoded with just one added bit indicating a new frame type distinct from the existing silence and speech (voice activity) types. That is, using one bit, the starting frame of an unvoiced sound (for example, a transition from silence to unvoiced sound) can be encoded as 1, and the ending frame of an unvoiced sound (for example, a transition from unvoiced sound to silence) can be encoded as 0. The decoder reads this bit and decodes using the data of the previous frame.
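The one-bit scheme can be sketched with a toy frame stream. The sketch follows the mapping stated in the summary of the invention and the claim (a frame flagged 1 borrows the LSP of the following frame, a frame flagged 0 borrows the LSP of the preceding frame); the tuple representation of a frame and the function names are hypothetical and do not reflect the G.723.1 bitstream format.

```python
def encode(lsp_frames, onset_frames, offset_frames):
    """Replace detected frames by a one-bit flag; other frames keep their LSP payload."""
    stream = []
    for i, lsp in enumerate(lsp_frames):
        if i in onset_frames:
            stream.append(("flag", 1))   # first frame of speech starting with unvoiced sound
        elif i in offset_frames:
            stream.append(("flag", 0))   # last frame of speech ending with unvoiced sound
        else:
            stream.append(("normal", list(lsp)))
    return stream


def decode(stream):
    """Rebuild one LSP vector per frame; flagged frames borrow from a neighbour."""
    out = [payload if kind == "normal" else None for kind, payload in stream]
    for i, (kind, payload) in enumerate(stream):
        if kind != "flag":
            continue
        j = i + 1 if payload == 1 else i - 1  # 1: following frame, 0: preceding frame
        if 0 <= j < len(stream) and stream[j][0] == "normal":
            out[i] = out[j]
    return out
```

A flagged frame at the very start or end of the stream, or one whose neighbour is itself flagged, is left as `None` here; how such cases should be handled is not stated in the text.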

Speech samples 1) and 2) each contain all of utterances 1) through 5), and speech sample 3) consists of the voices of weather forecast announcers. The average error rate is 0.18%.

At the analysis stage of a CELP-type speech coder, a low-bit-rate vocoder, linear prediction is performed on the spectrum to obtain the LPC, which is converted to LSP for transmission. This makes the parameters robust to errors and easy to interpolate linearly, which helps preserve sound quality.

The present invention proposes a method of reducing the transmission rate by using the LSP obtained at the analysis stage to detect part of the unvoiced sound in speech. First, the LSP's ability to indicate the number of formants, its reflection of the spectrum, and the variance of the LSP intervals are used to find unvoiced sections where the transmission rate can be lowered with minimal degradation of sound quality. The transmission rate is then lowered for these sections using simple encoding and decoding. When the method was applied to the 5.3 kbps mode of G.723.1, a CELP-type vocoder, the average detection error rate of the component separation was 0.18%, the transmission rate was reduced by about 10.9%, and a subtle degradation of about 0.034 occurred in sound quality.

Claims

A method of encoding by performing component separation using LSP parameters in a CELP vocoder, comprising: tracking spectral change with the LSP to find portions where the change is large; using the interval information of the LSP and the variance of the intervals together to detect the first frame of a portion where speech begins with an unvoiced sound, or the last frame of a portion where speech ends with an unvoiced sound; encoding the detected frame with one bit, as 1 for a frame detected as the first frame beginning with an unvoiced sound and as 0 for a frame detected as the last frame ending with an unvoiced sound; and, on decoding, decoding a frame encoded as 1 using the LSP information of the following frame and decoding a frame encoded as 0 using the LSP information of the preceding frame.