KR20010066558A

KR20010066558A - Voice activity detection method of voice signal processing coder using energy and LSP parameter

Info

Publication number: KR20010066558A
Application number: KR1019990068413A
Authority: KR
Inventors: 김정진; 장경아; 배명진
Original assignee: 대표이사 서승모; (주)씨앤에스 테크놀로지
Priority date: 1999-12-31
Filing date: 1999-12-31
Publication date: 2001-07-11
Also published as: KR100312334B1

Abstract

PURPOSE: A voice activity detection method for a speech signal processing encoder is provided that uses energy and LSP (line spectrum pair) parameters to reduce the transmission rate during silent intervals. CONSTITUTION: A first process (S10) calculates the average energy of a frame for voice activity detection. A second process (S20) compares the calculated average energy with a noise level and decides that the frame is voiced if the average energy is higher than the noise level, and unvoiced or silent otherwise. When the second process (S20) decides that the frame is voiced, a third process (S30) makes its decision using the minimum and maximum LSP intervals, to account for a low SNR (signal-to-noise ratio). When the average energy is lower than the noise level in the second process (S20), a fourth process (S40) compares the minimum and maximum LSP intervals, to account for low-energy speech.

Description

Voice activity detection method of voice signal processing coder using energy and LSP parameter

The present invention relates to a voice activity detection method that uses energy and LSP parameters, as part of a method for lowering the transmission rate during silent intervals in a speech signal processing encoder.

Speech signal processing encoders use a voice activity detector (VAD) and a comfort noise generator (CNG) to lower the transmission rate during silent intervals. The VAD outputs 1 when a speech signal is present in a frame and 0 otherwise, and this output determines whether the CNG operates.

However, the VAD of the G.723.1 vocoder applies a variety of conditions for the continuity and stability of its decisions, and it decides whether a speech signal is present according to these conditions. It detects pitch periodicity (for voiced-sound discrimination) and sinusoids, and it uses spectral characteristics to discriminate signals with a low signal-to-noise ratio accurately.

The conventional VAD uses an adaptation enable flag for stable and continuous decisions. Because this value follows a slow-increase, fast-decrease characteristic, an interval that contains silence is marked as a speech frame whenever the silence does not persist. The central difficulty for a VAD is that it must be able to detect speech against any background noise. For example, it must decide accurately whether speech is present even in a signal with a very low signal-to-noise ratio. With a simple discrimination condition, however, it is almost impossible to distinguish speech from noise when the noise signal is lower than the speech signal.

The present invention is proposed to solve the above problems of the prior art. Because spectral characteristics must be considered to determine whether a speech signal is present, the VAD uses an inverse filter whose coefficients are derived from noise intervals, and for stable and continuous decisions it sets its output to 1 in almost all cases at the beginning of a signal. The object of the invention is to provide a method that, without impairing this stability, detects speech intervals more easily than the conventional method, and that uses LSP parameters to take frequency characteristics into account when deciding whether speech is present in a signal with a low signal-to-noise ratio.

As a technical idea for achieving this object, the invention comprises: a first process of calculating the average energy of a frame for voice activity detection; a second process of comparing the calculated average energy with a noise level, and deciding that the frame is voiced if the average energy is greater than the noise level and unvoiced or silent otherwise; a third process of making a decision using the minimum and maximum LSP intervals, to account for a low signal-to-noise ratio (SNR), when the frame is decided to be voiced in the second process; and a fourth process of comparing the maximum and minimum LSP intervals, to account for low-energy speech, when the average energy is less than the noise level.

FIG. 1 is a block diagram of a speech signal processing encoder having a voice activity detector and a comfort noise generator.

FIG. 2 is a block diagram showing an example of a voice activity detector.

FIG. 3 is a flowchart of the voice activity detection algorithm using energy and LSP parameters according to the present invention.

<Brief description of the main parts of the drawings>

1: input; 2: analog-to-digital converter (ADC)

3: inverse filter analyzer; 4: autocorrelation coefficients (ACF)

4a: buffer; 4b: averager

5: weighter; 6: adder

7: comparator; 8: speech/non-speech output

14: autocorrelation coefficients; 15: buffer

20: control signal generation circuit; 21: LPC analyzer

24: buffer; 27: pitch analyzer

The configuration and operation of an embodiment of the present invention are described below with reference to the accompanying drawings.

FIG. 1 is a block diagram of a speech signal processing encoder having a voice activity detector and a comfort noise generator.

As shown in FIG. 1, the encoding stage consists of a speech signal processing encoder, a voice activity detector (VAD), an encoder-side comfort noise generator (COD-CNG), and a multiplexer (MUX).

The voice activity detector (VAD) decides whether speech is present in each 30 ms frame produced by the speech encoder. The VAD decision for frame t is denoted Vad_t, and it is the input to the COD-CNG block, which computes Ftype_t.

As shown in FIG. 2, the voice activity detector comprises an analog-to-digital converter (2), an inverse filter analyzer (3), an ACF unit (4), a buffer (4a), an averager (4b), a coefficient multiplier (5), an adder (6), a comparator (7), a control signal generation circuit (20), and related components.

The input signal (1) is sampled and digitized by the analog-to-digital converter (2) and fed to the inverse filter analyzer (3). The inverse filter analyzer (3), which is part of the speech coder in which the voice activity detector operates, produces the coefficients of a filter corresponding to the inverse of the input signal spectrum. The digitized signal is also supplied to an autocorrelator (4), which produces the autocorrelation vector of the input signal. To improve reliability, the autocorrelation coefficients are averaged over several consecutive speech frames: the coefficients output by the autocorrelator (4) are stored in the buffer (4a), and the averager (4b) forms a weighted sum of the current autocorrelation coefficients and those of previously stored frames supplied from the buffer (4a). The averaged autocorrelation coefficients are passed to the weighter (5) and the adder (6), which also receive, from the autocorrelator (14) via the buffer (15), the stored autocorrelation vector of the inverse filter coefficients for the noise period. A measure M is formed from the averaged autocorrelation coefficients and this autocorrelation vector. The measure is compared with a threshold in the comparator (7), and the logical result of the comparator indicates at the output (8) whether speech is present.
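The averaging step performed by the buffer (4a) and averager (4b) can be illustrated with a minimal Python sketch. The function names, the frame history layout (newest frame first), and the weights are assumptions made for illustration; the source states only that a weighted sum over the current and previously stored frames is formed.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r[0..order] of one frame of samples."""
    n = len(frame)
    return [sum(frame[i] * frame[i + k] for i in range(n - k))
            for k in range(order + 1)]


def averaged_autocorrelation(history, weights):
    """Weighted sum of autocorrelation vectors.

    history: list of autocorrelation vectors, newest first (assumed layout).
    weights: one weight per stored vector (hypothetical values).
    """
    length = len(history[0])
    return [sum(w * acf[k] for w, acf in zip(weights, history))
            for k in range(length)]


current = autocorrelation([1.0, 2.0, 3.0], order=1)      # [14.0, 8.0]
previous = [4.0, 8.0]                                     # stored earlier frame
smoothed = averaged_autocorrelation([current, previous], [0.5, 0.5])
```

Equal weights are used here only to keep the arithmetic easy to follow.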

So that the inverse filter coefficients correspond to an estimate of the inverse of the noise spectrum, it is desirable to update the coefficients during noise. However, the speech/non-speech decision on which the update is based should not depend on the result of the update; otherwise a single misidentified frame would cause the voice activity detector to misidentify the frames that follow. A control signal generation circuit (20) is therefore provided. It produces an independent control signal indicating whether speech is present, which controls the inverse filter analyzer (3) so that the inverse filter autocorrelation coefficients used to form the measure M are updated only during noise periods. The control signal generation circuit (20) includes an LPC analyzer (21), which produces LPC coefficients corresponding to the input signal, and an autocorrelator (21a), which produces autocorrelation coefficients. These autocorrelation coefficients are supplied to a weighter (22) and an adder (23), which receive the autocorrelation vector of the input signal from the autocorrelator (4). The degree of spectral similarity between the input speech frame and the previous speech frame is then calculated. The spectral difference signal, thresholded by the comparator (26), indicates whether speech is present. Although this measure is well suited to separating noise from unvoiced sounds, it is not sufficient for separating noise from voiced speech, so a speech detection circuit including a pitch analyzer (27) is provided within the circuit (20). The pitch analyzer (27) produces a logic signal that is true when a speech signal is detected. This signal, together with the thresholded measure from the comparator (26), is input to a NOR gate (28), which outputs "false" when speech is present and "true" when noise is present. The gate output is supplied to the buffer (15) so that the inverse filter coefficients are updated only during noise periods.

The threshold adapter (29) is connected to receive the non-speech control output of the control signal generation circuit (20), and its output is supplied to the comparator (7). The threshold adapter (29) raises or lowers the threshold in steps until the threshold approximates the noise level. When the input signal is very small, it is desirable for the threshold to be set automatically to a fixed low level, because at low signal levels the quantization performed by the ADC (2) can produce undesirable results.
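A minimal Python sketch of this stepwise adaptation follows. The step size, the fixed floor, and the "very small input" limit are hypothetical parameters; the source gives no numeric values and does not say how the input level is measured.

```python
def adapt_threshold(threshold, noise_level, input_level, is_noise,
                    step=1.0, floor=1.0, tiny_input=0.5):
    """One adaptation step of the comparator threshold.

    step, floor and tiny_input are placeholder values (assumptions).
    """
    if input_level < tiny_input:
        # Very small input: fall back to a fixed low threshold.
        return floor
    if not is_noise:
        # Adapt only while the control circuit reports non-speech.
        return threshold
    if threshold < noise_level - step:
        return threshold + step
    if threshold > noise_level + step:
        return threshold - step
    return threshold  # already close to the noise level
```

Called once per frame, the threshold moves toward the noise level by at most one step and stops changing once it is within one step of it.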

As shown in FIG. 3, the method comprises: a first process (S10) of calculating the average energy of a frame for voice activity detection; a second process (S20) of comparing the calculated average energy with a noise level, and deciding that the frame is voiced if the average energy is greater than the noise level and unvoiced or silent otherwise; a third process (S30) of making a decision using the minimum and maximum LSP intervals, to account for a low signal-to-noise ratio (SNR), when the frame is decided to be voiced in the second process; and a fourth process (S40) of comparing the maximum and minimum LSP intervals, to account for low-energy speech, when the average energy is less than the noise level in the second process.

The third process consists of a step (S31) of declaring voice activity, on the basis that a formant exists, when the minimum LSP interval is greater than 1/2 of the maximum interval, and a step (S32) of otherwise judging the signal to be one with high noise energy and increasing the noise level. The fourth process consists of a step (S41) of declaring that speech is present and reducing the noise level when the minimum LSP interval is less than 1/2 of the maximum interval, and a step (S42) of otherwise deciding that the frame is unvoiced or silent.

First, the VAD calculates the average energy of a frame of 240 samples and compares it with the noise level. If the average energy is greater than the noise level, the frame is judged to be voiced; otherwise it is judged to be unvoiced or silent. When the frame is judged to be voiced, the final decision is made using the minimum and maximum LSP intervals, to account for a low SNR. If the minimum LSP interval is greater than 1/2 of the maximum interval, a formant is present and VAD = 1 is set. Otherwise, the signal is judged to have high noise energy and the noise level is increased.

When the average energy is less than the noise level, the maximum and minimum LSP intervals are compared in a similar way, to account for low-energy speech. If the minimum LSP interval is less than 1/2 of the maximum interval, speech is present, so VAD = 1 is set and the noise level is reduced.
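The decision flow of steps S10 to S42 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name, the multiplicative `step` used to raise or lower the noise level, and the choice of VAD = 0 in step S32 are assumptions, since the source does not say by how much the noise level changes or state the output for S32 explicitly. The LSP values are assumed to be sorted in ascending order.

```python
def vad_decision(frame, lsp, noise_level, step=1.05):
    """Return (vad, updated_noise_level) for one frame.

    frame: samples of the frame (240 in the source).
    lsp: LSP frequencies in ascending order (assumed layout).
    step: hypothetical factor for raising/lowering the noise level.
    """
    # S10: average energy of the frame.
    energy = sum(x * x for x in frame) / len(frame)

    # Intervals between adjacent LSP frequencies.
    gaps = [b - a for a, b in zip(lsp, lsp[1:])]
    min_gap, max_gap = min(gaps), max(gaps)

    # S20: compare the average energy with the noise level.
    if energy > noise_level:
        if min_gap > 0.5 * max_gap:
            return 1, noise_level           # S31: formant present
        return 0, noise_level * step        # S32: loud noise, raise level
    if min_gap < 0.5 * max_gap:
        return 1, noise_level / step        # S41: quiet speech, lower level
    return 0, noise_level                   # S42: unvoiced or silent
```

Both LSP branches use the same 1/2 ratio, as in the text; only the direction of the comparison and the noise level adjustment differ.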

The VAD is basically an energy detector. The energy of the inverse-filtered signal is compared with a threshold, and if it exceeds the threshold, the frame is judged to contain speech. Computing the threshold takes two steps. First, the noise level is updated on the basis of the noise level of the previous frame and the inverse-filtered energy of the current frame. Second, the threshold is computed from the noise level on a logarithmic scale.

The adaptation enable flag, denoted Aen_t for the current frame t, is used to ensure that the VAD noise level is updated only when no speech is present. This relies on the fact that background noise is neither a speech signal nor a sinusoid.

The open-loop pitch lags of the previous and current frames are used to detect voiced sound.

Let these lags be L_OL^j, j = 0, 1, 2, 3. The minimum lag L_OL^min = Min(L_OL^j, j = 0, 1, 2, 3) is computed first. The counter pc ∈ [1, 2, 3, 4] indicates how many of the lags L_OL^j lie around a multiple of L_OL^min (within ±3). If pc equals 4, the signal is judged to be voiced.
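The counter can be sketched in Python as below. The function name is hypothetical, and the interpretation that "around a multiple" means within ±3 samples of the nearest integer multiple of the minimum lag is an assumption drawn from the wording above.

```python
def count_pitch_multiples(lags, tol=3):
    """Count lags lying within +/- tol of a multiple of the minimum lag."""
    l_min = min(lags)  # assumed positive
    pc = 0
    for lag in lags:
        # Nearest integer multiple of the minimum lag (at least 1x).
        nearest = max(1, round(lag / l_min)) * l_min
        if abs(lag - nearest) <= tol:
            pc += 1
    return pc


pc = count_pitch_multiples([40, 80, 41, 122])
voiced = (pc == 4)
```

The minimum lag always counts itself, so the counter is at least 1, which agrees with the stated range of pc.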

The following sinusoid detector is included in the LPC analysis of the G.723.1 encoder.

Let k_i^t[2] be the second reflection coefficient computed by the Durbin algorithm for each subframe i = 0, ..., 3 of frame t. If at least 14 out of 15 values satisfy k_i^t[2] ≥ 0.95, a sinusoid is judged to be detected (SinD = 1). Otherwise, SinD = 0.
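A minimal Python sketch of this 14-out-of-15 test follows. It assumes that the 15 values are the most recent second reflection coefficients, held in a sliding window; the source says only "15 values", so the windowing and the class name are assumptions.

```python
from collections import deque


class SineDetector:
    """Tracks recent second reflection coefficients and reports SinD."""

    def __init__(self, history=15, needed=14, threshold=0.95):
        self.needed = needed
        self.threshold = threshold
        # Sliding window of pass/fail flags (assumed: most recent 15 values).
        self.flags = deque(maxlen=history)

    def update(self, k2):
        """Add one coefficient and return SinD (1 or 0)."""
        self.flags.append(k2 >= self.threshold)
        return 1 if sum(self.flags) >= self.needed else 0
```

With four subframes per frame, `update` would be called four times per frame, and the value returned by the last call serves as SinD for that frame.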

The adaptation enable flag is computed as follows.

Aen_t = Aen_{t-1} + 2, if pc = 4 or SinD = 1

Aen_t = Aen_{t-1} - 1, otherwise

In Equation 1, Aen_t is bounded to the range [0, 6].
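The flag update, with its bounds, amounts to a few lines of Python. The function name is hypothetical; the increments and the [0, 6] range come from the text above.

```python
def update_aen(aen_prev, pc, sind):
    """Adaptation enable flag for the current frame, bounded to [0, 6]."""
    if pc == 4 or sind == 1:
        aen = aen_prev + 2   # voiced or sinusoidal frame
    else:
        aen = aen_prev - 1
    return max(0, min(6, aen))
```

Because the flag rises by 2 and falls by 1, a run of voiced or sinusoidal frames raises it quickly, and it takes several frames without them to bring it back to 0.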

The input signal frame, {s[n]}_{n=60..239}, is inverse filtered by the FIR filter A_no(z) with coefficients {a_no[j]}_{j=1..10}. This filter is computed by the CNG block and provides an LPC filter associated with the background noise of the current frame.

In Equation 2, e'_t is the inverse-filtered signal.

The energy, Enr_t, is computed from the inverse-filtered signal of the current frame.
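The inverse filtering and energy computation can be sketched in Python as below. Equation 2 is not reproduced in this text, so the filter form e[n] = s[n] + Σ a_no[j]·s[n-j] (including its sign convention) and the mean-square normalization of the energy are assumptions made for illustration.

```python
def inverse_filter_energy(s, a_no, start=60):
    """Mean-square energy of the inverse-filtered samples s[start:].

    s: samples of the frame (240 in the source).
    a_no: FIR coefficients a_no[1..10], stored 0-based here.
    Assumed filter form: e[n] = s[n] + sum_j a_no[j] * s[n - j].
    """
    order = len(a_no)
    assert start >= order, "need enough past samples for the filter"
    residual = []
    for n in range(start, len(s)):
        acc = s[n]
        for j in range(1, order + 1):
            acc += a_no[j - 1] * s[n - j]
        residual.append(acc)
    return sum(e * e for e in residual) / len(residual)
```

When the filter matches the input spectrum, the residual and its energy are small, which is why the energy of the inverse-filtered signal is a useful quantity to compare against the threshold.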

The noise level of frame t, Nlev_t, is updated from the previous noise level, the previous energy Enr_{t-1}, and the adaptation enable flag Aen_t. The update is characterized by slow increase and fast decrease. The dynamic range of the noise level in frame t is limited to [Nlev_min, Nlev_max].

If Nlev_{t-1} > Enr_{t-1}, the noise level is clipped.

If adaptation is active, Nlev_t is increased; otherwise it is decreased slightly.
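The shape of this update can be sketched in Python. The update equations and their constants are not reproduced in this text, so every number below (growth and decay factors, range limits) is a placeholder, and clipping the noise level directly to the previous energy is an assumption about what "clipped" means.

```python
def update_noise_level(nlev_prev, enr_prev, adaptation_active,
                       up=1.03, down=0.999, nlev_min=1.0, nlev_max=1e6):
    """Noise level for the current frame (all constants are placeholders)."""
    # Clip when the previous noise level exceeds the previous energy
    # (assumed here to mean: pull the level down to that energy).
    nlev = min(nlev_prev, enr_prev)
    # Increase while adaptation is active, otherwise decrease slightly.
    nlev *= up if adaptation_active else down
    # Keep the result inside [nlev_min, nlev_max].
    return max(nlev_min, min(nlev_max, nlev))
```

The clip gives the fast decrease and the small growth factor gives the slow increase described in the text.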


The relationship between the noise level Nlev_t in frame t and the threshold Thr is defined on a logarithmic scale, using the following formula.

The VAD decision is made by comparing the threshold Thr with the current energy Enr_t.

All static variables of the VAD algorithm are initialized to 0, except for the following variables.

Nlev_{-1} = 1024

Enr_{-1} = 1024

L_OL^j = 1, j = 0, 1

L_OL^j = 60, j = 2, 3

In the spectrum of voiced sound, the first formant generally appears below 1 kHz and three or more formants are present. The spectrum of unvoiced sound, by contrast, has low energy in the low-frequency region and high energy in the high-frequency region. The LSP pattern differs between voiced and unvoiced sound just as clearly as the spectral shape does. In voiced sound, the formants cause more LSPs to fall in the low-frequency region, where they are spaced more closely than the high-frequency LSPs. In unvoiced sound, more LSPs fall in the high-frequency region, where they are spaced more closely than the low-frequency LSPs.

These voiced/unvoiced characteristics of the LSP parameters make it possible to separate the two classes. The difference in LSP distribution between voiced and unvoiced sound is used first. With a sampling frequency Fs, let NL be the number of LSPs in the frequency region at or below Fs/4 and NH the number of LSPs in the region at or above Fs/4. If NL is greater than NH, the spectrum of the speech signal has many peaks (poles) on the low-frequency side and is considered to show the spectral characteristics of voiced sound. This is because the first and second formants of voiced sound lie mainly in the low-frequency region.

Conversely, if NH is greater than NL, the sound is decided to be unvoiced, because the dominant formants of an unvoiced spectrum appear in the high-frequency region. However, for voiced sounds such as /i/, /I/, /ε/, and /ae/, the second, third, or fourth formant lies in the high-frequency region, so NH exceeds NL even though the sound is voiced. In such cases, the presence or absence of the first formant determines whether the sound is unvoiced or voiced. That is, the intervals between the LSP parameters are examined, and if there are LSPs with a narrow interval in the region at or below Fs/4, the sound is regarded as voiced.
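The NL/NH comparison and the first-formant check can be sketched in Python as follows. The function name and the `narrow_gap_hz` threshold are hypothetical, because the source does not quantify a "narrow interval". Ties (NL equal to NH) are not covered by the text and are routed through the interval check here as an assumption.

```python
def classify_by_lsp(lsp_hz, fs, narrow_gap_hz):
    """Label a frame 'voiced' or 'unvoiced' from its LSP frequencies in Hz.

    lsp_hz: LSP frequencies in ascending order (assumed layout).
    narrow_gap_hz: hypothetical limit for a 'narrow' LSP interval.
    """
    quarter = fs / 4
    low = [f for f in lsp_hz if f <= quarter]
    nl = len(low)
    nh = len(lsp_hz) - nl
    if nl > nh:
        return "voiced"
    # NH >= NL: look for closely spaced LSPs at or below Fs/4, which
    # would indicate a first formant despite the high-frequency LSPs.
    has_narrow_low_gap = any(b - a < narrow_gap_hz
                             for a, b in zip(low, low[1:]))
    return "voiced" if has_narrow_low_gap else "unvoiced"
```

For example, with Fs = 8000 Hz a frame whose only low-frequency LSPs sit at 250 Hz and 350 Hz is still labeled voiced when `narrow_gap_hz` is 200, which mirrors the /i/-type vowel case described above.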

As the above description shows, the present invention detects silent intervals more often than the conventional G.723.1 voice activity detector, which reduces the transmission rate and thereby makes it possible to serve more users. The invention can be used not only for voice activity detection but also as an algorithm for detecting silent intervals in speech recognition or speaker identification, and it also gives effective results when detecting voiced and unvoiced intervals in speech analysis. Speech analysis requires the minimum LSP interval, and because this information indicates the position of the first formant, adding the algorithm proposed in the present invention makes voiced/unvoiced decisions possible.

Claims

A voice activity detection method in a speech signal processing encoder using energy and LSP parameters, for lowering the transmission rate during silent intervals in the speech signal processing encoder, the method comprising:

a first process of calculating the average energy of a frame for voice activity detection;

a second process of comparing the calculated average energy with a noise level, and deciding that the frame is voiced if the average energy is greater than the noise level and unvoiced or silent otherwise;

a third process of making a decision using the minimum and maximum LSP intervals, to account for a low signal-to-noise ratio (SNR), when the frame is decided to be voiced in the second process; and

a fourth process of comparing the maximum and minimum LSP intervals, to account for low-energy speech, when the average energy is less than the noise level.

The method of claim 1, wherein the third process comprises: declaring voice activity, on the basis that a formant exists, when the minimum LSP interval is greater than 1/2 of the maximum interval; and

otherwise, judging the signal to be one with high noise energy and increasing the noise level.

The method of claim 1, wherein the fourth process comprises: declaring that speech is present and reducing the noise level when the minimum LSP interval is less than 1/2 of the maximum interval; and

otherwise, deciding that the frame is unvoiced or silent.