KR100294921B1

KR100294921B1 - The method and apparatus of speech detection for speech recognition of cellular communication system

Info

Publication number: KR100294921B1
Application number: KR1019980037172A
Authority: KR
Inventors: 김경선; 공병구; 김동국; 김진
Original assignee: 윤종용; 삼성전자 주식회사
Priority date: 1998-09-09
Filing date: 1998-09-09
Publication date: 2001-07-12
Also published as: KR20000019198A

Abstract

The present invention relates to a speech detection method and apparatus that use only the speech feature parameters generated by a vocoder. The method comprises: (a) extracting the speech feature parameters from the vocoder; (b) generating a pseudo-signal from the speech feature parameters; (c) generating a per-frame pseudo-signal gain by summing the absolute values of the pseudo-signal; (d) computing an overlapping (moving) average of the pseudo-signal gain over time; (e) using the averaged gain and the results for previous frames to determine whether the current frame is speech or a silent interval between speech segments, and reporting start and end positions that include the silent intervals between speech segments; and (f) using the result of step (e), performing step (e) again within a predetermined number of frames before and after the start and end positions of the speech interval to correct the speech output result.

According to the invention, speech can be detected with little computation, so the speech detection function can be added to existing mobile phones in software alone, without additional hardware. Because speech can be detected even in environments with diverse noise, such as highways, voice dialing is possible while driving.

Description

The method and apparatus of speech detection for speech recognition of cellular communication system

The present invention relates to speech recognition in mobile phones, and more particularly to a method of selecting only the speech portion of a noisy speech input in a mobile phone terminal.

Many methods exist for detecting speech from an input speech signal, but detecting speech in a mobile phone is a different problem. First, the input is not the speech waveform but the output of the encoder adopted by each mobile phone.

In CDMA, for example, the input consists of 10th-order LSP (Line Spectrum Pair) values, a per-frame gain value, and pitch information. In addition, the computation must stay within 1 to 2 MIPS and the memory within 100 KB. The conventional approach of running the decoder to reconstruct the speech signal and then detecting the speech interval is therefore not feasible in CDMA or GSM mobile phones. Meanwhile, detecting speech from the gain value alone, among the speech feature parameters generated in the mobile phone, works in quiet environments but has a very low success rate in noisy highway environments.

FIGS. 1 and 2 show conventional speech detection apparatuses.

FIG. 1 is a block diagram of a speech detection apparatus that uses the per-frame gain parameter.

The input interface 110 extracts the speech feature parameters from the vocoder's per-frame packet data within a predetermined time. The frame state determination unit 120 uses the gain parameter among the speech feature parameters to determine whether the current frame is speech, a silent interval between speech segments, or background noise, and thereby determines the primary speech interval. The primary speech interval determination reports the start and end positions, including the silent intervals between speech segments. Using the primary speech interval result, the post-processing unit 130 refines the start and end positions.

FIG. 2 is a block diagram of a speech detection apparatus that uses a reconstructed signal.

The input interface 210 extracts the speech feature parameters from the vocoder's per-frame packet data within a predetermined time. The input signal reconstruction unit 220 generates a reconstructed signal from the speech feature parameters. The gain generation unit 230 generates a per-frame gain by summing the absolute values of the reconstructed signal. The frame state determination unit 240 uses the reconstructed-signal gain to determine whether the current frame is speech, a silent interval between speech segments, or background noise, and thereby determines the primary speech interval. The primary speech interval determination reports the start and end positions, including the silent intervals between speech segments. Using the primary speech interval result, the post-processing unit 230 refines the start and end positions.

Speech detection for voice dialing on a mobile phone is difficult because the available resources are extremely limited and because speech is hard to distinguish from the noise produced in real conditions such as highways. More specifically, the absolute level of real-world noise can be high: on highways and national roads, for example, the signal-to-noise ratio (SNR) is about 6 dB to -6 dB. The noise to be handled is also varied: speech must be distinguished from the car's own engine, wind heard through gaps, turn signals and air conditioning, passing vehicles, and thumps caused by the road surface. Finally, the noise characteristics are time-varying: the absolute level of engine and wind noise changes constantly with the car's speed and direction.

Consequently, the apparatus of FIG. 1 has the drawback that its speech detection rate drops sharply, to about 21%, where the SNR is about 6 dB to -6 dB. The apparatus of FIG. 2 needs about 15 MIPS for speech detection, which is too much computation and requires additional hardware.

The technical problem addressed by the invention is to provide a speech detection method based on pseudo-signal generation, and an apparatus therefor, that detects speech accurately with little computation and can therefore be added to existing mobile phones in software alone, without additional hardware.

FIG. 1 is a block diagram of a speech detection apparatus that uses the per-frame gain parameter.

FIG. 2 is a block diagram of a speech detection apparatus that uses a reconstructed signal.

FIG. 3 is a block diagram of a speech detection apparatus that uses the pseudo-signal generation method according to the invention.

To solve this technical problem, a speech detection method according to the invention uses only speech feature parameters generated by a speech vocoder and comprises: (a) extracting the speech feature parameters from the vocoder; (b) generating a pseudo-signal from the speech feature parameters; (c) generating a per-frame pseudo-signal gain by summing the absolute values of the pseudo-signal; (d) computing an overlapping average of the pseudo-signal gain over time; (e) using the overlapping-average result and the results for previous frames to determine whether the current frame is speech or a silent interval between speech segments, and reporting start and end positions that include the silent intervals between speech segments; and (f) using the result of step (e), performing step (e) again within a predetermined number of frames before and after the start and end positions of the speech interval to correct the speech output result.

To solve the other technical problem, a speech detection apparatus according to the invention detects speech using only speech feature parameters generated by a speech vocoder and comprises: an input interface that extracts the speech feature parameters from the vocoder's per-frame packet data within a predetermined time; a pseudo-signal generation unit that generates a pseudo-signal from the speech feature parameters; a pseudo-signal gain generation unit that generates a per-frame pseudo-signal gain by summing the absolute values of the pseudo-signal; a gain average calculation unit that computes an overlapping average of the pseudo-signal gain over time; a frame state determination unit that uses the output of the gain average calculation unit and the results for previous frames to determine whether the current frame is speech or a silent interval between speech segments, and reports start and end positions that include the silent intervals between speech segments; and a post-processing unit that, using the result of the frame state determination unit, re-runs the frame state determination unit within a predetermined number of frames before and after the start and end positions of the speech interval to correct the speech output result.

The invention is described in detail below with reference to the drawings.

FIG. 3 shows a speech detection apparatus using the pseudo-signal generation method according to the invention. It comprises an input interface 310, a pseudo-signal generation unit 320, a pseudo-signal gain generation unit 330, a gain average calculation unit 340, a frame state determination unit 350, and a post-processing unit 360.

The input interface 310 extracts the speech feature parameters from the vocoder's per-frame packet data within a predetermined time: it unpacks the vocoder packet data, interprets the data, and feeds it to the speech detection stage.

The pseudo-signal generation unit 320 generates a pseudo-signal from the speech feature parameters.

The pseudo-signal gain generation unit 330 takes the absolute value of the generated pseudo-signal and sums it over one frame to produce the per-frame pseudo-signal gain.

The gain average calculation unit 340 computes the average over four consecutive frames.

The frame state determination unit 350 uses the averaged pseudo-signal gain and past state decisions to determine whether the current frame is speech, a silent interval between speech segments, or background noise. The primary speech interval determination reports the start and end positions, including the silent intervals between speech segments.

The post-processing unit 360 uses the primary speech interval result to refine the start and end positions.

The operation of the invention is described below on the basis of this configuration.

The invention uses only the speech feature parameters generated by vocoders such as QCELP, EVRC, and RPE-LTP. The vocoder packet data is unpacked, interpreted, and fed to the speech detection stage, and the feature parameters are extracted from each frame's packet data within 1 to 2 msec.

To generate the pseudo-signal from the feature parameters, a white-noise pulse is gain-filtered using the vocoder gain value and then pitch-filtered using the pitch information. The resulting signal is not speech that a human listener could recognize, but it contains the loudness and frequency information that are the basic cues for distinguishing speech. Equation 1 generates the pseudo-signal.

$$x(i) = w(i) \cdot G + x(i - L) \cdot B \qquad (1)$$

Here, i is the time variable (the sample index), G is the per-frame gain value, L is the per-frame pitch (used as a lag in samples), B is the per-frame pitch gain, x(i) is the pseudo-signal, and w(i) is the white signal.

In this way, the signal reconstruction stage of each vocoder can be replaced, which makes a real-time speech detector feasible on a mobile phone.
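The recursion of Equation 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `generate_pseudo_signal`, the uniform white source, the `white` and `seed` parameters, and the choice to treat samples before the start of the signal as zero are all hypothetical. The source only specifies the per-frame parameters G, L, and B and the recursion itself.

```python
import random


def generate_pseudo_signal(frames, frame_size=160, white=None, seed=0):
    """Generate x(i) = w(i)*G + x(i-L)*B for a sequence of frames.

    frames: iterable of (G, L, B) tuples, one per frame
            (gain, pitch lag in samples, pitch gain).
    white:  optional zero-argument callable returning the next white
            sample w(i); defaults to uniform noise in [-1, 1]
            (an assumption, the source does not specify the source).
    """
    if white is None:
        rng = random.Random(seed)
        white = lambda: rng.uniform(-1.0, 1.0)

    x = []
    for gain, pitch_lag, pitch_gain in frames:
        if pitch_lag < 1:
            raise ValueError("pitch lag L must be at least 1 sample")
        for _ in range(frame_size):
            i = len(x)
            # Assumption: samples before the start of the signal are zero.
            past = x[i - pitch_lag] if i >= pitch_lag else 0.0
            x.append(white() * gain + past * pitch_gain)
    return x
```

Each output sample costs two multiplications and one addition, so the cost per frame stays far below that of running a full vocoder decoder.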

To obtain the gain of the pseudo-signal generated by Equation 1, the absolute values of the pseudo-signal are summed over one frame. An exact gain would require squaring each sample, summing, and taking the square root, but multiplication and square roots cannot be used in a real-time implementation. Equation 2 gives the pseudo-signal gain.

$$s(j) = \operatorname{abs\_sum}\bigl(x(jI) \sim x((j+1)I - 1)\bigr) = \sum_{i = jI}^{(j+1)I - 1} |x(i)| \qquad (2)$$

Here, j is the frame index, j = i / I (integer division), and I is the frame size, typically 160 samples (20 msec). abs_sum(x(k) ~ x(m)) denotes the sum of the absolute values from the k-th to the m-th pseudo-signal sample.
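As a small illustration of Equation 2, the per-frame gain is a sum of absolute values over consecutive, non-overlapping blocks of I samples. The function name `frame_gains` is hypothetical, and dropping a trailing partial frame is an assumption the source does not address.

```python
def frame_gains(x, frame_size=160):
    """Return s(j) = sum of |x(i)| over frame j, for every complete frame.

    Frame j covers samples j*frame_size .. (j+1)*frame_size - 1.
    Assumption: a trailing partial frame is ignored.
    """
    n_frames = len(x) // frame_size
    return [
        sum(abs(v) for v in x[j * frame_size:(j + 1) * frame_size])
        for j in range(n_frames)
    ]
```

For example, `frame_gains([1, -2, 3, -4, 5], frame_size=2)` yields `[3, 7]`; the fifth sample does not fill a frame and is left out. Only additions and sign changes are needed, which is the point of replacing the root-sum-of-squares gain.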

The pseudo-signal gain is averaged over four consecutive frames, as in Equation 3. Four consecutive frames are used because this is the length over which the characteristics of speech are best preserved.

$$nG(j) = \frac{s(j-3) + s(j-2) + s(j-1) + s(j)}{4} \qquad (3)$$

The per-frame pseudo-signal gain fluctuates widely over time, so detecting speech directly from the result of Equation 2 is error-prone. The output of Equation 3 varies more smoothly than that of Equation 2, which makes the speech and noise portions easier to separate.
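The overlapping average of Equation 3 can be sketched with a running sum, so that each new frame costs one addition, one subtraction, and one division by 4 (a 2-bit shift in fixed-point arithmetic). The function name `averaged_gains` is hypothetical, and treating the missing gains before the first frame as zero is an assumption; the source does not say how the first three frames are handled.

```python
def averaged_gains(s, window=4):
    """Return nG(j) = (s(j-3) + s(j-2) + s(j-1) + s(j)) / 4 for each frame j.

    Assumption: gains before the first frame count as zero, so the
    first window-1 outputs are biased low.
    """
    out = []
    acc = 0.0
    for j, gain in enumerate(s):
        acc += gain
        if j >= window:
            acc -= s[j - window]  # drop the frame that left the window
        out.append(acc / window)
    return out
```

For the gains `[4, 4, 4, 4, 8]` this produces `[1.0, 2.0, 3.0, 4.0, 5.0]`: the jump from 4 to 8 in the last frame moves the average by only 1, which shows the smoothing effect described above.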

The frame state determination unit 350 uses the averaged pseudo-signal gain and past state decisions to determine whether the current frame is speech, a silent interval between speech segments, or background noise, and thereby determines the primary speech interval. The primary speech interval determination reports the start and end positions, including the silent intervals between speech segments.

The terms used by the frame state determination unit 350 and its detailed operation are as follows.

1. Terminology

X[i]: the averaged pseudo-signal gain of the current frame.

TH1: a value equal to 3.75 times the average energy of the silent interval. A frame below this value is definitely silence.

TH2: a value equal to 6.25 times the average energy of the silent interval. A frame above this value is definitely speech.

LEN_TH: the value used to decide that speech input has ended. The length of silence from the end of the last speech segment to the current silent frame is compared with this value to decide whether speech input has ended. 0.5 seconds is typical.

2. Step-by-step description

(1) If X[i] is less than TH1, the frame is silence.

(2) If X[i] is greater than TH2, the frame is speech.

(3) If X[i] is greater than TH1 and less than TH2:

1) if the immediately preceding frame is speech, the frame is speech;

2) if the immediately preceding frame is not speech, the frame is silence.

(4) If the frame was judged silent in the preceding steps, a speech interval occurred in an earlier state decision, and the length from the end of the last speech interval to the current frame is greater than LEN_TH, speech detection is complete.

(5) When speech detection is complete, the start and end positions of the speech within the entire speech input are obtained.
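Steps (1) to (5) amount to a two-threshold hysteresis detector with an end-of-input timeout, which can be sketched as follows. This is an illustration under assumptions, not the patented code: the function name `detect_speech`, the `noise_level` argument (the average energy of the silent interval, whose estimation the source does not describe), and the expression of LEN_TH as a frame count (25 frames of 20 msec for 0.5 seconds) are hypothetical. Values exactly equal to TH1 or TH2 are not covered by the source; here they fall into the hysteresis branch.

```python
SILENCE, SPEECH = 0, 1


def detect_speech(avg_gains, noise_level, len_th_frames=25):
    """Classify frames and return (states, start, end) of the speech.

    avg_gains:     X[i], the averaged pseudo-signal gain per frame.
    noise_level:   average energy of the silent interval (assumed given).
    len_th_frames: LEN_TH expressed in frames (0.5 s / 20 ms = 25).
    start/end are frame indices, or None if no speech was found.
    """
    th1 = 3.75 * noise_level
    th2 = 6.25 * noise_level
    states = []
    start = end = None
    prev = SILENCE
    for i, gain in enumerate(avg_gains):
        if gain < th1:
            state = SILENCE            # step (1)
        elif gain > th2:
            state = SPEECH             # step (2)
        else:
            state = prev               # step (3): keep the previous state
        states.append(state)
        prev = state
        if state == SPEECH:
            if start is None:
                start = i
            end = i
        elif end is not None and i - end > len_th_frames:
            break                      # step (4): input has ended
    return states, start, end          # step (5)
```

With `noise_level=1` and `len_th_frames=2`, the gains `[1, 7, 5, 5, 2, 5, 1, 1, 1]` give speech from frame 1 to frame 3: the two frames at 5 after the peak stay speech because the previous frame was speech, while the later 5 after a silent frame stays silence.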

The post-processing unit 360 uses the primary speech interval result to refine the start and end positions. It re-runs the state determination within 10 frames before and after the start position and within 10 frames before and after the end position to correct the speech detection result.

The inputs to the post-processing unit 360 are the position information of the speech intervals determined by the frame state determination unit 350 and the averaged pseudo-signal gains corresponding to those intervals.

The post-processing proceeds step by step as follows.

(1) If speech frames are contiguous, the length of that contiguous interval (a speech pulse interval) is computed from its start and end positions.

(2) If the pulse interval is shorter than a set value (e.g., 0.04 seconds), it is ambient noise rather than speech and is merged into the silent interval.

(3) Taking the result of step (2) into account, check whether any silent interval is longer than LEN_TH (0.5 seconds).

3-1) If the first or last pulse was merged into silence in step (2), the start and end of the whole speech interval are changed accordingly and post-processing is complete.

3-2) If there are three or more pulses and a pulse other than the first or last is merged into silence, so that the gap between the speech pulses immediately before and after it exceeds LEN_TH (0.5 seconds), the whole speech interval is split into two parts, and the start position, end position, and length of each part are obtained. The longer of the two parts is selected, which completes post-processing.
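The pulse-based part of post-processing, steps (1) to (3), can be sketched as below. This is a simplified illustration, not the patented procedure: it does not re-run the state determination within 10 frames of the boundaries, and it generalizes step 3-2 from two parts to any number of parts by keeping the longest one. The function name `postprocess` and the frame-count parameters (2 frames for 0.04 seconds and 25 frames for 0.5 seconds, at 20 msec per frame) are assumptions.

```python
def postprocess(states, min_pulse_frames=2, len_th_frames=25):
    """Return (start, end) frame indices of the final speech interval.

    states: list of per-frame decisions, 1 for speech and 0 for silence.
    Returns None if no pulse survives.
    """
    # Step (1): find contiguous runs of speech frames (pulses).
    pulses = []
    run_start = None
    for i, state in enumerate(list(states) + [0]):  # sentinel closes last run
        if state == 1 and run_start is None:
            run_start = i
        elif state != 1 and run_start is not None:
            pulses.append((run_start, i - 1))
            run_start = None

    # Step (2): pulses shorter than the minimum are treated as noise.
    pulses = [(a, b) for a, b in pulses if b - a + 1 >= min_pulse_frames]
    if not pulses:
        return None

    # Step (3): split wherever the silence between pulses exceeds LEN_TH.
    parts = [[pulses[0]]]
    for pulse in pulses[1:]:
        gap = pulse[0] - parts[-1][-1][1] - 1
        if gap > len_th_frames:
            parts.append([pulse])
        else:
            parts[-1].append(pulse)

    # Keep the longest part (first one wins on a tie).
    spans = [(part[0][0], part[-1][1]) for part in parts]
    return max(spans, key=lambda span: span[1] - span[0])
```

For `states = [0,1,0,0,1,1,1,0,1,0,0,0,0,1,1,0]` with `len_th_frames=3`, the one-frame pulses at frames 1 and 8 are discarded, the remaining pulses (4 to 6) and (13 to 14) are separated by 6 silent frames, and the longer part, frames 4 to 6, is returned.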

The following tables compare the prior art with the invention.

Table 1 compares speech detection results on a CDMA terminal using the conventional per-frame gain value and the pseudo-signal gain value according to the invention.

| Method \ SNR | 22 dB or more | 15 to 10 dB | -6 to 6 dB |
|---|---|---|---|
| Per-frame gain | 95% | 72% | 21% |
| Pseudo-signal gain | 95% | 96% | 75% |

Table 2 compares speech detection results on a CDMA terminal using the reconstructed signal and the pseudo-signal according to the invention.

| Method \ SNR | 22 dB or more | 15 to 10 dB | -6 to 6 dB |
|---|---|---|---|
| Pseudo-signal | 99% | 96% | 75% |
| Reconstructed signal | 99% | 97% | 81% |

Table 3 compares the computation required for speech detection on a CDMA terminal using the reconstructed signal and the pseudo-signal according to the invention.

| Method | Multiplications | Additions | Total computation |
|---|---|---|---|
| Pseudo-signal | 16,000/s | 40,000/s | 1 MIPS |
| Reconstructed signal | 320,000/s | 720,000/s | 15 MIPS |

Claims

1. A speech detection method using only speech feature parameters generated by a speech vocoder, the method comprising:

(a) extracting the speech feature parameters from the vocoder;

(b) generating a pseudo-signal using the speech feature parameters;

(c) generating a per-frame pseudo-signal gain by summing the absolute values of the pseudo-signal;

(d) obtaining an overlapping average of the pseudo-signal gain over time;

(e) determining, from the overlapping-average result of the pseudo-signal gain and the results for previous frames, whether the current frame is speech or a silent interval between speech segments, and reporting start and end position information that includes the silent intervals between speech segments; and

(f) correcting the speech output result by performing step (e) again, using the result of step (e), within a predetermined number of frames before and after the start and end positions of the speech interval.

2. The method of claim 1, wherein in step (b) the pseudo-signal x(i) is generated by

$$x(i) = w(i) \cdot G + x(i - L) \cdot B$$

(where i is the time variable corresponding to the sample index, G is the per-frame gain value, L is the per-frame pitch, B is the per-frame pitch gain, x(i) is the pseudo-signal, and w(i) is the white signal).

3. The method of claim 2, wherein in step (c) the pseudo-signal gain s(j) is generated by

$$s(j) = \operatorname{abs\_sum}\bigl(x(jI) \sim x((j+1)I - 1)\bigr)$$

(where j is the frame index, j = i / I, I is the frame size, and abs_sum(x(k) ~ x(m)) denotes the sum of the absolute values from the k-th to the m-th pseudo-signal sample).

4. The method of claim 3, wherein in step (d) the overlapping average nG(j) of the pseudo-signal gain is calculated by

$$nG(j) = \frac{s(j-3) + s(j-2) + s(j-1) + s(j)}{4}$$

5. A speech detection apparatus for detecting speech using only speech feature parameters generated by a speech vocoder, the apparatus comprising:

an input interface that extracts the speech feature parameters from the vocoder's per-frame packet data within a predetermined time;

a pseudo-signal generation unit that generates a pseudo-signal using the speech feature parameters;

a pseudo-signal gain generation unit that generates a per-frame pseudo-signal gain by summing the absolute values of the pseudo-signal;

a gain average calculation unit that obtains an overlapping average of the pseudo-signal gain over time;

a frame state determination unit that determines, from the result of the gain average calculation unit and the results for previous frames, whether the current frame is speech or a silent interval between speech segments, and reports start and end position information that includes the silent intervals between speech segments; and

a post-processing unit that corrects the speech output result by re-running the frame state determination unit, using its result, within a predetermined number of frames before and after the start and end positions of the speech interval.

6. The method of claim 1, wherein step (e) comprises:

(e1) comparing the averaged pseudo-signal gain of the current frame with a first threshold corresponding to a silent interval and a second threshold corresponding to a speech interval;

(e2) determining that the frame is silence if the averaged pseudo-signal gain of the current frame is smaller than the first threshold, that the frame is speech if it is larger than the second threshold, and, if it is larger than the first threshold and smaller than the second threshold, that the frame is speech if the immediately preceding frame is speech and silence if the immediately preceding frame is not speech;

(e3) if the frame was determined to be silence in the preceding step, completing speech detection when a speech interval occurred in an earlier state decision and the length from the end of the last speech interval to the current frame is larger than a predetermined speech-input-end threshold; and

(e4) when speech detection is complete, obtaining the start position and end position of the speech within the entire speech input.

7. The method of claim 1, wherein step (f) comprises:

(f1) if speech frames are contiguous, obtaining the pulse interval length from the start and end position information of the contiguous interval (or speech pulse interval);

(f2) if the pulse interval is shorter than a predetermined value, merging it into the silent interval as ambient noise rather than speech; and

(f3) taking the result of step (f2) into account, checking whether any silent interval is longer than the predetermined speech-input-end threshold; if the first or last pulse was merged into silence in step (f2), changing the start and end of the whole speech interval accordingly; and, if there are three or more pulses and a pulse other than the first or last is merged into silence so that the gap between the speech immediately before and after it exceeds the predetermined speech-input-end threshold, dividing the whole speech interval into two parts, obtaining the start position, end position, and length of each part, and selecting the longer of the two parts.