KR100774800B1 - Segment-level speech/nonspeech classification apparatus and method utilizing the Poisson polling technique

Info

Publication number: KR100774800B1
Application number: KR1020060085461A
Authority: KR
Inventors: 서영주; 김회린; 지미경
Original assignee: 한국정보통신대학교 산학협력단
Priority date: 2006-09-06
Filing date: 2006-09-06
Publication date: 2007-11-07

Abstract

A segment-level speech/non-speech classification method using the Poisson polling technique, and an apparatus therefor, are provided. They accurately classify whether the input signal in a segment interval is speech or non-speech while requiring only a small amount of computation and a small memory. A signal input unit (100) receives a speech or non-speech signal and converts it into a digital signal. A feature extractor (101) divides the digital signal from the signal input unit (100) into frames and extracts a feature parameter for each frame. A score converter (102) receives the frame-level feature parameter from the feature extractor (101), compares it with speech and non-speech patterns, and converts the degree of speech or non-speech into a score. A score quantizer (103) quantizes the scores obtained by the score converter (102). A speech/non-speech classifier (104) determines, segment by segment, whether the input signal is speech or non-speech by Poisson polling, using the frame-level scores and the quantized scores supplied by the score converter (102) and the score quantizer (103). An output unit (105) outputs the speech/non-speech decision made by the speech/non-speech classifier (104).

Description

Segment-level speech/nonspeech classification apparatus and method utilizing the Poisson polling technique

FIG. 1 is a block diagram of a segment-level speech/non-speech classification apparatus using the Poisson polling technique according to a preferred embodiment of the present invention;

FIG. 2 is a flowchart of a segment-level speech/non-speech classification method using the Poisson polling technique according to a preferred embodiment of the present invention;

FIG. 3 is a block diagram showing the score converter of FIG. 1 in more detail;

FIG. 4 is a flowchart of the basic Poisson polling-based classification technique among the speech/non-speech classification techniques of FIG. 2;

FIG. 5 is a flowchart of the classification technique using global-based soft-counting Poisson polling among the speech/non-speech classification techniques of FIG. 2;

FIG. 6 is a flowchart of the classification technique using class-based soft-counting Poisson polling among the speech/non-speech classification techniques of FIG. 2;

FIG. 7 is a flowchart of the mixture-model-based Poisson polling technique among the speech/non-speech classification techniques of FIG. 2.

The present invention relates to a speech recognizer that detects only the speech portion and excludes the non-speech portion in a speech recognition system. More particularly, it relates to a segment-level speech/non-speech classification method and apparatus using the Poisson polling technique, which exclude non-speech noise input in advance so that the speech recognizer can be operated efficiently in environments where noise occurs frequently.

The emergence of next-generation growth industries such as intelligent robots and home network services is increasing interest in speech recognition technology, which can play an important role in conveying information between humans and machines. Most speech recognition systems being developed today use a speech detector as a preprocessor for effective input of the speech signal. The most basic function of the speech detector in a speech recognition system is to detect only the speech portion of the input and exclude the non-speech portion.

As a result of extensive research in this field, speech detectors with relatively accurate performance have been applied to speech recognition. Most of these detectors use a simple method: they take frame energy, zero crossing rate, or filter-bank energy as input and compare these values with internally set thresholds. Consequently, such detectors readily mistake non-speech input for speech when its energy or zero crossing rate is at a level similar to that of speech. In addition, the representative conventional speech/non-speech classification technique operating at the segment level, a unit longer in time than a frame, uses a hidden Markov model that entails complex computation in the decoding stage, so it is unsuitable as the preprocessor of a speech recognition system that requires low computation.

If a non-speech signal mistaken for speech by the speech detector is passed unchanged to the speech recognition system, the system must perform an unnecessary recognition process. Such non-speech input also causes the system to output a misrecognized result. Non-speech input therefore degrades recognition performance as well as operational efficiency.

In particular, when a speech recognition system is operated in an environment where non-speech noise occurs frequently, such as a home, the low signal-to-noise ratio makes it more likely that non-speech noise is mistaken for speech, which is a major obstacle to adopting speech recognition technology. Accordingly, to operate a speech recognition system efficiently in such noisy environments, there has been a need for an accurate and efficient speech/non-speech classification technique that can effectively exclude non-speech noise from the input of the system.

The present invention has been made in view of the above, and its object is to provide a segment-level speech/non-speech classification method and apparatus that accurately classify whether the input signal in a segment interval is speech or non-speech and that require only low computation and a small memory.

To achieve the above object, a speech/non-speech classification method according to the present invention comprises: receiving a speech or non-speech signal and converting it into a digital signal; dividing the digital signal into frames and extracting a feature parameter for each frame; comparing the frame-level feature parameter with speech and non-speech patterns and converting the degree of speech or non-speech into a score; quantizing the scores; determining, for each segment, whether it is speech or non-speech by Poisson polling from the frame-level scores and the quantized scores; and outputting the determined speech/non-speech information.

Here, the step of converting the degree of speech or non-speech into a score preferably includes: extracting likelihoods for the speech class and the non-speech class using the frame-level feature parameter; converting the extracted speech-class likelihood and non-speech-class likelihood to logarithms; and generating the frame-level score by subtracting the log non-speech-class likelihood from the log speech-class likelihood.

The step of determining speech/non-speech may include: estimating an observation frequency f(q) from the quantized frame-level scores; extracting a speech/non-speech log-likelihood ratio from the observation frequency f(q) by the basic Poisson polling technique (Equation 2 in the description below); and comparing the extracted log-likelihood ratio with a predetermined threshold, determining the segment to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

Alternatively, the step of determining speech/non-speech may include: estimating a global-based segment-level observation frequency from the frame-level scores and the quantized frame-level scores; extracting a speech/non-speech log-likelihood ratio from that observation frequency by the global-based soft-counting Poisson polling technique (Equation 5 in the description below); and comparing the extracted log-likelihood ratio with a predetermined threshold, determining the segment to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

Alternatively, the step of determining speech/non-speech may include: estimating a class-based segment-level observation frequency from the frame-level scores and the quantized frame-level scores; extracting a speech/non-speech log-likelihood ratio from that observation frequency by the class-based Poisson polling technique (Equation 8 in the description below); and comparing the extracted log-likelihood ratio with a predetermined threshold, determining the segment to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

Alternatively, the step of determining speech/non-speech may include: obtaining a mixture probability factor P_m(F|S) for the speech segment and a mixture probability factor P_m(F|N) for the non-speech segment from one of (i) the observation frequency f(q) estimated from the quantized frame-level scores, (ii) the global-based segment-level observation frequency estimated from the frame-level scores and the quantized frame-level scores, and (iii) the class-based segment-level observation frequency estimated from the frame-level scores and the quantized frame-level scores; extracting a speech/non-speech log-likelihood ratio by mixture-model-based Poisson polling from the two mixture probability factors; and comparing the extracted log-likelihood ratio with a predetermined threshold, determining the segment to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

To achieve the above object, a speech/non-speech classification apparatus according to the present invention comprises: a signal input unit that receives a speech or non-speech signal and converts it into a digital signal; a feature extractor that divides the digital signal from the signal input unit into frames and extracts a feature parameter for each frame; a score converter that receives the frame-level feature parameter from the feature extractor, compares it with speech and non-speech patterns, and converts the degree of speech or non-speech into a score; a score quantizer that quantizes the scores obtained by the score converter; a speech/non-speech classifier that determines, for each segment, whether it is speech or non-speech by Poisson polling from the frame-level scores and the quantized scores supplied by the score converter and the score quantizer; and an output unit that outputs the speech/non-speech information determined by the speech/non-speech classifier.

Here, the score converter preferably comprises: a speech-class likelihood extractor that receives the frame-level feature parameter and extracts a likelihood for the speech class; a non-speech-class likelihood extractor that receives the frame-level feature parameter and extracts a likelihood for the non-speech class; a first log converter that converts the output of the speech-class likelihood extractor to a logarithm; a second log converter that converts the output of the non-speech-class likelihood extractor to a logarithm; and a subtractor that subtracts the output of the second log converter from the output of the first log converter and outputs the frame-level score.

The speech/non-speech classifier may determine whether a segment is speech or non-speech according to any one of four decision rules, which correspond to the four determination methods described above (basic, global-based soft-counting, class-based soft-counting, and mixture-model-based Poisson polling).

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The following embodiments are provided so that those of ordinary skill in the art can fully understand the invention; they may be modified in various other forms, and the scope of the invention is not limited to the embodiments described below.

(Embodiment)

Hereinafter, an embodiment according to the present invention is described in detail with reference to the illustrative drawings.

FIG. 1 is a block diagram of a segment-level speech/non-speech classification apparatus using the Poisson polling technique according to a preferred embodiment of the present invention, and FIG. 2 is a flowchart of the corresponding segment-level speech/non-speech classification method.

Referring to FIG. 1, the segment-level speech/non-speech classification apparatus using the Poisson polling technique consists of a signal input unit (100), a feature extractor (101), a score converter (102), a score quantizer (103), a speech/non-speech classifier (104), and an output unit (105).

The signal input unit (100) is a device that receives a signal and converts a speech or non-speech signal into a digital signal (step 200). The feature extractor (101) divides the digital signal from the signal input unit (100) into frames and extracts, for each frame, feature parameters effective for speech/non-speech classification (step 201).
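As a minimal sketch of steps 200 and 201, the framing could look as follows. The frame length, hop size, and the two features (log energy and zero crossing rate) are assumptions chosen for illustration; the text does not specify which feature parameters the feature extractor (101) uses.

```python
import math

def split_into_frames(samples, frame_len=200, hop=80):
    """Split a sample sequence into overlapping frames.

    frame_len and hop are assumed values, not taken from the patent.
    """
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(list(samples[start:start + frame_len]))
        start += hop
    return frames

def frame_features(frame, eps=1e-10):
    """Return (log energy, zero crossing rate) for one frame; illustrative only."""
    log_energy = math.log(sum(x * x for x in frame) + eps)
    # Count sign changes between consecutive samples.
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    zcr = crossings / (len(frame) - 1)
    return log_energy, zcr
```

Each frame then yields one feature vector X(l), which is what the score converter consumes in the next step.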

The score converter (102) receives the frame-level feature parameters extracted by the feature extractor (101), compares them with speech and non-speech patterns, and converts the degree of speech or non-speech into a numerical score (step 202). The score quantizer (103) quantizes the scores obtained by the score converter (102) so that they can be used by the Poisson polling technique (step 203).
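The quantization in step 203 can be sketched as below. A uniform quantizer with clipping is assumed here, and `q_levels`, `r_min`, and `r_max` are hypothetical parameters; the text does not state how the quantization levels are chosen. Levels are indexed from 0 for convenience.

```python
def quantize_score(r, q_levels=8, r_min=-4.0, r_max=4.0):
    """Map a frame-level score r to a level index in 0..q_levels-1.

    A uniform quantizer over [r_min, r_max] is assumed; scores outside
    the range are clipped to the first or last level.
    """
    if r <= r_min:
        return 0
    if r >= r_max:
        return q_levels - 1
    step = (r_max - r_min) / q_levels
    # min() guards against floating-point rounding at the upper edge.
    return min(int((r - r_min) / step), q_levels - 1)
```

Quantizing the scores turns each segment into a sequence of discrete level indices, which is what makes counting per-level frequencies possible.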

The speech/non-speech classifier (104) determines, for each segment, whether it is speech or non-speech by Poisson polling, using the frame-level scores and the quantized scores obtained from the frames contained in the analysis segment (step 204).

The output unit (105) outputs the speech/non-speech information determined by the speech/non-speech classifier (step 205).

FIG. 3 is a block diagram showing the score converter (102) of FIG. 1 in more detail. The score converter (102) consists of a speech-class likelihood extractor (300a), a non-speech-class likelihood extractor (300b), a first log converter (301a), a second log converter (301b), and a subtractor (302).

As described with reference to FIG. 1, the frame-level score converter (102) for speech/non-speech classification receives, for each analysis frame, feature parameters effective for speech/non-speech classification and converts the degree of speech or non-speech into a numerical value. The frame-level score conversion based on the log-likelihood ratio is expressed by Equation 1.

r(l) = log P(X(l)|Θ_S) − log P(X(l)|Θ_N)    (Equation 1)

Here, X(l) is the feature parameter vector extracted from the l-th analysis frame. P(X(l)|Θ_S) and P(X(l)|Θ_N) are the likelihoods of the feature parameter X(l) for the speech class and the non-speech class, respectively.

That is, as shown in FIG. 3, the speech-class likelihood extractor (300a) and the non-speech-class likelihood extractor (300b) receive the frame-level feature parameter X(l) and extract the likelihoods for the speech class and the non-speech class. The first log converter (301a) then converts the output of the speech-class likelihood extractor to a logarithm, and the second log converter (301b) converts the output of the non-speech-class likelihood extractor to a logarithm. The subtractor (302) subtracts the output of the second log converter (301b) from the output of the first log converter (301a) and outputs the frame-level score r(l).
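The score converter can be sketched as follows. The text does not say which statistical model supplies the class likelihoods, so a diagonal-covariance Gaussian per class is assumed here purely for illustration; `log_gaussian` and the `(mean, var)` model tuples are hypothetical names.

```python
import math

def log_gaussian(x, mean, var):
    """Log-density of a diagonal-covariance Gaussian at feature vector x (assumed model)."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )

def frame_score(x, speech_model, nonspeech_model):
    """Frame-level score: log speech likelihood minus log non-speech likelihood."""
    log_p_speech = log_gaussian(x, *speech_model)        # extractor 300a + log converter 301a
    log_p_nonspeech = log_gaussian(x, *nonspeech_model)  # extractor 300b + log converter 301b
    return log_p_speech - log_p_nonspeech                # subtractor 302
```

The sketch evaluates the log-density directly instead of taking the logarithm of a likelihood, which gives the same score but avoids underflow for very small likelihoods. A positive score means the frame looks more like speech than non-speech under the assumed models.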

The segment-level Poisson polling-based speech/non-speech classifier and classification method shown in FIG. 1 and FIG. 2 are described in more detail below.

The segment-level Poisson polling-based speech/non-speech classifier (104) applies the Poisson polling technique to the frame-level scores and quantized scores produced by the score converter (102) and the score quantizer (103), and classifies each segment as speech or non-speech.

In the present invention, one of three schemes can be selected according to the method used to estimate the observation frequency: segment-level speech/non-speech classification based on basic Poisson polling, on global-based soft-counting Poisson polling, or on class-based soft-counting Poisson polling. In addition, to represent the diverse patterns of speech and non-speech segments effectively, the mixture-model Poisson polling technique can be applied to each of these three schemes.

These four techniques are described in detail below.

(1) Segment-level speech/non-speech classification based on basic Poisson polling

FIG. 4 is a flowchart of the basic Poisson polling-based classification technique among the speech/non-speech classification techniques of FIG. 2.

The segment-level speech/non-speech classification technique based on basic Poisson polling first estimates the observed frequency f(q) from the quantized frame-level scores (step 400). Next, as on the left-hand side of Equation 2, it extracts the speech/non-speech log-likelihood ratio from the observed frequency by the basic Poisson polling technique (step 401), and then compares the extracted log-likelihood ratio with a predetermined threshold to decide whether the segment is speech or non-speech (step 402). That is, as in Equation 2, the segment is determined to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

Here, Q is the total number of quantization levels used when the frame-level speech/non-speech classification scores are quantized. λ_S(q) and λ_N(q) are the expected frequencies of frame-level scores quantized to level q in speech segments and non-speech segments, respectively; they are obtained from speech and non-speech segments in the training phase. THR_B-PPM is the threshold used for speech/non-speech classification by basic Poisson polling. f(q) is the observed frequency, under the basic scheme, of frame-level scores quantized to level q within the analysis segment, and is estimated as in Equation 3.
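Equation 2 itself is not reproduced in this text. Assuming that the count at each quantization level follows an independent Poisson distribution with mean λ_S(q) for speech segments and λ_N(q) for non-speech segments, the log-likelihood ratio would take the standard form below; this is a reconstruction under that assumption, and the patent's own equation may differ in normalization or indexing.

```latex
\begin{aligned}
P(f \mid \lambda) &= \prod_{q=1}^{Q} \frac{\lambda(q)^{f(q)}\, e^{-\lambda(q)}}{f(q)!} \\[4pt]
\log \frac{P(f \mid \lambda_S)}{P(f \mid \lambda_N)}
  &= \sum_{q=1}^{Q} \left[ f(q) \log \frac{\lambda_S(q)}{\lambda_N(q)}
     - \bigl( \lambda_S(q) - \lambda_N(q) \bigr) \right]
  \;\gtrless\; THR_{B\text{-}PPM}
\end{aligned}
```

The factorial terms cancel in the ratio, so only the observed counts and the two sets of trained expected frequencies are needed at decision time, which is consistent with the stated goal of low computation and small memory.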

Here, L is the length of the analysis segment expressed as a number of frames, and l is the index of an analysis frame belonging to the segment.

The observation frequency factor of the basic Poisson polling technique, P_B(l,q), is obtained as in Equation 4.

Here, r(l), the l-th frame-level score, enters Equation 4 through its quantized value.
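The basic scheme described above can be sketched as follows. Two assumptions are made because Equations 2 to 4 are not reproduced in this text: the factor P_B(l,q) is taken to be 1 when the quantized score of frame l equals q and 0 otherwise (hard counting), and the log-likelihood ratio is taken in the standard Poisson form. Function names and the 0-based level indexing are illustrative.

```python
import math

def hard_counts(quantized, q_levels):
    """f(q): number of frames in the segment whose quantized score equals q."""
    f = [0] * q_levels
    for q in quantized:
        f[q] += 1  # assumed P_B(l, q): 1 for the matching level, 0 otherwise
    return f

def poisson_llr(f, lam_s, lam_n):
    """Poisson log-likelihood ratio of counts f under speech vs. non-speech means.

    lam_s and lam_n must be strictly positive (floor them during training).
    """
    return sum(
        fq * math.log(ls / ln) - (ls - ln)
        for fq, ls, ln in zip(f, lam_s, lam_n)
    )

def classify_segment(quantized, lam_s, lam_n, threshold=0.0):
    """Return 'speech' if the log-likelihood ratio exceeds the threshold."""
    f = hard_counts(quantized, len(lam_s))
    return "speech" if poisson_llr(f, lam_s, lam_n) > threshold else "non-speech"
```

For example, with `lam_s = [1.0, 2.0, 7.0]` and `lam_n = [7.0, 2.0, 1.0]`, a segment whose frames mostly fall in the highest level is classified as speech, and one whose frames mostly fall in the lowest level as non-speech.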

(2) Segment-level speech/non-speech classification by global-based soft-counting Poisson polling

FIG. 5 is a flowchart of the classification technique using global-based soft-counting Poisson polling among the speech/non-speech classification techniques of FIG. 2.

Because the performance of speech/non-speech classification by Poisson polling is directly tied to the accuracy of the observation frequency estimate, a global-based soft-counting method of estimating the observation frequency can be used.

Referring to FIG. 5, the global-based segment-level observation frequency is first estimated from the frame-level scores and the quantized frame-level scores (step 500). Next, as on the left-hand side of Equation 5, the log-likelihood ratio is extracted by the global-based soft-counting Poisson polling technique (step 501), and the extracted log-likelihood ratio is compared with a predetermined threshold to decide whether the given segment is speech or non-speech (step 502). That is, as in Equation 5, the segment is determined to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

Here, the two expected-frequency terms in Equation 5 are the expected frequencies at the q-th quantization level, estimated by global-based soft counting, for speech segments and non-speech segments respectively. THR_GSC-PPM is the threshold used for speech/non-speech classification by global-based soft-counting Poisson polling.

The segment-level observation frequency by global-based soft counting is estimated as in Equation 6.

Here, the observation frequency factor by global-based soft counting, P_G(l,q), is obtained as in Equation 7.

Here, Equation 7 is written in terms of the quantized value of r(l), the l-th frame-level score, and a Euclidean distance measure. C_G(q) denotes the centroid obtained from the frame-level scores quantized to level q.
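The idea of soft counting can be sketched as below: instead of adding 1 to a single level, each frame distributes its count over the levels according to how close its score is to each centroid C_G(q). Equations 6 and 7 are not reproduced in this text, so the weighting used here (a normalized Gaussian kernel on the distance, with a hypothetical width `sigma`) is an assumption, not the patent's formula.

```python
import math

def soft_factors(r, centroids, sigma=1.0):
    """Soft membership of score r in each quantization level.

    Assumed form: Gaussian kernel on the distance |r - C(q)|, normalized so
    the factors sum to 1. The patent's Equation 7 may use a different form.
    """
    d2 = [(r - c) ** 2 for c in centroids]
    d2_min = min(d2)  # shift by the smallest distance to avoid underflow
    weights = [math.exp(-(d - d2_min) / (2.0 * sigma ** 2)) for d in d2]
    total = sum(weights)
    return [w / total for w in weights]

def soft_counts(scores, centroids, sigma=1.0):
    """Soft observation frequency: per-frame soft factors summed over the segment."""
    f = [0.0] * len(centroids)
    for r in scores:
        for q, p in enumerate(soft_factors(r, centroids, sigma)):
            f[q] += p
    return f
```

With this normalization the soft counts still sum to the segment length L, so they can be used in place of hard counts in a Poisson log-likelihood ratio while being less sensitive to scores that fall near a quantization boundary.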

(3) Segment-level speech/non-speech classification by class-based soft-counting Poisson polling

FIG. 6 is a flowchart of the classification technique using class-based soft-counting Poisson polling among the speech/non-speech classification techniques of FIG. 2. Using a class-based soft-counting method of estimating the observation frequency can further improve the reliability of the estimate.

Referring to FIG. 6, the class-based segment-level observation frequency is first estimated from the frame-level scores and the quantized frame-level scores (step 600). Next, as on the left-hand side of Equation 8, the log-likelihood ratio is extracted from the class-based segment-level observation frequency by the class-based Poisson polling technique (step 601), and the extracted log-likelihood ratio is compared with a predetermined threshold to decide whether the given segment is speech or non-speech (step 602). That is, as in Equation 8, the segment is determined to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller.

Here, the two expected-frequency terms in Equation 8 are the expected frequencies at the q-th quantization level, obtained by class-based soft counting, for speech segments and non-speech segments respectively. THR_CSC-PPM is the threshold used for speech/non-speech classification by class-based soft-counting Poisson polling.

The observation frequency by class-based soft counting is estimated as in Equation 9.

여기서, Ws(q)는 양자화 단계 q가 음성일 경우의 확률값을 나타낸다. 클래스 기반 소프트 카운팅형에 의한 관측 빈도수 인자, P_CS(l,q)는 다음의 수학식 10과 같이 구해진다.Here, Ws (q) represents a probability value when the quantization step q is negative. The observation frequency factor, P _CS (1, q), by the class-based soft counting type, is obtained as in Equation 10 below.

Here, C_CS(q) denotes the centroid obtained from the original scores that are quantized to the q-th step, among the frame-level scores extracted from all training speech frames.

In a similar manner, the class-based soft-counting observation frequency factor for non-speech, P_CN(l,q), is obtained as in Equation 11 below.

Here, C_CN(q) denotes the centroid obtained from the original scores that are quantized to the q-th step, among the frame-level scores extracted from all training non-speech frames.
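The estimation in step 600 can be illustrated with the sketch below. Equations 9 to 11 are not reproduced in this text, so the forms used here are assumptions rather than the patent's formulas: each soft-count factor is taken to be a normalized exponential of the negative distance between a frame score and the class centroids, and the speech and non-speech factors are combined per step as Ws(q)·P_CS(l,q) + (1 − Ws(q))·P_CN(l,q) and summed over the L frames of the segment. The names `soft_factors`, `class_based_soft_counts`, and the `scale` parameter are hypothetical.

```python
import math


def soft_factors(score, centroids, scale=1.0):
    """Assumed soft-count factor: a frame score is spread over all
    quantization steps in proportion to exp(-|score - centroid| / scale)."""
    dists = [abs(score - c) / scale for c in centroids]
    nearest = min(dists)  # subtracting the minimum avoids underflow of every term
    weights = [math.exp(-(d - nearest)) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]


def class_based_soft_counts(scores, c_speech, c_nonspeech, w_speech, scale=1.0):
    """Class-based soft-counting observation frequency per quantization step.

    scores      -- frame-level scores of one segment (length L)
    c_speech    -- centroids C_CS(q) from training speech frames
    c_nonspeech -- centroids C_CN(q) from training non-speech frames
    w_speech    -- Ws(q), probability that step q corresponds to speech
    The weighted combination below is an assumed reading of Equation 9.
    """
    num_steps = len(w_speech)
    freq = [0.0] * num_steps
    for s in scores:
        p_cs = soft_factors(s, c_speech, scale)
        p_cn = soft_factors(s, c_nonspeech, scale)
        for q in range(num_steps):
            freq[q] += w_speech[q] * p_cs[q] + (1.0 - w_speech[q]) * p_cn[q]
    return freq


if __name__ == "__main__":
    # Hypothetical scores and centroids for a four-frame segment, Q = 3.
    print(class_based_soft_counts([-2.0, -0.5, 0.7, 3.0],
                                  [-1.5, 0.5, 2.5], [-2.5, -0.5, 1.0],
                                  [0.2, 0.5, 0.8]))
```

Compared with hard counting, every frame contributes a fractional amount to several steps, which is what makes the frequency estimate less sensitive to scores that fall near a quantization boundary.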

(4) Mixture-model-based Poisson polling technique

The distribution of the frame-level scores extracted from the frames of a speech segment shows various patterns depending on the durations of the speech portions and silences within the segment, the degree of speech, and so on. Likewise, non-speech segments show varied frame-level score distributions depending on the type of non-speech noise. Therefore, when speech or non-speech segments with such varied patterns are modeled by Poisson probability distributions, modeling accuracy can be improved by obtaining a plurality of representative segment patterns and applying a mixture of Poisson probability models to them, instead of using a single Poisson probability model.

FIG. 7 is a flowchart illustrating the mixture-model-based Poisson polling technique among the speech/non-speech classification methods shown in FIG. 2. The mixture-model-based Poisson polling technique uses the segment-level observation frequency estimated by one of the three techniques described above: segment-level speech/non-speech classification by basic Poisson polling, by global-based soft-counting Poisson polling, or by class-based soft-counting Poisson polling.

First, the mixture probability factor for the speech segment and the mixture probability factor for the non-speech segment are obtained from the segment-level observation frequency estimated by one of the three techniques, that is, from f(q) or from one of the two soft-counting estimates (steps 700a and 700b). The mixture probability factor for the speech segment, P_m(F|S), and the mixture probability factor for the non-speech segment, P_m(F|N), are defined as in Equation 12 below.

Here, λ_S(m,q) and λ_N(m,q) denote the average frequencies of the representative speech and non-speech segments, respectively, for the m-th mixture pattern and the q-th quantization step.

Next, as in the left-hand side of Equation 13, the speech/non-speech log-likelihood ratio by mixture-model-based Poisson polling is estimated from the mixture probability factors for the speech segment and the non-speech segment (step 701), and the speech/non-speech decision is then made from the estimated log-likelihood ratio (step 702). That is, as in Equation 13, the segment is judged to be a speech segment if the log-likelihood ratio is greater than the threshold, and a non-speech segment if it is smaller than the threshold.

Here, M is the total number of mixture patterns used when a speech or non-speech segment is modeled by its mixture segment patterns. π_S(m) and π_N(m) denote the m-th mixture weights for speech and non-speech segments, respectively. THR_M-PPM is the threshold used for speech/non-speech classification by the mixture-model Poisson polling technique.
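Steps 700a to 702 can be sketched as follows. Equations 12 and 13 are not reproduced in this text, so the forms below are assumptions: each mixture probability factor is taken to be a product of independent Poisson terms over the quantization steps, and the log-likelihood ratio is the difference between the logarithms of the π-weighted sums of these factors for speech and for non-speech. The function names are hypothetical, and `math.lgamma` stands in for the factorial so that fractional soft counts are accepted.

```python
import math


def log_poisson_factor(freq, lam):
    """log P_m(F | class) for one mixture pattern, assuming independent
    Poisson counts per quantization step; lgamma(f + 1) generalizes log f!."""
    return sum(f * math.log(l) - l - math.lgamma(f + 1.0)
               for f, l in zip(freq, lam))


def log_mixture(freq, weights, lams):
    """log of sum_m pi(m) * P_m(F | class), computed with the log-sum-exp
    trick for numerical stability. Mixture weights must be positive."""
    terms = [math.log(w) + log_poisson_factor(freq, lam)
             for w, lam in zip(weights, lams)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def mixture_llr(freq, pi_s, lam_s, pi_n, lam_n):
    """Step 701: speech/non-speech log-likelihood ratio under the mixtures."""
    return log_mixture(freq, pi_s, lam_s) - log_mixture(freq, pi_n, lam_n)


def classify_segment_mixture(freq, pi_s, lam_s, pi_n, lam_n, threshold=0.0):
    """Step 702: threshold decision (threshold value is a hypothetical default)."""
    llr = mixture_llr(freq, pi_s, lam_s, pi_n, lam_n)
    return "speech" if llr > threshold else "non-speech"


if __name__ == "__main__":
    # Hypothetical example with M = 2 patterns per class and Q = 3 steps.
    freq = [1.0, 2.0, 7.0]
    lam_s = [[1.0, 2.0, 7.0], [2.0, 3.0, 5.0]]
    lam_n = [[7.0, 2.0, 1.0], [5.0, 3.0, 2.0]]
    print(classify_segment_mixture(freq, [0.6, 0.4], lam_s, [0.5, 0.5], lam_n))
```

With M = 1 the mixture collapses to a single Poisson model per class, so this sketch then gives the same ratio as the single-model case.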

As described above, classifying the input signal as speech or non-speech segment by segment according to the present invention increases the operating efficiency of a speech recognition system operating in a noisy environment and, at the same time, blocks misrecognition caused by non-speech noise, thereby preventing performance degradation. In addition, the present invention can be applied effectively to various speech processing fields that require discriminating speech from non-speech in an input signal over a given interval.

Although the present invention has been described in detail with reference to preferred embodiments, it is not limited to those embodiments, and various modifications can be made by those of ordinary skill in the art within the scope of the technical idea of the invention.

Claims

(a) receiving a speech or non-speech signal and converting it into a digital signal;

(b) dividing the digital signal into frames and extracting feature parameters for each frame;

(c) converting the degree of speech or non-speech into a score by comparing the frame-level feature parameters with speech and non-speech patterns;

(d) quantizing the scores;

(e) determining, for each segment, whether the signal is speech or non-speech by Poisson polling on the frame-level scores and the quantized scores; and

(f) outputting the determined speech/non-speech information.

The method of claim 1, wherein step (c) comprises:

extracting likelihoods for the speech class and the non-speech class from the frame-level feature parameters;

converting the extracted speech-class likelihood and non-speech-class likelihood into logarithms; and

subtracting the log non-speech-class likelihood from the log speech-class likelihood to generate a frame-level score.

The method of claim 1 or 2, wherein step (e) comprises:

estimating the observation frequency f(q) from the quantized frame-level scores;

extracting, from the observation frequency f(q), a speech/non-speech log-likelihood ratio by the basic Poisson polling technique according to

[Equation]

and comparing the extracted log-likelihood ratio with a predetermined threshold, judging the segment to be a speech segment if the log-likelihood ratio is greater than the threshold and a non-speech segment if it is smaller than the threshold,

wherein Q is the total number of quantization steps, and λ_S(q) and λ_N(q) are the average frequencies of the frame-level scores quantized to q in speech and non-speech segments, respectively.

The method of claim 3, wherein the observation frequency f(q) is estimated as

[Equation]

where L is the number of frames representing the length of the analysis segment and l is the index of an analysis frame belonging to the segment,

the observation frequency factor P_B(l,q) is obtained as

[Equation]

and [symbol] is the quantized value of the l-th frame-level score.

The method of claim 1 or 2, wherein step (e) comprises:

estimating a global-based segment-level observation frequency [symbol] from the frame-level scores and the quantized frame-level scores; and

extracting, from the observation frequency [symbol], a speech/non-speech log-likelihood ratio by the global-based soft-counting Poisson polling technique according to

[Equation]

wherein Q is the total number of quantization steps, and [symbol] and [symbol] are the average frequencies at the q-th quantization step estimated by global-based soft counting in speech and non-speech segments, respectively.

The method of claim 5, wherein the observation frequency [symbol] is estimated as

[Equation]

P_G(l,q) is obtained as

[Equation]

[symbol] is the quantized value of the l-th frame-level score, [symbol] is the Euclidean distance measure, and C_G(q) is the centroid obtained from the frame-level scores quantized to step q.

The method of claim 1 or 2, wherein step (e) comprises:

estimating a class-based segment-level observation frequency [symbol] from the frame-level scores and the quantized frame-level scores; and

extracting, from the observation frequency [symbol], a speech/non-speech log-likelihood ratio by the class-based Poisson polling technique according to

[Equation]

wherein Q is the total number of quantization steps, and [symbol] and [symbol] are the average frequencies at the q-th quantization step obtained by class-based soft counting in speech and non-speech segments, respectively.

The method of claim 7, wherein the observation frequency [symbol] is estimated as

[Equation]

where L is the number of frames representing the length of the analysis segment, l is the index of an analysis frame belonging to the segment, and Ws(q) is the probability that quantization step q corresponds to speech,

the observation frequency factor P_CS(l,q) is obtained as

[Equation]

where C_CS(q) is the centroid obtained from the original scores quantized to the q-th step among the frame-level scores extracted from all training speech frames, and

the observation frequency factor P_CN(l,q) is obtained as

[Equation]

where C_CN(q) is the centroid obtained from the original scores quantized to the q-th step among the frame-level scores extracted from all training non-speech frames.

The method of claim 1, wherein step (e) comprises:

obtaining a mixture probability factor P_m(F|S) for the speech segment and a mixture probability factor P_m(F|N) for the non-speech segment, from one of the observation frequency f(q) estimated from the quantized frame-level scores and the global-based segment-level observation frequency [symbol] estimated from the frame-level scores and the quantized frame-level scores, as

[Equation]

and extracting, from the mixture probability factor for the speech segment and the mixture probability factor for the non-speech segment, a speech/non-speech log-likelihood ratio by mixture-model-based Poisson polling according to

[Equation]

wherein Q is the total number of quantization steps, λ_S(m,q) and λ_N(m,q) are the average frequencies of the representative speech and non-speech segments for the m-th mixture pattern and the q-th quantization step, M is the total number of mixture patterns used when a speech or non-speech segment is modeled by its mixture segment patterns, and π_S(m) and π_N(m) are the m-th mixture weights for speech and non-speech segments, respectively.

The method of claim 9, wherein

the observation frequency f(q) is estimated as

[Equation]

the observation frequency factor P_B(l,q) is obtained as

[Equation]

where [symbol] is the quantized value of the l-th frame-level score,

the observation frequency [symbol] is estimated as

[Equation]

P_G(l,q) is obtained as

[Equation]

where [symbol] is the quantized value of the l-th frame-level score, [symbol] is the Euclidean distance measure, and C_G(q) is the centroid obtained from the frame-level scores quantized to step q,

the observation frequency [symbol] is estimated as

[Equation]

the observation frequency factor P_CS(l,q) is

[Equation]

and the observation frequency factor P_CN(l,q) is

[Equation]

a signal input unit for receiving a speech or non-speech signal and converting it into a digital signal;

a feature extractor for dividing the digital signal input from the signal input unit into frames and extracting feature parameters for each frame;

a score converter for receiving the frame-level feature parameters extracted by the feature extractor and converting the degree of speech or non-speech into a score by comparing them with speech and non-speech patterns;

a score quantizer for quantizing the scores obtained by the score converter;

a speech/non-speech classifier for determining, for each segment, whether the signal is speech or non-speech by Poisson polling on the frame-level scores and the quantized scores input from the score converter and the score quantizer; and

an output unit for outputting the speech/non-speech information determined by the speech/non-speech classifier.

The apparatus of claim 11, wherein the score converter comprises:

a speech-class likelihood extractor for receiving the frame-level feature parameters and extracting a likelihood for the speech class;

a non-speech-class likelihood extractor for receiving the frame-level feature parameters and extracting a likelihood for the non-speech class;

a first log converter for receiving the output of the speech-class likelihood extractor and converting it into a logarithm;

a second log converter for receiving the output of the non-speech-class likelihood extractor and converting it into a logarithm; and

a subtractor for subtracting the output of the second log converter from the output of the first log converter and outputting a frame-level score.

The apparatus of claim 11 or 12, wherein the speech/non-speech classifier determines a speech or non-speech segment according to

[Equation]

where f(q) is the observation frequency estimated from the quantized frame-level scores, Q is the total number of quantization steps, λ_S(q) and λ_N(q) are the average frequencies of the frame-level scores quantized to q in speech and non-speech segments, respectively, and THR_B-PPM is the threshold used for speech/non-speech classification by basic Poisson polling.

The apparatus of claim 13, wherein

the observation frequency f(q) is estimated as

[Equation]

the observation frequency factor P_B(l,q) is obtained as

[Equation]

and [symbol] is the quantized value of the l-th frame-level score.

The apparatus of claim 11 or 12, wherein the speech/non-speech classifier determines a speech or non-speech segment according to

[Equation]

where [symbol] is the global-based segment-level observation frequency estimated from the frame-level scores and the quantized frame-level scores, Q is the total number of quantization steps, [symbol] and [symbol] are the average frequencies at the q-th quantization step estimated by global-based soft counting in speech and non-speech segments, respectively, and THR_GSC-PPM is the threshold used for speech/non-speech classification by global-based soft-counting Poisson polling.

The apparatus of claim 15, wherein

the observation frequency [symbol] is estimated as

[Equation]

P_G(l,q) is obtained as

[Equation]

and [symbol] is the quantized value of the l-th frame-level score.

The apparatus of claim 11 or 12, wherein the speech/non-speech classifier determines a speech or non-speech segment according to

[Equation]

where [symbol] is the class-based segment-level observation frequency estimated from the frame-level scores and the quantized frame-level scores, Q is the total number of quantization steps, [symbol] and [symbol] are the average frequencies at the q-th quantization step obtained by class-based soft counting in speech and non-speech segments, respectively, and THR_CSC-PPM is the threshold used for speech/non-speech classification by class-based soft-counting Poisson polling.

The apparatus of claim 17, wherein

the observation frequency [symbol] is estimated as

[Equation]

the observation frequency factor P_CS(l,q) is

[Equation]

and the observation frequency factor P_CN(l,q) is

[Equation]

The apparatus of claim 11 or 12, wherein the speech/non-speech classifier determines a speech or non-speech segment according to

[Equation]

where the mixture probability factor P_m(F|S) for the speech segment and the mixture probability factor P_m(F|N) for the non-speech segment are obtained, from one of the observation frequency f(q) estimated from the quantized frame-level scores and the global-based segment-level observation frequency [symbol] estimated from the frame-level scores, as

[Equation]

Q is the total number of quantization steps, λ_S(m,q) and λ_N(m,q) are the average frequencies of the representative speech and non-speech segments for the m-th mixture pattern and the q-th quantization step, M is the total number of mixture patterns used when a speech or non-speech segment is modeled by its mixture segment patterns, π_S(m) and π_N(m) are the m-th mixture weights for speech and non-speech segments, respectively, and THR_M-PPM is the threshold used for speech/non-speech classification by the mixture-model Poisson polling technique.

The apparatus of claim 19,