KR100391123B1

KR100391123B1 - speech recognition method and system using every single pitch-period data analysis

Info

Publication number: KR100391123B1
Application number: KR10-2001-0004229A
Authority: KR
Inventors: 이태성
Original assignee: 이태성
Priority date: 2001-01-30
Filing date: 2001-01-30
Publication date: 2003-07-12
Also published as: KR20020063665A

Abstract

The present invention relates to a speech recognition technique that improves processing speed by performing pattern matching on a phoneme basis using pitch detection.

The system of the present invention comprises: a pitch detector that receives a speech signal, detects the pitch, and outputs pitch position information; a pitch data analyzer that receives the pitch position information from the pitch detector, normalizes the pitch data, extracts feature vectors, and compares them with pre-registered standard pitch data to generate a voiced phoneme string; a silent section searcher that receives the speech signal, detects silent sections, and outputs silent section information; an unvoiced sound identifier that receives the pitch position information and the silent section information to determine unvoiced sections, and receives the speech signal to output an unvoiced phoneme string; a syllable separator that receives the voiced phoneme string, the unvoiced phoneme string, and the silent section information and generates monosyllables corresponding to phonetic characters; and a language analyzer that receives the monosyllables from the syllable separator and applies grammar rules to generate a standard-language string. According to the present invention, the pitch of the speech signal is detected and phonemes are identified by pattern matching on each pitch period, so the database and the computational load are small and speech can be recognized accurately even by a computer of relatively small capacity.

Description

Speech recognition method and system using every single pitch-period data analysis

The present invention relates to speech recognition technology, and more particularly to a speech recognition technology that improves processing speed by performing pattern matching on a phoneme basis using pitch detection.

In general, speech recognition refers to the series of processes that express an acoustic signal uttered by a speaker through a microphone or telephone as words or phrases that a human can understand. The finally recognized words or phrases are then used for commands or control of a computer or machine, data entry, document preparation, and the like.

Speech recognition consists of an AD (Acoustic Decoder), which recognizes phonemes, syllables, or words from the speech signal, and an LD, which recognizes sentences by combining the AD's recognition results with linguistic information; the AD field is what is usually called speech recognition. Given an uttered speech pattern, speech recognition performs pattern matching against recognition models and recognizes the pattern as the coefficient (W) of the closest model. The speech signal passes through analog-to-digital conversion (ADC) and enters a speech signal preprocessor, which converts the time-domain speech signal into the frequency domain so that the information inherent in the speech can be recognized more effectively by the recognizer in the next stage. Speech recognition algorithms include DTW (Dynamic Time Warping), HMM (Hidden Markov Model), and neural networks (NN), and their basic theory rests on pattern matching.

Among speech recognition technologies, the method of converting a speech signal into a character string is called Speech-To-Text (STT). As shown in FIG. 14, conventional speech recognition for STT consists of a signal processing step (2), a voice feature extraction step (4), a sound similarity analysis step (6), and a linguistic similarity analysis step (8), and builds characters and sentences from the voice. That is, the input speech signal is digitally amplified in the signal processing step (2), features are transformed in the voice feature extraction step (4), states/phonemes are determined using the sound DB (10) in the sound similarity analysis step (6), and sentences are then determined using the language and pronunciation DBs (12, 14) in the linguistic similarity analysis step (8).

However, because this conventional technology recognizes speech by frequency analysis over a word or a buffer of fixed size, the heavy computation makes recognition slow and the standard database is large. In particular, word-level recognition requires a large-capacity computer for anything beyond small-vocabulary recognition, which makes it difficult to apply to portable devices such as mobile phones and PDAs.

To solve the above problems, an object of the present invention is to provide a pitch detector that analyzes a speech signal in units of one period of pitch data so that speech is recognized on a phoneme basis, and a speech recognition method and system using pitch-unit data analysis.

FIG. 1 is a block diagram showing a speech recognition system according to the present invention;

FIG. 2 is a detailed block diagram showing the pitch detector shown in FIG. 1;

FIG. 3 is a functional block diagram showing the waveform simplification filter shown in FIG. 2;

FIG. 4 is a conceptual diagram of an impulse train filter according to the present invention;

FIG. 5 is a flowchart showing a pitch selection method according to the present invention;

FIG. 6 is a conceptual diagram of the position compensator shown in FIG. 2;

FIG. 7 is a detailed block diagram showing the pitch data analyzer shown in FIG. 1;

FIG. 8 is a detailed block diagram showing the unvoiced sound identifier shown in FIG. 1;

FIG. 9 is a detailed block diagram showing the silent section searcher shown in FIG. 1;

FIG. 10 is a detailed block diagram showing the syllable separator shown in FIG. 1;

FIG. 11 is a detailed block diagram showing the language analyzer shown in FIG. 1;

FIG. 12 is a conceptual diagram showing the concept of the pitch data normalization unit shown in FIG. 7;

FIG. 13 is a diagram showing an example of a speech recognition process according to the present invention; and

FIG. 14 is a diagram showing a general speech recognition procedure.

* Explanation of reference numerals for the main parts of the drawings *

102: pitch detector  104: pitch data analyzer

106: unvoiced sound identifier  108: silent section searcher

110: syllable separator  112: language analyzer

202: waveform simplification filter  204: impulse train generator

206: pitch selector  208: position compensator

702: pitch extractor  704: pitch data normalization unit

706: feature extractor  708: standard pitch database

710: feature vector comparator  802: unvoiced sound classifier

804: unvoiced feature extractor  808: unvoiced feature comparator

To achieve the above object, the speech recognition system of the present invention comprises: a pitch detector that receives a speech signal, detects the pitch, and outputs pitch position information; a pitch data analyzer that receives the pitch position information from the pitch detector, normalizes the pitch data, extracts feature vectors, and compares them with pre-registered standard pitch data to generate a voiced phoneme string; a silent section searcher that receives the speech signal, detects silent sections, and outputs silent section information; an unvoiced sound identifier that receives the pitch position information and the silent section information to determine unvoiced sections, and receives the speech signal to output an unvoiced phoneme string; a syllable separator that receives the voiced phoneme string, the unvoiced phoneme string, and the silent section information and generates monosyllables corresponding to phonetic characters; and a language analyzer that receives the monosyllables from the syllable separator and applies grammar rules to generate a standard-language string.

To achieve the above object, the speech recognition method of the present invention is a speech-to-text (STT) method that receives a speech signal and generates a character string corresponding to it, the method comprising: detecting pitch in the input speech signal; analyzing the detected pitch position information and the speech signal to set pitch sections, silent sections, and unvoiced sections; identifying voiced phonemes by comparing a one-period pitch pattern in the pitch section with predetermined standard data; identifying unvoiced phonemes by comparing a one-period pitch pattern in the unvoiced section with predetermined standard data; and combining the identified voiced and unvoiced phonemes according to predetermined rules to generate monosyllables.

Also, to achieve the above object, the pitch detector of the present invention comprises: a waveform simplification filter that receives a speech signal and simplifies its waveform; an impulse train generator that generates impulse trains of predetermined channels based on the output of the waveform simplification filter; a pitch selector that detects pitch positions by applying predetermined pitch selection rules; and a position compensator that compensates the pitch position detected by the pitch selector by the positional difference between the original input signal and the simplified signal.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram showing a speech recognition system according to the present invention. The speech recognition system (100) consists of a pitch detector (102), a pitch data analyzer (104), an unvoiced sound identifier (106), a silent section searcher (108), a syllable separator (110), and a language analyzer (112); it receives a speech signal and outputs a standard-language string.

Referring to FIG. 1, the pitch detector (102) consists, as shown in FIG. 2, of a waveform simplification filter (202), an impulse train generator (204), a pitch selector (206), and a position compensator (208). It receives the speech signal, generates impulse trains from the signal that has passed through the waveform simplification filter, filters them according to fixed rules, applies the pitch selection rules to find pitch positions, and then compensates for the positional difference between the original input signal and the simplified signal before outputting the pitch position information.

Referring to FIG. 2, the waveform simplification filter (202) is implemented as a 5-point arithmetic mean filter, as shown in FIG. 3. According to FIG. 3, the 5-point arithmetic mean filter is conceptually built from five delay elements (302, 304, 306, 308, 310), four adders (312, 314, 316, 318), and a multiplier (320), and outputs the arithmetic mean of the input signal.
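The 5-point arithmetic mean filter can be illustrated with a minimal Python sketch. The function name `five_point_mean`, the causal alignment of the window, and the treatment of samples before the start of the signal as zero are assumptions made for illustration; the description only states that five delayed samples are summed and scaled by a multiplier.

```python
def five_point_mean(signal):
    """Causal 5-point arithmetic mean (illustrative; alignment is an assumption).

    Samples before the start of the signal are treated as 0, which mirrors
    delay elements that start out empty.
    """
    taps = 5
    out = []
    for n in range(len(signal)):
        window = signal[max(0, n - taps + 1): n + 1]  # up to five most recent samples
        out.append(sum(window) / taps)                # multiplier of 1/5
    return out


smoothed = five_point_mean([5.0, 5.0, 5.0, 5.0, 5.0, 5.0])
```

With a constant input the output ramps up while the delay line fills and then settles at the input value.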

The impulse train generator (204) of FIG. 2 defines a set of impulses that have values only at the local maxima and minima of the input signal. In particular, as shown in FIG. 4, the impulse train filter of the present invention is not simply divided into the conventional signal-ignoring interval and exponential decay interval; instead it is divided into a "pitch-point shift interval proportional to distance" and an "exponential decay interval", so that the pitch period can be detected more accurately by shifting the pitch point. That is, to find the correct pitch point, the present invention moves the pitch point whenever, even inside the signal-ignoring interval, there is an impulse larger than a magnitude proportional to the distance from the previous pitch point. To define the valid range of the pitch detection process, the lower threshold of the input signal is set to about 1.5 times the average noise level. In the embodiment of the present invention, a six-channel impulse train filter is used to output six channels of impulse trains.
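As a rough illustration of impulses that exist only at local maxima and minima, the following Python sketch extracts one peak train and one valley train and discards extrema below 1.5 times an average noise level. It does not reproduce the six channels or the pitch-point shift interval, whose exact definitions are not given here; the function name `extrema_impulses`, the `noise_level` parameter, and the choice to store valley magnitudes as positive values are assumptions.

```python
def extrema_impulses(signal, noise_level, factor=1.5):
    """Return (peaks, valleys): impulse trains that are nonzero only at
    local maxima / minima whose magnitude exceeds factor * noise_level.

    Illustrative only: the six-channel definition is not reproduced.
    """
    threshold = factor * noise_level
    peaks = [0.0] * len(signal)
    valleys = [0.0] * len(signal)
    for i in range(1, len(signal) - 1):
        prev, cur, nxt = signal[i - 1], signal[i], signal[i + 1]
        if cur > prev and cur >= nxt and cur > threshold:
            peaks[i] = cur            # local maximum above the threshold
        elif cur < prev and cur <= nxt and -cur > threshold:
            valleys[i] = -cur         # local minimum, stored as a magnitude
    return peaks, valleys


peaks, valleys = extrema_impulses([0, 3, 0, -4, 0, 0.5, 0], noise_level=1.0)
```

In this example the small bump of 0.5 is rejected because it lies below the threshold of 1.5.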

The pitch selector (206) of FIG. 2 selects pitch according to the pitch selection rules shown in FIG. 5. The rules consist of selections based on positions where impulse train channels 1 to 3 coincide and selections based on positions where channels 4 to 6 coincide. A position is selected as a pitch point in the following three cases.

<Selection Rule 1>

When an impulse of channel 2 exists at the current data position, at least one impulse of channel 1 or channel 3 exists, and three or more impulses of channels 4 to 6 exist between the previously selected pitch point and the current position, the current position is selected as the pitch point.

<Selection Rule 2>

When impulses of channels 4 to 6 all appear at the current data position, an impulse of channel 2 exists between the previous pitch point and the current position, and two or more impulses of channels 4 to 6 exist between that channel 2 impulse and the previously selected pitch point, the position of the channel 2 impulse is selected as the pitch point.

<Selection Rule 3>

When an impulse of channel 4 exists at the current data position, at least one impulse of channel 5 or channel 6 exists, a pitch point candidate exists between the previously selected pitch point and the current position, and two or more impulses of channels 4 to 6 exist between that pitch point candidate and the previously selected pitch point, the position of the pitch point candidate is selected as the pitch point.
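Selection Rule 1 can be written as a simple predicate. The sketch below is illustrative only: the data layout (a dictionary mapping channel numbers 1 to 6 to equal-length impulse lists), the function names, and the reading of "between" as strictly between the two positions are assumptions.

```python
def count_ch4_to_6(channels, start, end):
    """Count nonzero impulses of channels 4-6 strictly between start and end."""
    return sum(
        1
        for ch in (4, 5, 6)
        for i in range(start + 1, end)
        if channels[ch][i] != 0
    )


def rule1_selects(channels, prev_pitch, pos):
    """True when Selection Rule 1 would pick `pos` as the pitch point."""
    has_ch2 = channels[2][pos] != 0
    has_ch1_or_ch3 = channels[1][pos] != 0 or channels[3][pos] != 0
    enough = count_ch4_to_6(channels, prev_pitch, pos) >= 3
    return has_ch2 and has_ch1_or_ch3 and enough


# Hypothetical six-channel impulse trains of length 6.
channels = {ch: [0] * 6 for ch in range(1, 7)}
channels[1][5] = 1
channels[2][5] = 1
channels[4][2] = 1
channels[5][2] = 1
channels[6][3] = 1
selected = rule1_selects(channels, prev_pitch=0, pos=5)
```

Rules 2 and 3 follow the same pattern, with the selected position being an earlier channel 2 impulse or pitch point candidate rather than the current position.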

As shown in FIG. 5, the procedure for applying these pitch selection rules receives the current data and checks whether it is the last data; if it is, the pitch selection procedure ends (S1).

If it is not the last data, it is determined whether the number of peak impulses, including channel 2, is two or more (S2). If yes, it is determined whether three or more valley impulses exist between the previously selected pitch point and the current position (S3). If three or more exist, the current position is selected as the pitch point (S4).

If the result of step S2 is no, it is determined whether there are three valley impulses (S5). If yes, it is determined whether a nonzero channel 1 impulse exists between the previously selected pitch point and the current position (S6); if yes, the position of that previous channel 1 impulse is selected as the pitch point (S8).

If the result of step S5 is no, it is determined whether there are two valley impulses and channel 4 is greater than zero (S9). If yes, it is determined whether a pitch point candidate exists between the previously selected pitch point and the current position and whether two or more valley impulses exist between them (S10); if yes, the position of the previous peak candidate is selected as the pitch point (S11).

The concept of the position compensator (208 in FIG. 2) is shown in FIG. 6.

In general, the peak position of the signal that has passed through the waveform simplification filter does not coincide with the peak position of the original input signal, so unless this difference is compensated, accurate results cannot be obtained when the pitch of the original input signal is extracted and analyzed. The position compensator (208) compensates for this difference so that the correct pitch position is found even when the waveform is extracted from the input signal. The position compensator (208) searches a data region centered on the pitch point obtained from the filtered signal, with the same size as the order of the arithmetic mean filter, and finds the maximum value in that region.
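The compensation step amounts to a local maximum search in the original signal. A minimal Python sketch, assuming a window of five samples (the order of the 5-point filter) centered on the detected position; the function name `compensate_position` and the clipping of the window at the signal edges are assumptions:

```python
def compensate_position(original, pitch_pos, order=5):
    """Return the index of the maximum of `original` within a window of
    `order` samples centered on `pitch_pos` (clipped at the signal edges)."""
    half = order // 2
    lo = max(0, pitch_pos - half)
    hi = min(len(original), pitch_pos + half + 1)
    return max(range(lo, hi), key=lambda i: original[i])


# The filtered signal placed the pitch point at index 5; the true peak is at 3.
corrected = compensate_position([0, 1, 2, 9, 3, 2, 1, 0], pitch_pos=5)
```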

The method of estimating the pitch period using impulse trains according to the present invention thus provides not only an accurate estimate of the pitch period but also a basis for analyzing one period of pitch data and extracting and comparing its features, which makes it effective for waveform analysis in speech recognition and for extracting speaker characteristics.

Referring again to FIG. 1, the pitch data analyzer (104) consists, as shown in FIG. 7, of a pitch extractor (702), a pitch data normalization unit (704), a feature extractor (706), a standard pitch database (708), and a feature vector comparator (710). It receives the pitch position information, normalizes the pitch data, extracts a feature vector, and compares the extracted feature vector and the pitch length information with the pre-registered standard pitch data to generate a voiced phoneme string. The structure of the standard pitch database used here is shown in Table 1.

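Table 1

Phoneme \ Length | 70 | 80 | 90 | ... | 340 | 350 | 360 | 370
아 (a)           |    |    |    |     |     |     |     |
에 (e)           |    |    |    |     |     |     |     |
이 (i)           |    |    |    |     |     |     |     |
오 (o)           |    |    |    |     |     |     |     |
우 (u)           |    |    |    |     |     |     |     |
으 (eu)          |    |    |    |     |     |     |     |
어 (eo)          |    |    |    |     |     |     |     |
...              |    |    |    |     |     |     |     |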

Referring to Table 1, the standard pitch database distinguishes the pitch length for each phoneme at predetermined frequency intervals (for example, 10 Hz) within a predetermined frequency band (for example, 70 Hz to 370 Hz).

Referring to FIG. 7, the pitch extractor (702) receives the pitch position information, extracts the pitch from the input signal, and outputs pitch length information to the feature vector comparator (710) and the pitch data normalization unit (704).

The pitch data normalization unit (704) receives the pitch length information and normalizes it as shown in FIG. 12. The pitch data normalization process is described below with reference to FIG. 12.

First, the slope A of the line joining two adjacent pitch points is computed over the whole monosyllable data. Next, a region of a predetermined duration (for example, about 10 ms) around each pitch point is searched for its minimum position, and the slope B of the line joining the two minima is computed. The mean (C) of the two slopes is then obtained according to Equation 1, and the normalized slope (A') is obtained by subtracting the mean slope (C) from the original slope A (A' = A - C). The n-th pitch data normalized in this way is given by Equation 2.
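The slope computation can be sketched in Python as follows. This is a hedged illustration: the mean C is assumed to be the arithmetic mean (A + B) / 2, points are assumed to be (sample index, amplitude) pairs, and the function names are hypothetical. Equation 2, which applies the result to each pitch sample, is not reproduced here.

```python
def slope(p, q):
    """Slope of the line through two (position, amplitude) points."""
    (t0, v0), (t1, v1) = p, q
    return (v1 - v0) / (t1 - t0)


def normalized_slope(peak_a, peak_b, min_a, min_b):
    """A' = A - C, with C assumed to be the arithmetic mean of A and B."""
    a = slope(peak_a, peak_b)   # slope A between adjacent pitch points
    b = slope(min_a, min_b)     # slope B between the corresponding minima
    c = (a + b) / 2             # mean slope C (assumed form of Equation 1)
    return a - c


a_prime = normalized_slope((0, 10), (100, 60), (20, -8), (120, 17))
```

Here A = 0.5 and B = 0.25, so C = 0.375 and A' = 0.125.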

In Equation 2, X'(n) is the normalized pitch data and X(n) is the original pitch data.

The feature extractor (706) extracts a feature vector from the normalized pitch data. The feature vector comparator (710) compares the feature vector received from the feature extractor (706) and the pitch length information received from the pitch extractor (702) with the reference values in the standard pitch database (708) to determine the voiced phoneme string.
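One way to picture the comparison is a nearest-template lookup restricted to the database column that matches the measured pitch. The Python sketch below is an assumption-laden illustration: the nested-dictionary layout, the function names, the rounding to the nearest 10 Hz column, and the use of Euclidean distance are all hypothetical, since the feature vector and distance measure are not specified here.

```python
import math


def pitch_bin(pitch_hz, low=70, high=370, step=10):
    """Quantize a pitch value to the nearest column of Table 1."""
    clamped = min(max(pitch_hz, low), high)
    return low + step * round((clamped - low) / step)


def match_phoneme(feature, pitch_hz, database):
    """Return the phoneme whose template in the matching pitch column is
    closest (Euclidean distance, an assumption) to `feature`."""
    column = pitch_bin(pitch_hz)
    candidates = {
        phoneme: templates[column]
        for phoneme, templates in database.items()
        if column in templates
    }
    return min(candidates, key=lambda ph: math.dist(feature, candidates[ph]))


# Hypothetical two-phoneme database with a single 120 Hz column.
database = {"아": {120: [1.0, 0.0]}, "이": {120: [0.0, 1.0]}}
best = match_phoneme([0.9, 0.2], pitch_hz=118, database=database)
```

Restricting the search to one column keeps the number of comparisons per pitch period equal to the number of phonemes.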

Referring to FIG. 1, the unvoiced sound identifier (106) consists, as shown in FIG. 8, of an unvoiced sound classifier (802), an unvoiced feature extractor (804), an unvoiced standard database (806), and an unvoiced feature comparator (808). It receives the pitch position information and the speech signal, extracts the signal for a predetermined time (for example, 125 msec) before the start of the pitch section and designates it the "unvoiced section", analyzes the speech signal in that section to classify the unvoiced sound broadly as a fricative, plosive, or nasal, extracts an unvoiced feature vector, and compares it with the unvoiced standard data to output an unvoiced phoneme string.

Referring to FIG. 8, the unvoiced sound classifier (802) receives the speech signal and the pitch position information and extracts the unvoiced signal from the unvoiced section; the unvoiced feature extractor (804) analyzes the speech signal of the unvoiced section, classifies the unvoiced sound as a fricative, plosive, nasal, or the like, and extracts an unvoiced feature vector; and the unvoiced feature comparator (808) compares the unvoiced feature vector with the unvoiced standard data to determine the unvoiced phoneme string.

Referring again to FIG. 1, the silent section searcher (108) consists, as shown in FIG. 9, of a zero-crossing rate measuring unit (902), an average sound pressure measuring unit (904), and a silent section extraction unit (906). It receives the speech signal, measures the zero-crossing rate and the average amplitude, and designates a silent section wherever both the average amplitude and the zero-crossing rate are below their reference values.

Referring to FIG. 9, the zero-crossing rate measuring unit (902) measures the zero-crossing rate of the input signal, the average sound pressure measuring unit (904) measures the average sound pressure of the input signal, and the silent section extraction unit (906) receives the zero-crossing rate, the average sound pressure, and the pitch position information and outputs the silent section information.
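The silence decision can be sketched per frame in Python. The frame-based processing, the function names, and the threshold values in the example are assumptions for illustration; the description only states that a section is silent when both the average amplitude and the zero-crossing rate fall below reference values.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (frame needs >= 2 samples)."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def average_amplitude(frame):
    """Mean absolute sample value of the frame."""
    return sum(abs(x) for x in frame) / len(frame)


def is_silent(frame, zcr_limit, amp_limit):
    """Silent only when BOTH measures are below their reference values."""
    return zero_crossing_rate(frame) < zcr_limit and average_amplitude(frame) < amp_limit


quiet = [0.01, 0.02, 0.01, 0.02, 0.01]
loud = [0.5, -0.5, 0.5, -0.5, 0.5]
quiet_is_silent = is_silent(quiet, zcr_limit=0.1, amp_limit=0.05)
loud_is_silent = is_silent(loud, zcr_limit=0.1, amp_limit=0.05)
```

Requiring both conditions keeps low-amplitude but noisy frames, such as weak fricatives with a high zero-crossing rate, from being marked as silence.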

Referring again to FIG. 1, the syllable separator (110) consists, as shown in FIG. 10, of a string information synchronization unit (1002) and a syllable division unit (1004). It receives the voiced phoneme string, the unvoiced phoneme string, and the silent section information and sets the monosyllable boundaries. Within each boundary, the most frequent voiced phoneme is taken as the medial (jungseong), the unvoiced phoneme character obtained from the unvoiced section preceding the pitch section is taken as the initial (choseong), and the unvoiced phoneme character obtained from the weak pitch section or silent section following the pitch section is taken as the final (jongseong); these are then combined to produce one phonetic character.

Referring to FIG. 10, the string information synchronization unit (1002) receives the voiced phoneme string, the unvoiced phoneme string, and the silent section information and separates them into initial, medial, and final sounds, and the syllable division unit (1004) outputs the phonetic string according to the syllable division rules.
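The combination of initial, medial, and final sounds can be illustrated with the standard Unicode arithmetic for precomposed Hangul syllables. In the Python sketch below, the function name `compose_syllable` and the representation of the per-pitch-period voiced results as a list of vowel jamo are assumptions; the majority vote implements the "most frequent voiced phoneme" step described above.

```python
from collections import Counter

# Jamo in Unicode composition order (19 initials, 21 medials, 27 finals + none).
INITIALS = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"
MEDIALS = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"
FINALS = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")


def compose_syllable(initial, voiced_labels, final=""):
    """Pick the most frequent voiced label as the medial, then compose one
    Hangul syllable from initial + medial + final."""
    medial = Counter(voiced_labels).most_common(1)[0][0]
    code = (
        0xAC00
        + (INITIALS.index(initial) * 21 + MEDIALS.index(medial)) * 28
        + FINALS.index(final)
    )
    return chr(code)


# Per-period voiced results with one misrecognized period.
syllable = compose_syllable("ㄱ", ["ㅏ", "ㅏ", "ㅓ", "ㅏ"], "ㅇ")
```

Because the medial is chosen by majority over all pitch periods in the syllable, an occasional misrecognized period does not change the result.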

In FIG. 1, the language analyzer (112) consists, as shown in FIG. 11, of a language/grammar database (1102) and a language model unit (1104). It receives the phonetic string obtained from the syllable separator (110), compares it with the word database, and applies grammar rules to generate a standard-language string.

FIG. 13 is a diagram showing an example of a speech recognition process according to the present invention.

Referring to FIG. 13, when the monosyllable "강" (gang) is uttered, the speech waveform in the time domain appears divided into a silent section, an unvoiced section (initial section), a pitch section (medial section), a weak pitch section (final section), and a silent section. According to the present invention, the unvoiced phoneme "ㄱ" is recognized by waveform analysis of the unvoiced section, and the voiced phonemes in the pitch section are analyzed to detect multiple "ㅏ" features. The unvoiced phoneme "ㅇ" is then detected in the weak pitch section to define the phoneme string. Finally, the phoneme characters are combined according to the syllable division rules to recognize the word "강".

As described above, according to the present invention, the pitch of the speech signal is detected and phonemes are identified by pattern matching on each pitch period, so the database and the computational load are small and speech can be recognized accurately even by a computer of relatively small capacity. The present invention can therefore be widely used as the speech recognition (STT) means of portable devices such as mobile phones and PDAs. In addition, because recognition is performed on a phoneme basis, there is no limit on the size of the recognizable vocabulary.

Claims

A pitch detector for receiving a voice signal and detecting a pitch to output pitch position information;

A pitch data analyzer configured to receive the pitch position information from the pitch detector, normalize the pitch data, extract feature vectors, and compare them with pre-registered standard pitch data to generate a voiced phoneme string;

A silent section searcher which receives a voice signal and detects a silent section and outputs silent section information;

An unvoiced sound discriminator that receives the pitch position information and the silent section information to determine an unvoiced section, and receives the voice signal to output an unvoiced phoneme string;

A syllable separator that receives the voiced phoneme string, the unvoiced phoneme string, and the silent section information to generate single syllables corresponding to phonetic characters; and

A language analyzer configured to receive the single syllables from the syllable separator and apply grammar rules to generate a standard language string.

The speech recognition system of claim 1, wherein the pitch detector comprises:

A waveform simplification filter that receives a voice signal and simplifies the waveform;

An impulse train generator for generating an impulse train of a predetermined channel based on an output of the waveform simplification filter;

A pitch selector detecting a pitch position by applying a predetermined pitch selection rule; and

A position compensator for compensating the pitch position detected by the pitch selector by the position difference between the original input signal and the simplified signal.

The speech recognition system of claim 2, wherein the waveform simplification filter is implemented with an arithmetic mean filter.
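As a rough illustration of an arithmetic mean filter, the following Python sketch implements a causal moving average. The window length and the function names are hypothetical. The helper `filter_delay` reflects a general property of causal moving averages (the output lags the input by about half the window); treating that lag as the position difference handled by the position compensator of claim 2 is an assumption, not something the claims state.

```python
def arithmetic_mean_filter(samples, window=5):
    """Causal moving average: each output is the mean of up to the last
    `window` input samples, which smooths fine detail in the waveform."""
    out = []
    running = 0.0
    for i, x in enumerate(samples):
        running += x
        if i >= window:
            running -= samples[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


def filter_delay(window=5):
    """Approximate lag, in samples, introduced by the causal moving average."""
    return (window - 1) // 2


smoothed = arithmetic_mean_filter([1, 2, 3, 4, 5], window=3)
print(smoothed, filter_delay(3))
```

A pitch position found on the smoothed signal could then be shifted back by `filter_delay(window)` samples to line up with the original waveform.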

The voice recognition system of claim 2, wherein the pitch selector selects a pitch point based on a position at which the impulse train channels 1 to 3 match, or selects a pitch point based on a position at which the channels 4 to 6 match.

The speech recognition system of claim 1, wherein the pitch data analyzer comprises:

A pitch extractor for receiving the pitch position information and extracting pitch periods from the input signal to output pitch length information; A pitch data normalizing unit configured to receive the pitch length information and normalize the pitch data; A feature extractor for extracting feature vectors from the normalized pitch data; A standard pitch database storing predefined standard pitch data; and A feature vector comparator configured to determine a voiced phoneme string by comparing the feature vector input from the feature extractor and the pitch length information input from the pitch extractor with reference values of the standard pitch database.

6. The speech recognition system of claim 5, wherein the standard pitch database distinguishes pitch lengths by phonemes in predetermined frequency intervals in a predetermined frequency band.
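The comparison against the standard pitch database can be sketched as a nearest-template lookup. Everything concrete in this Python sketch is an assumption made for illustration: the database layout (a list of phoneme, reference pitch length, reference feature vector), the Euclidean distance, the `length_tolerance` parameter, and the toy entries. The claims do not specify the feature vector or the distance measure.

```python
import math


def nearest_phoneme(feature, pitch_length, database, length_tolerance=5):
    """Return the phoneme whose reference feature vector is closest to
    `feature`, considering only entries with a similar pitch length."""
    best_label, best_dist = None, math.inf
    for phoneme, ref_length, ref_feature in database:
        if abs(ref_length - pitch_length) > length_tolerance:
            continue  # entry belongs to a different pitch-length class
        dist = math.dist(feature, ref_feature)
        if dist < best_dist:
            best_label, best_dist = phoneme, dist
    return best_label


# Toy database with made-up values, for illustration only.
toy_database = [
    ("ㅏ", 80, [1.0, 0.0]),
    ("ㅓ", 80, [0.0, 1.0]),
    ("ㅏ", 120, [0.9, 0.1]),
]
print(nearest_phoneme([0.8, 0.1], 82, toy_database))
```

Filtering by pitch length first keeps the number of distance computations small, which is in line with the stated goal of a small database and low computational load.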

The speech recognition system of claim 5, wherein the pitch data normalization unit

finds the slope A of the line connecting two adjacent pitch points, searches a predetermined time range around each pitch point to find a minimum point, finds the slope B of the line connecting the two minimum points, and then subtracts the mean slope C from the original slope A to obtain the normalized slope A' (A' = A - C),

and performs a normalization operation on the nth pitch data X(n).

The speech recognition system of claim 1, wherein the silent section searcher comprises:

A zero crossing rate measuring unit that measures the zero crossing rate of the input signal; An average sound pressure measuring unit that measures the average sound pressure of the input signal; and A silent section extraction unit that receives the zero crossing rate, the average sound pressure, and the pitch position information and outputs silent section information.
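The two measurements named in this claim are standard short-time features, and a minimal Python sketch of them follows. The decision rule in `classify_frame`, both thresholds, and the use of mean absolute amplitude as a stand-in for average sound pressure are assumptions; the claim only says that the silent section extraction unit receives the zero crossing rate, the average sound pressure, and the pitch position information.

```python
def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / max(len(frame) - 1, 1)


def average_level(frame):
    """Mean absolute amplitude, used here as a proxy for average sound pressure."""
    return sum(abs(x) for x in frame) / len(frame)


def classify_frame(frame, has_pitch, level_threshold=0.02, zcr_threshold=0.3):
    """Hypothetical rule: frames containing a pitch point are voiced; quiet
    frames with few zero crossings are silent; everything else is unvoiced."""
    if has_pitch:
        return "voiced"
    if average_level(frame) < level_threshold and zero_crossing_rate(frame) < zcr_threshold:
        return "silent"
    return "unvoiced"


print(classify_frame([0.0] * 8, has_pitch=False))
print(classify_frame([0.5, -0.5] * 4, has_pitch=False))
```

In practice the thresholds would have to be tuned to the recording conditions; the values above are placeholders.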

The speech recognition system of claim 1, wherein the unvoiced sound discriminator comprises:

An unvoiced sound classifier that receives the voice signal and the pitch position information and extracts the unvoiced signal from the unvoiced section; An unvoiced feature extractor that analyzes the voice signal of the unvoiced section, classifies unvoiced sounds into fricatives, plosives, nasals, and the like, and extracts unvoiced feature vectors; and An unvoiced feature comparator that compares the unvoiced feature vectors with unvoiced standard data to determine an unvoiced phoneme string.

An impulse train generator for generating an impulse train of a predetermined channel based on the output of the waveform simplification filter;

11. The pitch detector of claim 10, wherein the waveform simplification filter is implemented as an arithmetic mean filter.

The pitch detector of claim 10, wherein the pitch selector,

When the impulse train generator generates an impulse train of 6 channels,

Selection Rule 1: If an impulse of channel 2 exists at the current data position, at least one impulse among channels 1 and 3 exists, and at least three impulses of channels 4 to 6 exist between the previously selected pitch point and the current position, select the current position as the pitch point,

Selection Rule 2: Impulses of channels 4 to 6 appear at the current data position, an impulse of channel 2 exists between the previous pitch point and the current position, and channels 4 to 6 between the channel 2 impulse and the previously selected pitch point. If there are two or more impulses, select the position of the channel 2 impulse as the pitch point,

Selection Rule 3: If an impulse of channel 4 and at least one impulse among channels 5 and 6 exist at the current data position, a pitch point candidate exists between the previously selected pitch point and the current position, and two or more impulses of channels 4 to 6 exist between the pitch point candidate and the previously selected pitch point, select the pitch point candidate as the pitch point.
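Selection Rule 1 can be expressed as a small predicate over the six impulse trains. The Python sketch below is one reading of the rule, not a definitive implementation: it assumes the impulse trains are given as a mapping from channel number to a set of sample positions, that the channel 1 or channel 3 impulse must coincide with the current position, and that "between" excludes both end points. The function name `rule1_selects` is hypothetical.

```python
def rule1_selects(impulses, position, previous_pitch):
    """Return True if Selection Rule 1 picks `position` as a pitch point.

    impulses: dict mapping channel number (1..6) to a set of sample positions.
    """
    has_ch2 = position in impulses[2]
    # Assumed reading: channel 1 or channel 3 also fires at this position.
    has_ch1_or_ch3 = position in impulses[1] or position in impulses[3]
    # Count channel 4 to 6 impulses strictly between the two positions.
    support = sum(
        1
        for channel in (4, 5, 6)
        for p in impulses[channel]
        if previous_pitch < p < position
    )
    return has_ch2 and has_ch1_or_ch3 and support >= 3


example = {1: {100}, 2: {100}, 3: set(), 4: {40}, 5: {60}, 6: {80}}
print(rule1_selects(example, position=100, previous_pitch=0))
```

Rules 2 and 3 could be written in the same style, with an extra search for the channel 2 impulse or the pitch point candidate that lies between the previous pitch point and the current position.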

A speech-to-text (STT) conversion method of receiving a voice signal and generating a character string corresponding to the voice signal, the method comprising:

Detecting a pitch from the input voice signal;

Analyzing the detected pitch position information and a voice signal to set a pitch section, a silent section, and an unvoiced section;

Discriminating voiced phonemes by comparing a one-period pitch pattern in the pitch section with predetermined standard data;

Discriminating unvoiced phonemes by comparing one period pitch pattern in the unvoiced sound interval with predetermined standard data; and

Combining the determined voiced phonemes and unvoiced phonemes according to a predetermined rule to generate single syllables.

15. The voice recognition method of claim 13, further comprising: setting a weak pitch section after the pitch section, and determining, from the weak pitch section, a voiced consonant occurring as the final consonant.