KR100304788B1

KR100304788B1 - Method for telephone number information using continuous speech recognition

Info

Publication number: KR100304788B1
Application number: KR1019990022546A
Authority: KR
Inventors: 고한석; 김우일; 김태윤; 강선미
Original assignee: 채문식; 학교법인고려중앙학원
Priority date: 1999-06-16
Filing date: 1999-06-16
Publication date: 2001-11-01
Also published as: KR20010002646A

Abstract

The present invention relates to a telephone number guidance method using continuous speech recognition. A sentence structure that users are expected to utter when requesting a telephone number is defined, and the speech data needed for training is built from a plurality of the sentences that this structure can generate. This leads users to speak more naturally, and reliable recognition based on Korean continuous speech recognition makes the guidance accurate. When a user dials and the telephone number guidance system goes off-hook, the system outputs an introductory announcement and requests spoken input of the desired department and professor name. When the user speaks the department and professor name, the system recognizes the continuous speech by comparing the input with a model trained using a word network that encodes the expected sentence structure. The recognition result is then passed to the telephone number guidance unit, which is controlled so that the telephone number of the professor in that department is announced.

Description

Method for telephone number information using continuous speech recognition

The present invention relates to telephone number guidance using continuous speech recognition, and more particularly to a telephone number guidance method in which a sentence structure that users are expected to utter when requesting a number is defined, the speech data needed for training is built from a plurality of the sentences that this structure can generate so that users are led to speak more naturally, and reliable recognition based on Korean continuous speech recognition makes the guidance accurate.

If the interface between humans and machines could be realized through natural spoken dialogue, its applications would spread across many fields.

Machines fundamentally exchange information through agreed-upon codes, so in a dialogue between a human and a machine the speech must be converted into a code the machine can understand. This process is called 'speech recognition'.

The ultimate goal of speech recognition is to realize, between humans and machines, the kind of natural dialogue that takes place between people.

There are many speech recognition methods; the most widely used include the HMM (Hidden Markov Model), DTW (Dynamic Time Warping), and recognition by artificial neural networks.

The HMM is a recognition method that exploits the statistical properties of speech. Through statistical processing it can capture the varied pronunciation patterns of many speakers, and it models the temporal characteristics of speech well. However, a considerable amount of training data is needed to satisfy the statistical assumptions, and introducing allophones to absorb variation caused by coarticulation increases the complexity of training.

DTW determines recognition by comparison with reference patterns, so it rests on a relatively simple principle. To recognize speech, however, it must attempt to match each reference pattern through a time-warping process, so computation and memory use grow sharply as the number of reference patterns increases.

A neural network learns by a relatively simple principle, so it needs no complex algorithm, can run in real time at recognition, and is easy to implement in hardware. On the other hand, it has difficulty reflecting the temporal characteristics of the speech signal, and its training time is long.

Among conventional telephone number guidance systems that apply these recognition methods, there is an 'automatic voice connection system': when a customer calls and says the name of the business or service to be reached, the system recognizes the customer's speech and automatically connects the call to the business. The speech recognition technology used in this system recognizes a predetermined vocabulary (about 150 words) in a speaker-independent manner.

In other words, when the caller pronounces one word from the list (a list of about 150 recognizable service words, such as flower shop, flower delivery, florist, flower store, real estate, bokdeokbang (property agency), licensed realtor, moving goods, express, moving, packing and moving, moving center, Chinese cuisine, Chinese restaurant, Chinese food, jjajangmyeon restaurant, jjajangmyeon, jokbal, and jokbal restaurant), the word is recognized and the call is connected.

However, such a system relies on isolated-word recognition. With isolated-word recognition, it is difficult to recognize the input when anything other than the expected words is spoken. The system can therefore be used only in an environment that places heavy restrictions on the user.

Next, there is a speech recognition system called the 'stock information system'. It can serve 120 users simultaneously, and its software (S/W) and hardware (H/W) are separated so that hardware changes remain small even when the software version is updated. Instead of using two DSPs (digital signal processors) per recognition channel, the feature-extraction DSP and the search DSP were redesigned in a suitable ratio so that 1.** DSPs are used per channel. The system was also designed to operate when the buttons of a touch-tone telephone, the input method used in telephone information systems, are pressed; if a button is pressed at the same time as speech is input, the button input takes priority. Because the number of listed companies on the stock market changes daily, the recognition vocabulary is loaded into the system automatically at a fixed time every day and the vocabulary is updated. The recognition technology used is isolated-word recognition built on continuous speech recognition technology.

However, although the speech recognition method in this stock information system uses continuous speech recognition technology, it recognizes only keywords and cannot recognize continuous speech such as whole sentences.

Accordingly, an object of the present invention is to solve the problems arising in conventional service apparatuses and methods that use speech recognition, and to provide a speech recognition method that supports continuous speech recognition and can also handle Korean natural language.

More specifically, the object is to provide a telephone number guidance method using continuous speech recognition in which a sentence structure that users are expected to utter when requesting a number is defined, the speech data needed for training is built from a plurality of the sentences that this structure can generate so that users are led to speak more naturally, and reliable recognition based on Korean continuous speech recognition makes the guidance accurate.

To achieve the above object, the present invention provides,

in a method of telephone number guidance through speech recognition, a method comprising:

when a user dials and the telephone number guidance system goes off-hook, outputting a system introduction announcement and sending an announcement requesting spoken input of the desired department and professor name;

after the request for spoken input, when the user inputs continuous speech containing the department and the professor's name, recognizing the continuous speech by comparing the input with a training model trained on sentences predicted and set in advance;

passing the recognition result for the input speech to a telephone number guidance unit and controlling it so that the telephone number of the professor in that department is announced; and

sending a closing announcement after the telephone number of the professor has been announced.

The method is further characterized in that, once spoken input of the desired department and professor name has been requested, if no speech is input within a predetermined time, the following steps are performed: checking whether a set time has elapsed; if the set time has not elapsed, re-sending the announcement requesting spoken input of the desired department and professor name; and, if the set time elapses with no speech input, sending the closing announcement.

The speech recognition step is characterized by comprising: setting a sentence structure that the user is expected to utter; setting the sentences that can be generated from that structure; building the speech data needed for training from 10 sample sentences among the set sentences; storing the built speech data in a training model unit as a training model; and, when speech is input by the user, recognizing the continuous speech by comparing the input speech with the stored training model.

FIG. 1 is a block diagram of a continuous speech recognition system to which the present invention is applied;

FIG. 2 is a word network for setting the sentence structure the user is expected to utter;

FIG. 3 is a table of the sample sentences set for speech data collection;

FIG. 4 is a flowchart of the telephone number guidance method using continuous speech recognition according to the present invention;

FIG. 5 is a flowchart showing the continuous speech recognition process of FIG. 4 in more detail;

FIG. 6 is a flowchart of the endpoint detection process in the present invention;

FIG. 7 is an explanatory diagram of the method of obtaining the zero-crossing rate in the present invention;

FIG. 8 is an explanatory diagram of start-point and end-point detection by energy;

FIG. 9 is an explanatory diagram of the use and overlap of windows;

FIG. 10 is a diagram of the scale of perceived pitch expressed as a function of physical frequency;

FIG. 11 is an explanatory diagram of the mel-scale filter bank;

FIG. 12 is a plot of a Gaussian mixture probability distribution function; and

FIG. 13 is a table of context-independent phoneme models.

<Explanation of reference numerals for main parts of the drawings>

10: analog/digital converter    20: endpoint detector

30: feature extractor    40: Viterbi decoder

50: training model unit    60: speech recognition unit

70: telephone number guidance unit

Hereinafter, a preferred embodiment of the present invention according to the above technical idea is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a continuous speech recognition system to which the present invention is applied.

As shown, the system comprises: an analog/digital converter (10) that samples the input analog speech at a predetermined sampling frequency and converts it into a digital signal; an endpoint detector (20) that extracts only the speech interval from the digital data output by the analog/digital converter (10); a feature extractor (30) that extracts feature vectors from the speech data passed by the endpoint detector (20); a Viterbi decoder (40) that performs Viterbi decoding on the feature vectors obtained by the feature extractor (30); a training model unit (50) that stores training sentence patterns, built by setting the sentence structure users are expected to utter and constructing the speech data needed for training from a predetermined number of the sentences that structure can generate; a speech recognition unit (60) that recognizes continuous speech by comparing the sentence patterns stored in the training model unit (50) with the input speech data obtained from the Viterbi decoder (40) and outputs the recognition result; and a telephone number guidance unit (70) that announces by voice the telephone number corresponding to the recognition result output by the speech recognition unit (60).

The telephone number guidance method of the present invention, which uses the continuous speech recognition system configured as above, is described in detail below with reference to FIGS. 2 to 13.

The telephone number guidance method of the present invention is intended for medium-scale organizations (for example a university, a particular company, or a government office); here, guidance of professors' telephone numbers at a university is described as the embodiment.

First, the sentence structure that users are expected to utter when using the system is set by means of a word network such as that of FIG. 2. The speech data needed for training is then built from 10 of the sentences that this structure can generate (the sample sentences set for speech data collection, shown in FIG. 3), and the speech data thus built is stored in the training model unit (50).
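The idea of expanding a word network into candidate sentences and picking 10 of them can be sketched as follows. This is only an illustration: the contents of FIG. 2 and FIG. 3 are not reproduced in this text, so the slots, the English placeholder words, the names `NETWORK`, `expand`, and `pick_training_sentences`, and the random selection are all assumptions rather than the patent's actual network or sampling procedure.

```python
import itertools
import random

# Hypothetical word network: each slot lists alternative words; "" means the slot may be skipped.
NETWORK = [
    ["", "uh"],                                      # optional filler (assumed)
    ["computer science", "electronics", "mathematics"],
    ["department"],
    ["professor"],
    ["Kim", "Lee", "Park"],                          # placeholder names
    ["", "please", "number please"],                 # optional ending (assumed)
]

def expand(network):
    """Enumerate every sentence the slot network can generate."""
    return [" ".join(w for w in path if w) for path in itertools.product(*network)]

def pick_training_sentences(sentences, k=10, seed=0):
    """Choose k sample sentences for speech data collection (random choice is an assumption)."""
    return random.Random(seed).sample(sentences, k)

all_sentences = expand(NETWORK)
samples = pick_training_sentences(all_sentences)
```

With these placeholder slots the network generates 54 sentences, of which 10 are kept for recording.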

In this state, when continuous speech is input through the speech input unit, it is processed, and the processed data is compared with the training model to recognize the input continuous speech.

That is, as shown in FIG. 4, when the user dials the system to obtain a telephone number and a ring signal arrives at the system (S10), the system sends a system introduction announcement (for example, 'This is an automatic telephone number guidance system') and requests input of the desired department and the professor's name (S20 to S30).

The system then checks for speech input (S40). If there is no speech input from the user within a set time (for example, 10 seconds), it sends the closing announcement and ends the speech recognition and telephone number guidance processes (S50, S90).

If, on the other hand, the user inputs speech, the input continuous speech is recognized through the continuous speech recognition process (S60).
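The call flow described above can be sketched as a small control function. This is a minimal sketch under stated assumptions: `handle_call`, `listen`, `recognize`, `directory`, and `play` are hypothetical names, the announcement labels are placeholders, and the handling of an unknown recognition result is not specified in the text.

```python
def handle_call(listen, recognize, directory, play, timeout_s=10.0):
    """Sketch of the flow in FIG. 4; all callables are supplied by the caller (hypothetical API).

    listen(timeout_s) returns captured audio, or None if nothing was spoken in time.
    recognize(audio) returns a key such as (department, professor).
    directory maps that key to a telephone number.
    play(text) outputs an announcement.
    """
    play("intro")                      # system introduction (S20)
    play("prompt")                     # ask for department and professor name (S30)
    audio = listen(timeout_s)          # check for speech input (S40)
    if audio is None:                  # no input within the set time
        play("closing")                # closing announcement (S50, S90)
        return None
    result = recognize(audio)          # continuous speech recognition (S60)
    number = directory.get(result)     # lookup by the guidance unit
    if number is not None:
        play(number)                   # announce the professor's number
    play("closing")
    return number
```

Passing the audio source and recognizer in as functions keeps the sketch independent of any particular telephony or recognition library.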

Here, continuous speech recognition is implemented by the system of FIG. 1, as described in more detail below.

First, the analog/digital converter (10) samples the input analog speech signal with an 8 kHz sampling clock and quantizes it to produce a digital speech signal.

Because the telephone line is band-limited to 300 to 3400 Hz, the signal is sampled at 8 kHz, described here as the minimum usable sampling frequency (it exceeds twice the 3400 Hz upper band edge), and converted into a digital signal.

The digital speech signal is input to the endpoint detector (20), which uses energy and zero-crossing-rate values to extract only the speech interval from the digital signal.

That is, endpoint detection is the process of finding, when a word has been input, the interval of the input signal that contains only speech. It is very important in many areas of speech processing and, particularly in isolated-word recognition, has a considerable effect on recognition performance. Because endpoint detection is performed in the stage that precedes signal transformation and processing of the speech signal, any inaccuracy in it can determine recognition performance. It also removes unnecessary signal information, which reduces the amount of computation.

The speech features used for endpoint detection include the zero-crossing rate, energy, and autocorrelation coefficients. Depending on how the endpoint detection part is connected to the recognition part, methods are classified as explicit, implicit, or hybrid.

The endpoint detection method applied in the present invention uses the zero-crossing rate and energy, and is an explicit method that can be placed independently in front of the recognition part.

FIG. 6 is a flowchart of the endpoint detection process.

As shown, the digitized speech signal is pre-emphasized and divided into 10 msec blocks. Pre-emphasis makes speech easier to distinguish from noise. The blocks are processed one at a time according to the current state: each block is judged to be speech or silence, and the state value is updated accordingly.

The first 10 blocks (100 msec) of input data are assumed to be noise from a silent interval, and the statistical properties of energy and zero-crossing rate over that interval are computed to set the reference values. The energy reference value is set to be proportional to the mean of the block energies, as in [Equation 1] to [Equation 4] below. Here the offset is a minimum value set to keep the reference value from becoming 0, and THEL keeps the reference value from becoming too large; its value was determined experimentally.

The zero-crossing rate originally means the number of times the sign of the signal changes between negative and positive about 0, but in this algorithm it is defined as follows in order to reduce the influence of noise. First, SilenceU and SilenceL, which are proportional to ThreshEL, are set so as to divide the signal values into four regions.

Only the number of changes of the kind shown in 2 is counted. After the zero-crossing rate of each block has been obtained in this way, the reference value of the zero-crossing rate is calculated as in [Equation 7] below. Here THZCR is an experimentally determined value that keeps the reference value from becoming too small.
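A noise-tolerant zero-crossing count of this general kind can be sketched as below. The exact counting rule is not recoverable from the text, so this sketch assumes one common variant: a crossing is counted only when the signal swings from above SilenceU to below SilenceL or back, so small fluctuations inside the band are ignored. The function and parameter names are hypothetical.

```python
def deadband_zero_crossings(block, silence_u, silence_l):
    """Count swings between the region above silence_u and the region below silence_l.

    Samples lying between silence_l and silence_u are treated as noise and never
    change the state (this hysteresis rule is an assumption, not the patent's formula).
    """
    count = 0
    state = 0  # +1: last excursion was above the band, -1: below, 0: none yet
    for x in block:
        if x > silence_u:
            if state == -1:
                count += 1
            state = 1
        elif x < silence_l:
            if state == 1:
                count += 1
            state = -1
    return count
```

For example, a block that alternates between +5 and -5 with thresholds of +1 and -1 yields one crossing per swing, while a block that stays within the band yields none.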

If speech does not start and silence continues, the reference values are recalculated at 0.5 second intervals using 10 blocks of data. The newly calculated value (Th2) is compared with the previous value (Th1), and a new reference value is set as in [Equation 8] below.

When the current state is noise, energy is compared first to find the start point of speech.

FIG. 8 is an explanatory diagram of start-point and end-point detection by energy.

As shown, the energies of the previous block and the current block are compared with the reference value in the backward direction, and the point at which the energy first reached ThreshEL or more, before exceeding ThreshEL, is taken as the energy-based start point of speech. Once the energy-based start point has been found, the zero-crossing rates of the preceding blocks are compared in the backward direction to find the point at which the reference value ThreshZCR is first exceeded. This point is taken as the start point of speech; the end point is obtained in the same way, and together they define one speech pulse interval.

When a single word is uttered, several speech pulses may follow one another, and some of the pulses found in this way may be noise. To separate them, the pulses are classified into the following three types using their length and energy.

1) Definite speech interval: 6 or more blocks with energy of ThreshBig or more,

2) Noise interval: block length of 3 or less,

3) Uncertain interval: all others.

An uncertain interval is classified as speech if definite speech occurs within 400 msec before or after it, and as noise otherwise. After a speech interval has been determined, if silence continues for 500 msec or more, the end point is declared.
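The three-way classification and the 400 msec merging rule above can be sketched as follows. This is a minimal sketch: the representation of a pulse as a list of block energies or as a `(start_block, end_block, label)` tuple, and the names `classify_pulse` and `resolve_uncertain`, are assumptions made for illustration; the 10 msec block length and the thresholds of 6 blocks, 3 blocks, and 400 msec come from the text.

```python
BLOCK_MS = 10  # block length stated in the description

def classify_pulse(energies, thresh_big):
    """Label one pulse from its per-block energies."""
    loud_blocks = sum(1 for e in energies if e >= thresh_big)
    if loud_blocks >= 6:
        return "speech"       # definite speech interval
    if len(energies) <= 3:
        return "noise"        # too short to be speech
    return "uncertain"

def resolve_uncertain(pulses, max_gap_ms=400):
    """pulses: list of (start_block, end_block, label) with end_block exclusive.

    An uncertain pulse becomes speech if a definite speech pulse lies within
    max_gap_ms before or after it, and noise otherwise.
    """
    speech = [(s, e) for s, e, label in pulses if label == "speech"]
    resolved = []
    for s, e, label in pulses:
        if label == "uncertain":
            near = any(max(s2 - e, s - e2, 0) * BLOCK_MS <= max_gap_ms
                       for s2, e2 in speech)
            label = "speech" if near else "noise"
        resolved.append((s, e, label))
    return resolved
```

A pulse 150 msec away from definite speech is therefore kept, while one 700 msec away is discarded as noise.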

Pre-emphasis is used to remove factors other than the vocal tract that affect the resonance frequencies and bandwidths; it gives good results when modeling spectral components that have low amplitude at high frequency. It also has the advantage that, by removing the characteristics of the sound source and of radiation, the characteristics of the vocal tract itself are well reflected. [Equation 9] below is the commonly used transfer function for pre-emphasis.
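[Equation 9] is not reproduced in this text. A widely used first-order pre-emphasis filter has the transfer function H(z) = 1 - a z^{-1}, that is, y[n] = x[n] - a x[n-1]; the sketch below assumes that form. The coefficient value 0.97 and the function name `pre_emphasize` are illustrative assumptions, not values taken from the patent.

```python
def pre_emphasize(samples, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1], taking x[-1] as 0 (alpha is an assumed value)."""
    out = []
    prev = 0.0
    for x in samples:
        out.append(x - alpha * prev)
        prev = x
    return out
```

A constant (zero-frequency) input is almost cancelled after the first sample, while rapid sample-to-sample changes pass through, which is the high-frequency boost the description refers to.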

To analyze the speech signal, the signal must be assumed stationary over the analysis interval, so it is divided into short blocks. These blocks are called frames, and discontinuities in the data at the frame ends distort the frequency components. This effect can be reduced by applying a window that attenuates the amplitude of the data at both ends of the frame. Rectangular, triangular, Hanning, and Hamming windows are commonly used; the present invention uses a Hamming window. The windowed speech signal is expressed as follows.

s(m+n): speech signal, w(m): window signal

Here, the Hamming window function is as follows.

A Hamming window 25 msec long (the analysis interval; alternatively 20 msec) is applied to the pre-emphasized speech while being shifted in steps of 10 msec (the shift interval). FIG. 9 shows the use and overlap of the windows.
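The framing and windowing step can be sketched as follows. The window formula used here, w(m) = 0.54 - 0.46 cos(2πm/(N-1)), is the standard Hamming definition and is assumed to match the equation omitted from this text; each frame is formed as the product s(m+n)·w(m). At the 8 kHz sampling rate given earlier, 25 msec corresponds to 200 samples and 10 msec to 80 samples. The function names are hypothetical.

```python
import math

def hamming(n_len):
    """Standard Hamming window of n_len points."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_len - 1)) for m in range(n_len)]

def windowed_frames(samples, rate_hz=8000, win_ms=25, hop_ms=10):
    """Cut the signal into overlapping frames and multiply each by the window."""
    win = rate_hz * win_ms // 1000   # 200 samples at 8 kHz
    hop = rate_hz * hop_ms // 1000   # 80 samples at 8 kHz
    w = hamming(win)
    frames = []
    for start in range(0, len(samples) - win + 1, hop):
        frames.append([samples[start + m] * w[m] for m in range(win)])
    return frames
```

One second of audio (8000 samples) produces 98 frames with these settings, each overlapping its neighbor by 15 msec.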

The characteristics of speech are best represented by frequency analysis. Frequency analysis methods for speech feature extraction include the use of a filter bank, extraction of LPC (Linear Predictive Coding) coefficients, and extraction of cepstral coefficients. LPC coefficient extraction is the most common method; it treats speech production as a process described by a transfer function that models the human vocal tract. Cepstral coefficients are known to give better recognition performance than LPC coefficients. The present invention uses as its speech feature parameter the mel-cepstrum, a filter-bank method modeled on the human auditory system that is known to be robust to ambient noise and channel distortion in speech recognition.

Studies in cognitive psychology have shown that the frequency scale on which humans perceive a sound, whether a pure tone or a speech signal, is not linearly related to the actual frequency. These studies were carried out through experiments that define the subjective pitch of pure tones. The frequency scale that humans perceive subjectively is called the mel scale, to distinguish it from the physical frequency f expressed in Hz (for reference, the pitch of a 1 kHz tone 40 dB above the perceptual hearing threshold is defined as 1000 mel). Other subjective pitch values are obtained by measuring the frequency of the tone that is perceived as having twice or half the pitch of the 1 kHz tone. FIG. 10 shows the scale of subjective pitch as a function of actual frequency. In FIG. 10, the upper curve is the relation between subjective pitch and frequency on a linear scale, and the lower curve is the relation on a log scale.

Another subjective factor in perceiving the frequency of a signal is the critical band, a frequency bandwidth related to loudness. For a signal with constant sound pressure, the bandwidth of the signal can be increased, with the loudness of the band staying at a constant value, until the width of the critical band is reached, that is, until an increase in loudness is perceived. In other words, the human auditory system analyzes the audio spectrum on a nonlinear frequency scale. The mel-cepstrum uses this information: the Fourier-transformed speech signal is analyzed with a filter bank built so that its filters are equally spaced on the mel scale. The physical frequency and the mel frequency are related as follows.
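The relation itself is not reproduced in this text. A commonly used approximation is Mel(f) = 2595 log10(1 + f/700), which maps 1 kHz to approximately 1000 mel; the sketch below assumes that form, which may differ from the exact equation in the patent. The function names are hypothetical.

```python
import math

def hz_to_mel(f_hz):
    """Convert physical frequency in Hz to mel (assumed 2595*log10(1 + f/700) form)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)
```

The mapping is close to linear below about 1 kHz and compresses higher frequencies, which is why equal steps in mel correspond to progressively wider bands in Hz.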

That is, in extracting feature vectors for speech recognition, the frequency axis is warped to the mel scale so that the filter bank is distributed nonlinearly, reflecting the frequency characteristics that humans perceive subjectively. The feature vector obtained in this way is called the mel frequency spectral coefficients (MFSC). Fig. 11 shows a triangular filter bank distributed on the mel scale. The present invention uses a total of 24 triangular filters.
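The placement of 24 triangular filters on the mel scale can be sketched as follows. This is a minimal illustration, not the implementation of the filing: the mel formula (2595 log10(1 + f/700)), the 0 to 4000 Hz analysis band, and all function names are assumptions introduced here.

```python
import math

def hz_to_mel(f_hz):
    # Assumed mel formula; the original relation is not reproduced in the text.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_edges(num_filters=24, f_low=0.0, f_high=4000.0):
    """Return num_filters + 2 edge frequencies (Hz), equally spaced in mel."""
    m_low, m_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (m_high - m_low) / (num_filters + 1)
    return [mel_to_hz(m_low + i * step) for i in range(num_filters + 2)]

def triangular_weight(f_hz, left, center, right):
    """Weight of a triangular filter that peaks at 1.0 at its center frequency."""
    if f_hz <= left or f_hz >= right:
        return 0.0
    if f_hz <= center:
        return (f_hz - left) / (center - left)
    return (right - f_hz) / (right - center)

# Filter k uses edges[k], edges[k + 1], edges[k + 2] as left, center, right.
edges = mel_filter_edges()
```

Because the edges are equally spaced in mel, the filters become wider in Hz as frequency increases, which is the nonlinear distribution described above.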

The logarithm of the outputs of the mel-scale triangular critical-band filter bank is taken, and the result is transformed by the discrete cosine transform (DCT) to obtain the mel frequency cepstral coefficients (MFCC). In the present invention, 13 MFCCs are obtained by applying the DCT to the 24 log MFSCs. The DCT relation is as follows.
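The DCT relation is not reproduced in this text. A minimal sketch, assuming the common unnormalized DCT-II form c[k] = sum over j of m[j] cos(pi k (j + 0.5) / N) with N = 24 log MFSCs and k = 0, ..., 12, is shown below; the function name and the absence of a scaling factor are assumptions.

```python
import math

def dct_mfcc(log_mfsc, num_ceps=13):
    """Map N log filter-bank outputs to num_ceps cepstral coefficients (DCT-II)."""
    n = len(log_mfsc)
    return [
        sum(m * math.cos(math.pi * k * (j + 0.5) / n) for j, m in enumerate(log_mfsc))
        for k in range(num_ceps)
    ]
```

With this form, c[0] is the sum of the log filter-bank outputs, which is consistent with its later use as a log power term.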

Cepstral mean subtraction (CMS) is a widely known method for compensating for distortion introduced by a channel such as a telephone line. It is effective in improving speech recognition performance. CMS has the effect of high-pass filtering the cepstral coefficients.
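CMS can be sketched as below, assuming the mean is taken over all frames of one utterance (the text does not state the averaging window). A stationary channel adds a roughly constant offset to every cepstral frame, so subtracting the per-utterance mean removes it. The function name is hypothetical.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-utterance mean from every cepstral frame.

    frames: list of equal-length lists, one cepstral vector per time frame.
    """
    if not frames:
        return []
    dim = len(frames[0])
    n = len(frames)
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]
```

A constant offset added to every frame leaves the output unchanged, which is the channel-compensation property referred to above.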

From the channel-compensated MFCCs c_t[k], c_t[0] is taken as the log power and c_t[1], c_t[2], ..., c_t[12] as the cepstral coefficients, giving a total of four feature sets:

MFCC: c_t[1], c_t[2], ..., c_t[12] (12-dimensional vector)

First derivative of MFCC: dc_t[1], dc_t[2], ..., dc_t[12] (12-dimensional vector)

Second derivative of MFCC: ddc_t[1], ddc_t[2], ..., ddc_t[12] (12-dimensional vector)

First and second derivatives of log power: dc_t[0], ddc_t[0] (2-dimensional vector)
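The four feature sets can be assembled as in the following sketch. The text does not say how the derivatives are computed, so a simple central difference over neighboring frames is assumed here; the function names and dictionary keys are hypothetical.

```python
def delta(frames):
    """Central difference along time; edge frames reuse the nearest neighbor."""
    n = len(frames)
    out = []
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def four_streams(cepstra):
    """Split 13-dimensional frames (index 0 = log power) into four feature sets."""
    d1 = delta(cepstra)   # first derivative
    d2 = delta(d1)        # second derivative
    streams = []
    for c, d, dd in zip(cepstra, d1, d2):
        streams.append({
            "mfcc": c[1:13],          # c_t[1..12]
            "d_mfcc": d[1:13],        # dc_t[1..12]
            "dd_mfcc": dd[1:13],      # ddc_t[1..12]
            "d_power": [d[0], dd[0]], # dc_t[0], ddc_t[0]
        })
    return streams
```

Note that the log power c_t[0] itself is not part of any of the four sets; only its derivatives are used.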

Next, once the speech features have been extracted by the process described above, the Viterbi decoder (40) performs Viterbi decoding on the feature vectors.

This process computes the optimal state sequence.

The Viterbi algorithm finds the single state sequence that maximizes P(q|O,λ).

Here, q denotes the state.

This expression gives the highest probability, along a single state sequence, of being in state s_i at time t. By recursion, the probability at the next time t+1 is expressed as follows.
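The expressions referred to here are missing from this text. In the standard formulation of the Viterbi algorithm, which is assumed here, they read as follows, where a_{ij} is the transition probability from state s_i to state s_j and b_j(O) is the output probability of observation O in state s_j:

```latex
\begin{aligned}
\delta_t(i) &= \max_{q_1, \dots, q_{t-1}} P(q_1 \cdots q_{t-1},\; q_t = s_i,\; O_1 \cdots O_t \mid \lambda) \\
\delta_{t+1}(j) &= \Bigl[\max_{i}\, \delta_t(i)\, a_{ij}\Bigr]\, b_j(O_{t+1})
\end{aligned}
```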

Using the above expressions, the optimal state sequence is obtained by the following procedure.

1) Initialization

2) Recursion

3) Termination

4) Finding the optimal state sequence
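The equations for the four steps are not reproduced in this text. The sketch below follows the textbook Viterbi algorithm in the log domain and is an illustration under that assumption; the function name and argument layout are hypothetical.

```python
import math

def viterbi(log_pi, log_a, log_b):
    """Return (best log probability, best state path).

    log_pi[i]   : log initial probability of state i
    log_a[i][j] : log transition probability from state i to state j
    log_b[t][j] : log output probability of frame t in state j
    """
    n_states = len(log_pi)
    # 1) Initialization
    delta = [log_pi[i] + log_b[0][i] for i in range(n_states)]
    backptr = []
    # 2) Recursion
    for t in range(1, len(log_b)):
        new_delta, psi = [], []
        for j in range(n_states):
            best_i = max(range(n_states), key=lambda i: delta[i] + log_a[i][j])
            psi.append(best_i)
            new_delta.append(delta[best_i] + log_a[best_i][j] + log_b[t][j])
        delta = new_delta
        backptr.append(psi)
    # 3) Termination
    last = max(range(n_states), key=lambda i: delta[i])
    best = delta[last]
    # 4) Finding the optimal state sequence by backtracking
    path = [last]
    for psi in reversed(backptr):
        path.append(psi[path[-1]])
    path.reverse()
    return best, path
```

Working with log probabilities turns products into sums, which avoids underflow on long utterances; zero probabilities are represented by negative infinity.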

The words Viterbi-decoded in this way are stored in the training model unit (50) as reference patterns through HMM modeling and training.

The most common approach to speech recognition is pattern matching. It consists of a training process, in which features of speech patterns (words, phonemes) are extracted from speech to build reference patterns, and a recognition process, in which an unknown input utterance is compared with the stored reference patterns to find the most similar one.

The HMM algorithm used in the present invention is a pattern matching method that has been used extensively in speech recognition since the late 1970s. Because of its high recognition rate and fast processing speed, it is now widely used in large-scale speech recognition systems. The basic idea of the HMM algorithm is to assume that speech can be modeled by a Markov model: in training, the parameters of the Markov model are estimated to build reference Markov models, and in recognition, the reference Markov model most similar to the input is found.

A hidden Markov model is used as the Markov model in order to accommodate the wide variability of speech patterns. A hidden Markov model is a doubly stochastic process consisting of a stochastic process governing state selection and a stochastic process governing the output probability with which a speech pattern is generated in each state. That is, each feature of a speech pattern is expressed in terms of state selection probabilities and output probabilities. The term "hidden" means that the states are hidden inside the model, independently of the speech pattern. Because the HMM algorithm allows reference patterns to be defined over units shorter than a word, such as phonemes or syllables, and allows words and sentences to be given as input speech, it is mainly used in large-scale speech recognition systems.

An HMM (λ) is specified by the state transition probabilities (A), the state-dependent output probabilities (B), and the initial state probabilities (π). It is further assumed that state changes follow a Markov process and that the observed output symbols are mutually independent.

The variables used in the HMM are as follows.

(1) N: the number of states in the model

(2) M: the number of observation symbols per state

(3) State transition probabilities

(4) State-dependent output probabilities

(5) Initial state probabilities

The model is then written as λ = (A, B, π).
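The defining expressions for items (3) to (5) are missing from this text. In the conventional HMM notation, assumed here, with states s_1, ..., s_N, observation symbols v_1, ..., v_M, and q_t the state at time t, they are:

```latex
\begin{aligned}
A &= \{a_{ij}\}, & a_{ij} &= P(q_{t+1} = s_j \mid q_t = s_i), & 1 \le i, j \le N \\
B &= \{b_j(k)\}, & b_j(k) &= P(v_k \text{ at } t \mid q_t = s_j), & 1 \le j \le N,\ 1 \le k \le M \\
\pi &= \{\pi_i\}, & \pi_i &= P(q_1 = s_i), & 1 \le i \le N
\end{aligned}
```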

To apply HMMs to speech recognition, the following three problems must be solved.

1) Given a sequence of observation symbols, computing the probability P(O|λ) that the sequence is generated by the model.

This computation is required when the Baum-Welch algorithm is used for training, but not when the segmental K-means algorithm is used.
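For reference, P(O|λ) is conventionally computed with the forward algorithm, which sums over all state paths. The text does not describe this computation, so the sketch below is a textbook illustration with hypothetical names, not part of the described method.

```python
def forward_probability(pi, a, b):
    """Return P(O | lambda), summed over all state paths.

    pi[i]   : initial probability of state i
    a[i][j] : transition probability from state i to state j
    b[t][j] : output probability of frame t in state j
    """
    n = len(pi)
    alpha = [pi[i] * b[0][i] for i in range(n)]
    for t in range(1, len(b)):
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[t][j] for j in range(n)]
    return sum(alpha)
```

This unscaled form multiplies many small probabilities and can underflow on long observation sequences, so practical implementations use scaling or log arithmetic.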

2) Computing the optimal state sequence.

The second problem is to estimate the hidden state sequence of the HMM. It is solved with the Viterbi algorithm. In the present invention, the Viterbi algorithm is used in both training and recognition: in training it is needed to obtain the optimal state sequence, and in recognition it is used to score the input word against the candidate word models.

3) Re-estimating the model so as to maximize P(O|λ).

The third problem is to re-estimate the model so that it best matches the observed values; no analytic solution is known. The most widely used methods are Baum-Welch re-estimation and the segmental K-means method, and the present invention uses the segmental K-means method.

Meanwhile, the output probability (B) of an HMM has either a discrete probability distribution (DHMM) or a continuous probability distribution (CHMM, SCHMM). A DHMM partitions the acoustic space by vector quantization (VQ), which reduces the amount of computation but lowers the recognition rate because of distortion error. The present invention therefore uses the CHMM.

The CHMM represents the output probability distribution as a mixture of M Gaussians, expressed as follows.
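The expression is missing from this text. Reconstructed from the symbol definitions given in the next paragraph, and assuming the usual multivariate Gaussian density, the output probability of state s_i reads:

```latex
\begin{aligned}
b_i(O_t) &= \sum_{m=1}^{M} C_{im}\, N(O_t, \mu_{im}, \Sigma_{im}) \\
N(O_t, \mu_{im}, \Sigma_{im}) &= \frac{1}{(2\pi)^{K/2}\, |\Sigma_{im}|^{1/2}}
\exp\!\Bigl(-\tfrac{1}{2}\,(O_t - \mu_{im})^{\top}\, \Sigma_{im}^{-1}\, (O_t - \mu_{im})\Bigr)
\end{aligned}
```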

Here, M is the number of Gaussian mixture components in a state, C_im is the mixture weight of the m-th component, N(O_t, μ_im, Σ_im) is the Gaussian probability density function of the observation vector O_t, O_t is the observation vector, μ_im is the mean vector, Σ_im is the covariance matrix, and K is the dimension of the observation vector. Fig. 12 of the accompanying drawings is a waveform showing a Gaussian mixture probability distribution function.

When a CHMM is used, the number of parameters that must be estimated from the observations is a critical consideration. One way to reduce the number of parameters when the continuous distribution is represented by Gaussians is to use a diagonal covariance matrix, that is, to set all off-diagonal terms to 0. This reduces both the amount of computation and the number of parameters to be estimated.
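With a diagonal covariance matrix, the Gaussian density factorizes over dimensions, so no matrix inversion or determinant routine is needed. The following sketch evaluates the log of the mixture density under that assumption; the function names and the log-sum-exp step are choices made here, not details from the text.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log density of a Gaussian with diagonal covariance (var = diagonal terms)."""
    k = len(o)
    log_det = sum(math.log(v) for v in var)
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (k * math.log(2.0 * math.pi) + log_det + quad)

def log_mixture_density(o, weights, means, variances):
    """Log of sum_m weights[m] * N(o; means[m], diag(variances[m]))."""
    terms = [math.log(w) + log_gaussian_diag(o, m, v)
             for w, m, v in zip(weights, means, variances) if w > 0.0]
    peak = max(terms)  # log-sum-exp keeps the sum numerically stable
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

For a 12-dimensional vector, a full symmetric covariance matrix has 78 free entries per component, whereas the diagonal form has 12.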

Next, methods for estimating the HMM parameters under the maximum likelihood (ML) criterion include the Baum-Welch algorithm and the segmental K-means algorithm.

Given an initial model, the Baum-Welch algorithm finds the model that maximizes the likelihood of the observation sequence, whereas the segmental K-means algorithm finds the model that maximizes the joint likelihood of the observation sequence and the optimal state sequence.

That is, the Baum-Welch algorithm finds the λ that maximizes P(O|λ), whereas the segmental K-means algorithm finds the λ that maximizes max_s P(O,s|λ).

There are three reasons for using max_s P(O,s|λ). First, computing the likelihood P(O|λ) requires considering every state transition path, which takes a large amount of computation and time. Second, because the output probability (B) has a very large dynamic range, computing the likelihood over all state transition paths can run into numerical difficulties. Third, using max_s P(O,s|λ) for both modeling and decoding is more natural than the Baum-Welch approach, which uses P(O|λ) for modeling and max_s P(O,s|λ) for decoding.

Estimating the parameters with the segmental K-means algorithm amounts to finding the model that satisfies the following condition.
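The condition itself is missing from this text. Consistent with the preceding statement that the segmental K-means algorithm finds the λ that maximizes max_s P(O,s|λ), it can be written as:

```latex
\hat{\lambda} = \arg\max_{\lambda}\; \max_{s}\; P(O, s \mid \lambda)
```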

The EM algorithm is used to find it.

In the first step, given the model, the optimal state sequence that maximizes the likelihood of the observation sequence is found.

In the second step, a new model is estimated using the optimal state sequence obtained above.

When estimating the HMM parameters, the segmental K-means algorithm applies the k-means method to the Markov chain. The segmentation information is obtained through Viterbi decoding.

In practice, parameter estimation with the segmental K-means algorithm proceeds as follows.

1) Initialization

All training vectors are segmented uniformly into phoneme units and HMM states. The parameters are initialized by clustering the vectors segmented for each phoneme unit.

2) Segmentation

Using the CHMM parameters obtained in step 1 or step 3, the training vectors are re-segmented by phoneme and HMM state through Viterbi decoding. The transition probabilities are obtained from the segmentation information produced by Viterbi decoding.

Here, s^* is the optimal state sequence, and n_ij(s^*) is the number of transitions from state i to state j in the optimal state sequence.
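The transition probability expression is missing from this text. A natural estimate from the counts defined above, assumed here, is a_ij = n_ij(s^*) divided by the total number of transitions leaving state i. The sketch below computes it from one or more Viterbi state sequences; the function name is hypothetical.

```python
def estimate_transitions(state_sequences, n_states):
    """Estimate a[i][j] as n_ij / sum_j n_ij from optimal state sequences."""
    counts = [[0] * n_states for _ in range(n_states)]
    for seq in state_sequences:
        # Count each transition between consecutive frames.
        for i, j in zip(seq, seq[1:]):
            counts[i][j] += 1
    probs = []
    for row in counts:
        total = sum(row)
        # States with no outgoing transitions keep an all-zero row.
        probs.append([c / total if total else 0.0 for c in row])
    return probs
```

For the single sequence 0, 0, 1, 1, 1, 2 this gives a_00 = a_01 = 1/2, a_11 = 2/3, and a_12 = 1/3.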

3) Clustering and estimation

All training vectors belonging to each state of a phoneme model are divided into M clusters by vector quantization (VQ). The parameters are then estimated from the training vectors assigned to each cluster of state s_i.

Each of these parameters is obtained by the following equations.

Here, V_im is the set of vectors belonging to the m-th mixture of state s_i, and L_im is the number of vectors in V_im.
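The estimation equations are missing from this text. The sketch below assumes the usual sample estimates: the mixture weight is L_im divided by the total number of vectors in the state, the mean is the average of the vectors in V_im, and the diagonal variance is the average squared deviation from that mean. Names are hypothetical, and every cluster is assumed to be non-empty.

```python
def estimate_mixture(clusters):
    """Estimate mixture parameters of one state from its M clusters.

    clusters[m] is V_im, the list of training vectors assigned to mixture m.
    Returns (weights, means, diagonal variances).
    """
    total = sum(len(vectors) for vectors in clusters)
    weights, means, variances = [], [], []
    for vectors in clusters:
        l_im = len(vectors)  # L_im, assumed > 0
        dim = len(vectors[0])
        mean = [sum(x[k] for x in vectors) / l_im for k in range(dim)]
        var = [sum((x[k] - mean[k]) ** 2 for x in vectors) / l_im for k in range(dim)]
        weights.append(l_im / total)
        means.append(mean)
        variances.append(var)
    return weights, means, variances
```

In practice a small variance floor is often applied so that a cluster of identical vectors does not produce a zero variance; that safeguard is omitted here for brevity.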

4) Repetition

Steps 2 and 3 are repeated until the convergence condition is satisfied.

The model used in the present invention is a context-independent phoneme model. Context independence means that the contexts preceding and following a phoneme are not taken into account when the phoneme model is built, which reduces the number of models. A total of 53 context-independent phonemes are adopted, with a left-to-right model for each phoneme. The 53 context-independent phonemes adopted are listed in Fig. 13.

Next, the training process using the HMM consists of input, initialization of the transition probabilities, re-estimation, and construction of the pronunciation dictionary. Each step is described below.

First, in the input step, speech is received sentence by sentence. The input is a list of input files in which each file name identifying an utterance is followed by its pronunciation dictionary entry, indicating which phonemes make up the sentence.

This makes it possible to decompose a sentence into phonemes.

Four input streams are used: the 12-dimensional MFCC vector, the first derivative of the MFCC, the second derivative of the MFCC, and the 2-dimensional vector of the first and second derivatives of the log power.

In the next step, the transition probabilities of the HMM are initialized uniformly.

Next, in re-estimation, the HMM parameters are re-estimated with the segmental K-means algorithm described above. In constructing the pronunciation dictionary, the 53 phoneme models are used to build sentence models, and the transcription is written as the sentence is pronounced. Silence models are also inserted at the beginning, the end, and the middle of each sentence.
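Building a sentence model from a pronunciation dictionary can be sketched as follows. The phoneme labels, the "sil" symbol, and the placement of the mid-sentence silence between words are all assumptions made for illustration; the actual phoneme set is the one listed in Fig. 13.

```python
def sentence_model(words, lexicon, silence="sil"):
    """Concatenate phoneme models for a sentence, with silence models added.

    words   : list of words in the sentence
    lexicon : dict mapping each word to its phoneme list (as pronounced)
    """
    phones = [silence]                 # silence at the beginning
    for idx, word in enumerate(words):
        if idx > 0:
            phones.append(silence)     # silence in the middle (between words)
        phones.extend(lexicon[word])
    phones.append(silence)             # silence at the end
    return phones
```

For a hypothetical lexicon {"a": ["x", "y"], "b": ["z"]}, the sentence "a b" becomes sil, x, y, sil, z, sil.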

The training model obtained by the above training process for continuous sentence recognition is stored in the training model unit (50). The telephone number guidance service based on continuous speech recognition is then performed through the speech recognition unit (60) and the telephone number guidance unit (70), as shown in Figs. 4 and 5.

That is, as shown in Fig. 4, when an inquirer takes the telephone off-hook and dials the inquiry number in order to find a professor's telephone number in a given department, the telephone number guidance system goes off-hook (S10). The system then plays a system introduction message (for example, 'This telephone number guidance system provides the telephone numbers of professors at 00 University') (S20), followed by a voice prompt asking for the name of the desired department and professor (for example, 'Please say the department and name of the professor you want') (S30). If the user speaks within a predetermined time, the process proceeds to the continuous speech recognition step (S60). If the user does not speak within the predetermined time, the system checks whether the set time has elapsed (S50). If the set time has not elapsed, the process returns to step S30 and the prompt for the department and professor name is played again. If the set time has elapsed, a termination message (for example, 'Please start again from the beginning' or 'The telephone number guidance service will now end') is played and the service ends (S90).

Continuous speech recognition in the case where the user speaks the desired department and professor name within the predetermined time proceeds as follows.

First, as described above, a sentence structure in which the user's utterance is expected is defined (S61), and the sentences that can be generated from that structure are defined (S62). Next, speech data needed for training is built from 10 sample sentences chosen from among these sentences (more or fewer than 10 may be used; 10 is used in this embodiment) (S63). The constructed speech data is then stored as a training model in the training model unit (50), and when speech is subsequently input by the user, the input speech is compared with the stored training model to recognize the continuous speech (S65). That is, the speech input by the user is passed in sequence through the analog/digital converter (10), the endpoint detector (20), the feature extractor (30), and the Viterbi decoder (40) to the speech recognition unit (60). The speech recognition unit (60) recognizes the input speech by comparing it with the reference model stored in the training model unit (50) and passes the recognition result to the telephone number guidance unit (70) (S70). The telephone number guidance unit (70) then outputs by voice the telephone number of the professor in the department corresponding to the recognition result. After that, a thank-you message (for example, 'Thank you for using this service') is played to the user (S80), a termination message is played, and the telephone number guidance service ends (S90).
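The call flow of steps S10 to S90 can be sketched as a small control loop. This is an illustration only: the callables listen, recognize, and lookup, the timing values, and the event labels are all hypothetical, and the recognition and lookup steps are stand-ins for the units (60) and (70).

```python
def run_call(listen, recognize, lookup, wait_per_prompt=5.0, set_time=15.0):
    """Simulate one call and return the list of (step, message) events.

    listen(timeout) returns the user's utterance, or None if nothing was said
    within the timeout. recognize(utterance) and lookup(result) are callables.
    """
    events = [("S10", "off-hook"), ("S20", "system introduction message")]
    waited = 0.0
    while True:
        events.append(("S30", "prompt for department and professor name"))
        utterance = listen(wait_per_prompt)
        if utterance is not None:
            break
        waited += wait_per_prompt
        if waited >= set_time:                      # S50: set time elapsed
            events.append(("S90", "termination message"))
            return events
    result = recognize(utterance)                   # S60: continuous speech recognition
    events.append(("S70", lookup(result)))          # telephone number is announced
    events.append(("S80", "thank-you message"))
    events.append(("S90", "termination message"))
    return events
```

With the default values, a caller who never speaks hears the prompt three times before the termination message is played.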

As described above, the present invention has the advantage that telephone numbers can be provided automatically at medium-sized institutions (universities, hospitals, government offices, and so on) through continuous speech recognition of Korean. Continuous speech recognition also encourages more natural speech from the user, and frees the user from the constraint found in existing speech recognition products (namely, that the user may speak only pre-stored words), so that the service can be used as conveniently as possible.

Claims

A telephone number guidance method using speech recognition, comprising:

when a user dials and the telephone number guidance system goes off-hook, outputting a system introduction message and a message requesting voice input of a desired department and professor name;

when the user speaks the department and professor name after the voice input request, recognizing the continuous speech by comparing the input continuous speech with a training model trained on sentences that are defined in advance as expected utterances;

passing the recognition result for the input speech to a telephone number guidance unit so that the telephone number of the professor in the corresponding department is output; and

outputting a termination message after the telephone number of the professor in the corresponding department has been output.

The method of claim 1, further comprising: checking whether a set time has elapsed if no speech is input within a predetermined time after the voice input of the desired department and professor name is requested; outputting again the message requesting the desired department and professor name if the set time has not elapsed; and proceeding to the step of outputting the termination message if the set time has elapsed without speech input.

The method of claim 1, wherein the recognizing of the speech comprises: defining a sentence structure in which the user's utterance is expected; defining sentences that can be generated from the defined sentence structure; building speech data needed for training from 10 sample sentences among the defined sentences; storing the built speech data as a training model in a training model unit; and, when speech is input by the user, recognizing the continuous speech by comparing the input speech with the stored training model.