KR20060065262A

KR20060065262A - Storage method for nametag in speaker dependent speech recognition system

Info

Publication number: KR20060065262A
Application number: KR1020040104072A
Authority: KR
Inventors: 김시내; 정두경
Original assignee: 엘지전자 주식회사
Priority date: 2004-12-10
Filing date: 2004-12-10
Publication date: 2006-06-14

Abstract

The present invention relates to a speaker-dependent speech recognition system and, in particular, detects the endpoint of the input speaker-dependent speech and stores a speech waveform from which the unnecessary silence between the start of name tag recording and the start of the utterance has been removed as far as possible. In the name tag storage method according to the invention, when a speaker's voice is input, the endpoint of the input voice signal is detected to separate silence from speech; the speech section of the signal is detected and the speech start time is compared with a preset reference time; and, depending on how far the speech start time differs from the reference time, the speech start time is shifted forward by a predetermined multiple of the reference time, and the speech waveform with the silence of the shifted interval removed is stored. This prevents memory from being wasted on unnecessary silence when name tags are stored in a speaker-dependent speech recognition system and improves the system's response characteristics.

Speaker-dependent speech recognition, name tag, silence

Description

STORAGE METHOD FOR NAMETAG IN SPEAKER DEPENDENT SPEECH RECOGNITION SYSTEM

FIG. 1 is a block diagram of a speaker-dependent speech recognition navigation system according to the present invention.

FIG. 2 is a block diagram of the speech processing process for speaker-dependent speech recognition according to the present invention.

FIG. 3 is a flowchart of a name tag storage method in a speaker-dependent speech recognition system according to an embodiment of the present invention.

FIG. 4(a) shows a first voice utterance form according to the present invention, and FIG. 4(b) is a waveform diagram showing an example of name tag storage for the first voice utterance form.

FIG. 5(a) shows a second voice utterance form according to the present invention, and FIG. 5(b) is a waveform diagram showing an example of name tag storage for the second voice utterance form.

<Description of reference numerals for the main parts of the drawings>

11 ... GPS satellite    13 ... Receiver

21 ... Remote control transmitter    22 ... Remote control receiver

30 ... Microphone    40 ... Preprocessing unit

50 ... Voice processing unit    51 ... Endpoint detection unit

60 ... Memory    70 ... CD-ROM drive

80 ... Display unit    90 ... Control unit

The present invention relates to a speaker-dependent speech recognition system and, more particularly, to a method of storing name tags in such a system that detects the endpoint of speaker-dependent speech and, when a name tag is stored, saves a speech waveform from which the unnecessary silence between the start of recording and the start of the utterance has been removed as far as possible.

Speech information processing technology is a collective term for technologies that analyze a speech signal uttered by a person to grasp its meaning or to determine the identity of the speaker, and for technologies that generate human speech from information in text form.

In other words, it is technology that allows speech to be used as a means of communication between people and machines. Speech is the most natural human communication tool and therefore holds an important place in the human-machine interface, and the need for speech information processing technology to realize such interfaces is growing.

Speech recognition systems are classified, according to the speakers they are designed to recognize, into speaker-independent systems and speaker-dependent systems.

First, a speaker-dependent system recognizes the voice of a specific speaker; a typical example is the voice dialing system now built into mobile phones. In a speaker-dependent system, the user's voice is generally stored and registered before the system is used, and at recognition time a pattern matching technique compares the pattern of the input speech with the stored speech patterns.

For example, the voice phone card service offered by Sprint stores up to 30 names; when placing a call the user only has to say a person's name, and the service finds that person's telephone number and dials it automatically.

Second, a speaker-independent system recognizes the voices of an unspecified number of speakers, so the user is spared the inconvenience of registering speech before the system is used, as a speaker-dependent system requires. A speaker-independent system collects speech from many speakers to train a statistical model and performs recognition with the trained model. As a result, the characteristics peculiar to each speaker fade and the characteristics common to all speakers are emphasized.

For example, AT&T applies this to collect calls, recognizing the called party's answer as to whether he or she will accept the collect call.

Of these, speaker-dependent speech recognition has a higher recognition rate than speaker-independent recognition and is therefore easier to put to practical use. For the same vocabulary and the same amount of training data, a speaker-dependent system generally outperforms a speaker-independent one.

In conventional speaker-dependent speech recognition, the user records his or her own voice and recognition is performed using speech feature parameters extracted from that recording. This is commonly used in applications that place calls by name, and such applications must keep not only the speech feature parameters of the recording but also the speech waveform itself, compressed by some method, in non-volatile memory.

The stored compressed waveform is used to confirm the recognition result, and playing the recorded waveforms back also lets the user easily check which names are registered. A text-to-speech (TTS) synthesizer could do this easily, but a system without a synthesizer must have something to provide playback, so the stored waveform is indispensable.

However, the conventional method of storing speech in a speaker-dependent speech recognition system usually fixes the recording time for the speaker's voice (that is, the name tag) at a set duration (for example, 3 to 5 seconds) and accepts the speaker's voice for that duration. Because every recording has this fixed length, an embedded solution spends memory on data it does not need, which is wasteful from the standpoint of system optimization.

For example, storing about 100 name tags requires enough memory for about 100 five-second recordings. Even if the stored waveform is compressed with an encoding technique, unnecessary silence before and after the actual speech is stored along with it, so redundant data occupies memory needlessly.
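The scale of the problem can be estimated with a short calculation. The sketch below is illustrative only: it assumes uncompressed 16-bit mono PCM (the document stores compressed waveforms and does not give a bit depth), borrows the roughly 10 kHz sampling rate mentioned later in the description, and uses a hypothetical function name.

```python
def raw_storage_bytes(num_tags, seconds_per_tag,
                      sample_rate_hz=10_000, bytes_per_sample=2):
    """Uncompressed PCM bytes needed for fixed-length name tag recordings.

    sample_rate_hz follows the ~10 kHz figure in the description;
    bytes_per_sample=2 (16-bit mono) is an assumption, not from the source.
    """
    return num_tags * seconds_per_tag * sample_rate_hz * bytes_per_sample


fixed = raw_storage_bytes(100, 5)    # 100 name tags, 5 s each
trimmed = raw_storage_bytes(100, 4)  # same tags with 1 s of leading silence removed
print(fixed, trimmed, fixed - trimmed)  # 10000000 8000000 2000000
```

Under these assumptions, removing one second of leading silence from each of 100 five-second recordings saves about 2 MB before compression; the real saving depends on the codec used.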

A first object of the present invention is to remove unnecessary silence from the speech waveform that a speaker-dependent speech recognition system plays back to confirm whether a registered name, telephone number, or the like has been recognized correctly.

A second object of the present invention is to reduce memory size by storing the playback waveform with the silent section between the prompt and the start of the utterance removed as far as possible.

A third object of the present invention is to detect the endpoint of the input speech and compare it with a reference time so that unnecessary silence can be removed from the name tag.

To achieve the above objects, a speech storage method in a speaker-dependent speech recognition system according to the present invention is

a name tag storage method for a speaker-dependent speech recognition system, comprising:

when a speaker's voice is input, detecting the endpoint of the input voice signal and separating silence from speech;

detecting the speech section of the voice signal and comparing the speech start time with a preset reference time; and

according to how far the speech start time differs from the reference time in the comparison, shifting the speech start time forward by a predetermined multiple of the reference time and storing the speech waveform from which the silence of the shifted interval has been removed.

Preferably, when the comparison shows that the speech start time is later than the preset reference time, the speech start time is shifted forward by one times the reference time, and the speech waveform from which the silence of the shifted interval has been removed is stored.

Preferably, when the comparison shows that the speech start time is earlier than the preset reference time, the speech start time is shifted forward by one half of the reference time, and the speech waveform from which the silence of the shifted interval has been removed is stored.

Preferably, the silence is removed starting from the prompt start point of the name tag, over an interval equal to the predetermined multiple of the reference time.

A speaker-dependent speech recognition navigation system according to the present invention is described below.

FIG. 1 shows a speaker-dependent speech recognition navigation system according to the present invention.

Referring to FIG. 1, the system includes: a GPS antenna (12) that receives position signals from three or more GPS satellites (11); a receiver (13) that calculates the longitude and latitude of the current point from the position signals received by the GPS antenna (12); a remote control transmitter (21) for remotely controlling the navigation system; a remote control receiver (22) that receives the navigation system control signals sent by the remote control transmitter (21); a microphone (30) for voice input; a preprocessing unit (40) (hereinafter, the analog/digital converter) that converts the voice input through the microphone (30) to digital form and performs signal processing; a voice processing unit (50) that extracts the speech endpoint from the digital output of the analog/digital converter (40); a memory (60) that stores a user database of voice recognition name tags and the like; a CD-ROM drive (70) that reads data from a compact disc (71) containing map data; a display unit (80) that displays information; and a control unit (90) that controls the system, removes the silence of a predetermined section from a voice recognition name tag, and stores the result in the memory (60).

The speaker-dependent speech recognition system according to the present invention recognizes only the voice of a specific speaker; because it is trained on one person's speech signals, it achieves a higher recognition rate for that person's voice.

In speaker-dependent speech recognition, the user records his or her own voice (a telephone number, a name tag, and so on), and speech feature parameters are extracted from the recording and stored in non-volatile memory. In actual use, when speech is input, feature parameters are extracted in real time and recognition is performed against the stored parameters.

In a system without a speech synthesizer (TTS), there is no proper way to verify whether recognition was correct. Therefore the speech waveform itself is stored in the memory along with the speech feature parameters, so that after recognition it can be fed back to the user to confirm that recognition was correct. That is, once the speech "Hong Gil-dong" has been stored and registered in the memory, when the user says the name tag "Hong Gil-dong", the system plays back the feedback "Hong Gil-dong, is that right?"

Because storing the raw speech waveform would take up a great deal of memory in an embedded solution with limited memory resources, the waveform is compressed before it is stored.

FIG. 1 shows a navigation system to which the speaker-dependent speech recognition system according to the present invention is applied, and it operates as follows.

Referring to FIG. 1, the user speaks the name tag to be registered into the microphone (30) (S10). The voice input through the microphone (30) is converted to a digital signal by the analog-to-digital converter (40).

The analog/digital converter (ADC) (40) converts the input voice to digital form and subjects it to pre-processing. In pre-processing, the analog voice signal input through the microphone (30) is converted to a digital signal at a sampling frequency of about 10 kHz and then passed through steps such as pre-emphasis to improve the signal-to-noise (S/N) ratio.
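The document names pre-emphasis but does not give the filter. A common first-order form is y[n] = x[n] - a·x[n-1]; the sketch below assumes that form and a coefficient of 0.97, a typical textbook value that is not taken from the source. The function name is hypothetical.

```python
def pre_emphasis(samples, alpha=0.97):
    """First-order pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    alpha=0.97 is an assumed, commonly used value; the first sample
    is passed through unchanged because it has no predecessor.
    """
    if not samples:
        return []
    out = [samples[0]]
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


# A constant (low-frequency) input is attenuated after the first sample.
print(pre_emphasis([1.0, 1.0, 1.0], alpha=0.5))  # [1.0, 0.5, 0.5]
```

The filter boosts high frequencies relative to low ones, which is why it is usually applied before feature extraction.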

After pre-processing the input voice, the analog/digital converter (40) outputs it to the voice processing unit (50). When speech arrives, the voice processing unit (50) performs endpoint detection through the endpoint detection unit (51), as shown in FIG. 2, to distinguish silence from speech in the input signal. The voice processing unit (50) then uses a speech feature detection unit (not shown) to detect the feature parameters of the speech that has passed through the endpoint detection unit (51); the detected endpoints thus serve as an input to obtaining the speech feature parameters.

If the speech section contains unnecessary silence, the time required for speech recognition increases, which is why endpoint detection is used to separate silence from speech. Endpoint detection is performed by estimating the noise level.
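The document does not specify the endpoint detector beyond saying that it estimates the noise level. One simple way to do that, shown here only as a sketch, is to measure short-time energy per frame, take the first few frames as the noise estimate, and mark frames whose energy exceeds a multiple of that estimate as speech. The frame length, the number of noise frames, the threshold factor, and all function names are assumptions.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of each full, non-overlapping frame."""
    return [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def detect_endpoints(samples, frame_len=100, noise_frames=5, factor=4.0):
    """Return (start_index, end_index) of the speech region, or None.

    Assumes the first `noise_frames` frames contain only background
    noise (the recording starts with silence after the prompt).
    """
    energies = frame_energies(samples, frame_len)
    if len(energies) <= noise_frames:
        return None
    noise = sum(energies[:noise_frames]) / noise_frames
    threshold = max(noise * factor, 1e-9)  # floor avoids a zero threshold
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len


signal = [0.0] * 1000 + [1.0] * 500 + [0.0] * 500
print(detect_endpoints(signal))  # (1000, 1500)
```

At a 10 kHz sampling rate, the 100-sample frames used here correspond to 10 ms each.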

The endpoint detection unit (51) thus detects the silent and voiced portions of the signal separately and passes them, together with their timing information, to the control unit (90). Using the speech timing information and the silence timing information, the control unit (90) performs a silence removal operation that shifts the speech start time forward.

In other words, according to the timing of the speech start, the control unit (90) uses the reference time to shift the speech start time forward, removing a portion of the leading silence, and then stores the resulting speech waveform.

FIG. 3 is a flowchart of the speech storage method in a speaker-dependent speech recognition system according to an embodiment of the present invention.

Referring to FIG. 3, when the speaker's voice is input after the prompt for name tag registration (S11), silence and speech in the voice signal are detected through endpoint detection (S13). The speech section of the signal is detected (S15), and the speech start time is compared with a preset reference time to check whether it falls within the reference time (S17). If it exceeds the reference time, the speech start time is shifted forward by an interval equal to a multiple of the reference time (S19); the silent section corresponding to the shifted interval is thereby removed, and the speech waveform with that silence removed is stored in the memory (S21).

If the speech start time is less than or equal to the reference time (S23), the speech start time is shifted forward by an interval equal to one half of the reference time (S25), and the speech waveform from which roughly that much silence has been removed is stored (S27).

That is, in FIG. 3 at least one reference time is set in advance, it is determined whether the utterance starts before or after that reference time, and according to the result the utterance start is shifted forward by a predetermined multiple of the reference time, namely one times or one half of the reference time, so that unnecessary silence is removed as far as possible. In addition, the comparison of the utterance start with the reference time and the silence removal operation are performed at least once, so that the stored name tag contains as little silence as possible.
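The comparison step of FIG. 3 (S17 to S25) reduces to a single branch. The sketch below works in integer milliseconds to avoid floating-point rounding and uses a 1-second reference time, the example value given in the description; the function name is hypothetical.

```python
def silence_to_remove_ms(speech_start_ms, reference_ms=1000):
    """One pass of the comparison step.

    If speech starts later than the reference time, remove one full
    reference time of leading silence; otherwise remove half of it.
    """
    if speech_start_ms > reference_ms:
        return reference_ms
    return reference_ms // 2


print(silence_to_remove_ms(2000))  # 1000
print(silence_to_remove_ms(850))   # 500
```

A start time exactly equal to the reference time takes the half-reference branch, following the "less than or equal" condition at step S23.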

In general, recordings of spoken names show that the speech begins some time after recording starts. The time between the system issuing the prompt that invites the user to speak and the user actually uttering the speech (for example, a name tag) varies, but in most recordings the utterance (that is, the speech start time) comes only after a certain minimum time has elapsed since the prompt. Accordingly, the size of the stored waveform can be reduced by removing a silent section of a certain length from the prompt start point.

FIGS. 4 and 5 show examples of silence removal according to the present invention, in which utterances are broadly classified into two forms and the silent section of each is removed.

In FIG. 4(a), speech appears after T1 (about 0.85 seconds) has elapsed from the start of recording, while FIG. 5(a) shows an example in which the utterance begins after T3 (about 2 seconds) has elapsed from the start of recording.

Analysis of many waveforms in such experiments shows that even the earliest utterances almost never begin less than 0.5 seconds after recording starts; in other words, a silence gap of at least about 0.5 seconds exists between the system's voice recognition prompt and the user's utterance.

As FIGS. 4(a) and 5(a) show, users speak with varied timing, so the speech must be extracted in a consistent way and stored in memory, that is, in the non-volatile memory area, if memory size is to be reduced. Removing the silent section that precedes the utterance start points (P1, P2) before storing reduces memory size and, because the silent portion is gone, also speeds up the system's response at playback.

To this end, the speech is extracted by detecting the speech and silence sections only roughly from the speech endpoints. Rough detection is used because an incorrectly detected endpoint would cause part of the speech to be cut off.

The speech is not stored starting exactly at the point judged to be speech. Instead, the speech start time (P1) is compared with the reference time (R1), and when the speech start time (P1) is earlier than the reference time (R1), the signal is shifted forward by half (1/2) of the reference time (for example, 0.5 seconds for a reference time of 1 second). That is, about half the reference time (for example, 0.5 seconds) of silence is removed from the front of the voice signal of FIG. 4(a), so that the speech start appears at T1' = T1 - R1/2 as shown in FIG. 4(b), and the voice signal waveform with the forward-shifted speech start (P1') is stored in the memory.

If, as in FIG. 5(a), the speech start time (P2) exceeds the reference time (R1), silence equal to the full reference time can be removed.

That is, the speech start time (P2) is compared with the reference time (R1), and when the speech start time (P2) is later than the reference time, the leading silent section corresponding to the reference time (R1) is removed. In FIG. 5(a) the speech starts (P2) after about 2 seconds, so about 1 second of silence, corresponding to the reference time (R1), is removed, and the voice signal waveform with the speech start at T3' = T3 - R1, as shown in FIG. 5(b), is stored in the memory.
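Applied to a sample buffer, the two cases of FIGS. 4 and 5 amount to dropping a prefix of the recording. The sketch below assumes a 10 kHz sampling rate and a 1-second reference time (both example values from the description) and adds one safeguard that is not in the source: the cut is clamped so that it never reaches into the detected speech. All names are hypothetical.

```python
SAMPLE_RATE_HZ = 10_000  # ~10 kHz, as in the pre-processing description


def trim_name_tag(samples, speech_start_idx, reference_ms=1000,
                  sample_rate_hz=SAMPLE_RATE_HZ):
    """Drop leading silence; return (trimmed_samples, new_speech_start_idx)."""
    ref = reference_ms * sample_rate_hz // 1000       # reference time in samples
    cut = ref if speech_start_idx > ref else ref // 2
    cut = min(cut, speech_start_idx)                  # assumption: never cut speech
    return samples[cut:], speech_start_idx - cut


# FIG. 4 case: speech starts at about 0.85 s, half the reference is removed.
fig4 = [0] * 8500 + [1] * 1000
trimmed4, start4 = trim_name_tag(fig4, 8500)
print(start4)  # 3500, i.e. T1' = T1 - R1/2 = 0.35 s

# FIG. 5 case: speech starts at about 2 s, the full reference is removed.
fig5 = [0] * 20000 + [1] * 1000
trimmed5, start5 = trim_name_tag(fig5, 20000)
print(start5)  # 10000, i.e. T3' = T3 - R1 = 1 s
```

The stored buffer shrinks by exactly the number of samples cut, which is where the memory saving comes from.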

The silence removal operation may be performed just once on every voice signal, or it may be repeated. If it is performed twice and the speech start time is near 2 seconds, about 1 second of silence is removed first, and then a further silent section (about 0.5 seconds) is removed.

If the speech in a recording starts within the first 0.5 seconds, an error message is sent to the user requesting a new recording.
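The repeated-pass variant and the re-recording check can be combined in one routine. The sketch below is one possible reading of the text: it works in integer milliseconds, uses a 1-second reference time and a 0.5-second minimum start time as in the examples, and stops early if a further cut would reach into the speech (a safeguard assumed here, not stated in the source). The exception class and function name are hypothetical.

```python
class ReRecordRequired(Exception):
    """Raised when speech starts too early and a new recording is needed."""


def total_silence_removed_ms(speech_start_ms, reference_ms=1000,
                             passes=1, min_start_ms=500):
    """Total leading silence removed over `passes` comparison passes."""
    if speech_start_ms < min_start_ms:
        raise ReRecordRequired("speech starts too early; please record again")
    removed = 0
    start = speech_start_ms
    for _ in range(passes):
        cut = reference_ms if start > reference_ms else reference_ms // 2
        if cut > start:  # assumed safeguard: do not cut into the speech
            break
        start -= cut
        removed += cut
    return removed


print(total_silence_removed_ms(2000, passes=2))  # 1500 (1 s, then 0.5 s)
print(total_silence_removed_ms(850, passes=2))   # 500 (second pass skipped)
```

With a start near 2 seconds and two passes, this reproduces the sequence in the text: about 1 second is removed first, then a further 0.5 seconds.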

In other words, the speech start time is about 0.8 seconds in FIG. 4(a), but in FIG. 4(b) 0.5 seconds of silence has been removed and the speech starts at about 0.3 seconds. Likewise, the speech starts after about 2 seconds in FIG. 5(a), but in FIG. 5(b) about 1 second of silence has been removed and the speech starts at about 1 second. Because the silent section is removed from the original voice signal, the voice response at playback is faster.

The present invention has been described above with reference to its preferred embodiments. A person of ordinary skill in the art to which the invention pertains will be able to implement embodiments that differ from this detailed description while remaining within the essential technical scope of the invention. The essential technical scope of the invention is set out in the claims, and all differences within the equivalent scope should be construed as being included in the invention.

As described above, with the speech storage method in a speaker-dependent speech recognition system according to the present invention, the silent section is removed when the user stores a name tag to be used for playback, which not only reduces memory size but also speeds up the system's response.

Claims

A name tag storage method for a speaker-dependent speech recognition system, the method comprising:

detecting, when a speaker's voice is input, the endpoint of the input voice signal and separating silence from speech;

detecting the speech section of the voice signal and comparing the speech start time with a preset reference time; and

according to how far the speech start time differs from the reference time in the comparison, shifting the speech start time forward by a predetermined multiple of the reference time and storing the speech waveform from which the silence of the shifted interval has been removed.

The method of claim 1,

wherein, when the comparison shows that the speech start time is later than the preset reference time, the speech start time is shifted forward by one times the reference time, and the speech waveform from which the silence of the shifted interval has been removed is stored.

The method of claim 1,

wherein, when the comparison shows that the speech start time is earlier than the preset reference time, the speech start time is shifted forward by one half of the reference time, and the speech waveform from which the silence of the shifted interval has been removed is stored.

The method of claim 1,

wherein the silence is removed, starting from the prompt start point of the name tag, over an interval equal to the predetermined multiple of the reference time.