KR20060027570A

KR20060027570A - Voice dialing apparatus and method using features of the voice

Info

Publication number: KR20060027570A
Application number: KR1020040076427A
Authority: KR
Inventors: 신동화; 김동관
Original assignee: 삼성전자주식회사
Priority date: 2004-09-23
Filing date: 2004-09-23
Publication date: 2006-03-28
Also published as: KR100647291B1

Abstract

The present invention discloses a voice dialing apparatus and method that use features of the voice. The apparatus, which dials a mobile communication terminal by voice rather than by button presses, comprises: an input interface that extracts voice feature parameters from the frame-by-frame packet data output by a vocoder; a voice section detector that detects a voice section using the voice feature parameters; a feature extractor that extracts at least one of the number of peaks and the peak level of the voice signal in the detected voice section as a feature of the voice signal for recognition; a voice register that registers the feature in a database in association with each telephone number; a voice recognizer that uses the feature to compare the voice uttered for voice dialing against the voice signals registered in the database and selects the closest one; and a result output unit that outputs the voice signal and telephone number corresponding to the recognition result.

Description

Voice dialing apparatus and method using features of the voice

FIG. 1 is a block diagram of a voice dialing apparatus using features of the voice according to the present invention.

FIG. 2 is a block diagram of the voice section detector 120 shown in FIG. 1.

FIG. 3 is an exemplary block diagram of the voice register 140 shown in FIG. 1.

FIG. 4 is a block diagram of an embodiment, according to the present invention, of the voice recognizer 150 shown in FIG. 1.

The present invention relates to voice dialing in a mobile terminal, and more particularly to a voice dialing apparatus and method using features of the voice that run in real time in software alone, without adding hardware to an existing terminal, by using the feature parameters of each vocoder rather than the voice signal itself.

Voice dialing applies speech recognition to telephone control. Conventional voice dialing takes the voice itself as input, as 16-bit PCM data sampled at 8 kHz or an equivalent. To implement voice dialing on a mobile phone in software alone, without additional hardware, however, a way must be found to use the output of the phone's vocoder, because receiving the voice directly requires not only an A/D converter but also a considerable amount of additional memory.

Moreover, reconstructing the voice from the vocoder output generally demands more computation than a software-only implementation can afford. Hands-free voice dialing raises a further problem: background noise. To place a call from a car traveling on a highway, a hands-free device must be used, and its microphone is typically mounted on the sun visor. Speech uttered in this environment is picked up together with the various noises of the highway. Measured inside a car traveling at 100 km/h or more, the signal-to-noise ratio (SNR) is about -6 dB to 6 dB, which means the voice and the noise are at comparable levels. The recognizer must therefore cope with two problems at once: it has to work from vocoder packet data rather than the voice itself, and it has to operate under varied, high-level noise.
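To make the SNR figures concrete, the following minimal sketch converts decibels to linear ratios using the standard definitions. The function names are illustrative and do not come from the patent.

```python
def snr_db_to_power_ratio(snr_db: float) -> float:
    """Signal power divided by noise power for a given SNR in dB."""
    return 10.0 ** (snr_db / 10.0)


def snr_db_to_amplitude_ratio(snr_db: float) -> float:
    """Signal amplitude divided by noise amplitude for a given SNR in dB."""
    return 10.0 ** (snr_db / 20.0)


if __name__ == "__main__":
    for snr in (-6.0, 0.0, 6.0):
        print(snr,
              round(snr_db_to_power_ratio(snr), 2),
              round(snr_db_to_amplitude_ratio(snr), 2))
```

At -6 dB the noise carries roughly four times the power of the voice, at 6 dB the voice carries roughly four times the power of the noise, and at 0 dB the two are equal, which is the sense in which the two levels are comparable.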

Many techniques are known for removing noise before speech recognition in noisy environments, such as spectral subtraction, but they all start from the assumption that PCM data is available.

Meanwhile, speech recognition using the gain-related parameters among the feature parameters produced by the vocoder achieves a recognition rate of up to about 90% in low-noise environments, but can guarantee no more than 40% in noisy environments.

The technical problem addressed by the present invention is to provide a voice dialing apparatus and method using features of the voice that can add a voice dialing function to an existing mobile phone in software alone, without additional hardware, by using voice features such as the number of peaks and/or the peak level of the voice.

To solve the above technical problem, a voice dialing apparatus using features of the voice according to the present invention, which dials a mobile communication terminal by voice without button presses, preferably comprises: an input interface that extracts voice feature parameters from the frame-by-frame packet data output by a vocoder; a voice section detector that detects a voice section using the voice feature parameters output from the input interface; a feature extractor that extracts at least one of the number of peaks and the peak level of the voice signal in the detected voice section as a feature of the voice signal for recognition; a voice register that registers the feature extracted by the feature extractor in a database in association with each telephone number; a voice recognizer that uses the feature extracted by the feature extractor to compare the voice uttered for voice dialing against the voice signals registered in the database and selects the closest voice signal; and a result output unit that outputs the voice signal and telephone number corresponding to the recognition result.

To solve the other technical problem, a voice dialing method using features of the voice according to the present invention, which dials a mobile communication terminal by voice, preferably comprises: converting the packet stream signal output from a vocoder into an unpacked parameter stream signal; detecting a section of the unpacked parameter stream signal; detecting at least one of the number of peaks and the peak level of the voice signal present in the detected signal as a feature of the voice signal; a voice registration step of storing the feature of the voice signal in memory in association with each telephone number; a voice recognition step of selecting, from the registered voice signals, the one closest to the uttered voice; and outputting the voice signal and telephone number corresponding to the recognition result.

Hereinafter, a voice dialing apparatus using features of the voice according to the present invention is described with reference to the accompanying drawings.

FIG. 1 is a block diagram of a voice dialing apparatus using features of the voice according to the present invention. The apparatus comprises an input interface 110, a voice section detector 120, a feature extractor 130, a voice register 140, a voice recognizer 150, and a result output unit 160.

The input interface 110 unpacks the vocoder's packet data, interprets it, and passes the result to the voice section detector 120. It extracts the feature parameters from each frame's packet data within 1 to 2 msec.

The voice section detector 120 detects the voice section using the gain characteristics of the parameters produced by the input interface 110.

The feature extractor 130 extracts at least one of the number of peaks and the peak level of the voice signal in the voice section detected by the voice section detector 120 as a feature of the voice signal for recognition. The number of peaks of the voice signal in the voice section differs from one telephone number to another, and so does the level of those peaks. The feature extractor 130 may extract only the number of peaks as the feature, only the peak level, or both.
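The patent does not specify how peaks are located. As a minimal sketch, assume the voice section is represented by a per-frame gain envelope and that a peak is a local maximum at or above a minimum level. The function name `extract_peak_features` and the parameter `min_level` are hypothetical.

```python
from typing import List, Sequence, Tuple


def extract_peak_features(envelope: Sequence[float],
                          min_level: float = 0.0) -> Tuple[int, List[float]]:
    """Return (number of peaks, list of peak levels) for one voice section.

    Assumption: a peak is a sample strictly greater than its left neighbour,
    not smaller than its right neighbour, and at least `min_level`.
    """
    levels: List[float] = []
    for k in range(1, len(envelope) - 1):
        rising = envelope[k] > envelope[k - 1]
        not_rising_after = envelope[k] >= envelope[k + 1]
        if rising and not_rising_after and envelope[k] >= min_level:
            levels.append(envelope[k])
    return len(levels), levels


if __name__ == "__main__":
    # Two bumps in the envelope give two peaks with levels 3 and 5.
    print(extract_peak_features([0, 3, 1, 5, 5, 2, 0]))
```

Either element of the returned pair can serve as the feature on its own, or both can be used together, matching the three options described above.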

The voice register 140 stores the feature of the voice signal corresponding to each telephone number in memory (flash ROM).

The voice recognizer 150 computes how similar the current utterance is to the entries stored in the registration database 170. Recognition is performed with one utterance or, if necessary, two.

Before placing the call, the result output unit 160 plays back to the user the voice corresponding to the recognition result so that the user can confirm it.

The operation of the present invention is described below on the basis of the configuration above.

The present invention relates to voice dialing for mobile phones. Rather than taking the voice itself as input, it extracts the features of the voice signal for recognition and detects the voice portion from arbitrary codec packet values, enabling an interface between human and machine.

The present invention uses only the voice feature parameters generated by vocoders such as QCELP, EVRC, and RPE-LTP. The input interface 110 unpacks the vocoder's packet data, interprets it, and passes the result to the voice section detector 120, extracting the feature parameters from each frame's packet data within 1 to 2 msec.

FIG. 2 is a block diagram of the voice section detector 120 shown in FIG. 1. It comprises an input interface unit 210, a pseudo signal generator 220, a pseudo gain generator 230, a gain average generator 240, a frame state determiner 250, and a post-processor 260.

The pseudo signal generator 220 generates a pseudo signal from the feature parameters. To do so, it applies gain filtering to a white-noise pulse using the vocoder gain value and then applies pitch filtering using the pitch information. The resulting signal is not speech that a human ear could recognize, but it carries loudness and frequency information, the basic information for distinguishing speech. Equation 1 generates the pseudo signal.

$$x(i) = w(i) \cdot G + x(i - L) \cdot B \tag{1}$$

Here, i is the time variable in samples, G is the per-frame gain, L is the per-frame pitch, B is the per-frame pitch gain, x(i) is the pseudo signal, and w(i) is the white-noise signal.
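Equation 1 can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the white noise is drawn uniformly from [-1, 1], samples before the start of the signal are taken as zero, the pitch L is an integer lag in samples, and the function name and the `seed` parameter are hypothetical.

```python
import random
from typing import List, Sequence


def synthesize_pseudo_signal(gains: Sequence[float],
                             pitches: Sequence[int],
                             pitch_gains: Sequence[float],
                             frame_size: int = 160,
                             seed: int = 0) -> List[float]:
    """Generate x(i) = w(i) * G + x(i - L) * B frame by frame.

    gains, pitches and pitch_gains hold one value per frame (G, L, B).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    x: List[float] = []
    for G, L, B in zip(gains, pitches, pitch_gains):
        if L < 1:
            raise ValueError("pitch lag L must be at least 1 sample")
        for _ in range(frame_size):
            i = len(x)
            w = rng.uniform(-1.0, 1.0)            # white-noise sample w(i)
            past = x[i - L] if i - L >= 0 else 0.0  # x(i - L), zero before start
            x.append(w * G + past * B)
    return x


if __name__ == "__main__":
    signal = synthesize_pseudo_signal([2.0, 0.5], [40, 55], [0.3, 0.3])
    print(len(signal))  # two frames of 160 samples each
```

A pitch gain B below 1 in magnitude keeps the feedback term from growing without bound; that is an assumption about typical vocoder values rather than something the text states.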

In this way, the signal reconstruction stage of each vocoder can be replaced, allowing the voice detector to be implemented in real time on a mobile phone.

The pseudo gain generator 230 takes the absolute value of the generated pseudo signal and sums it over one frame. An exact gain would require squaring each sample, summing, and taking the square root, but multiplication and square roots cannot be used in a real-time implementation. Equation 2 generates the gain of the pseudo signal.

$$s(j) = \operatorname{abs\_sum}\bigl(x(jI) \sim x((j+1)I - 1)\bigr) = \sum_{i=jI}^{(j+1)I-1} \lvert x(i) \rvert \tag{2}$$

Here, j is the frame index, j = ⌊i / I⌋, and I is the frame size, typically 160 samples (20 msec). abs_sum(x(k) ~ x(m)) denotes the sum of the absolute values of the pseudo signal from the k-th sample through the m-th sample.
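The per-frame gain can be sketched as below, assuming frame j covers samples j·I through (j+1)·I − 1 and that any trailing partial frame is dropped. The function name is illustrative.

```python
from typing import List, Sequence


def frame_pseudo_gains(x: Sequence[float], frame_size: int = 160) -> List[float]:
    """Sum of absolute sample values per frame: s(j) for j = 0, 1, ...

    Only additions and absolute values are used, with no multiplication
    or square root, in line with the real-time constraint described above.
    """
    n_frames = len(x) // frame_size  # trailing partial frame is ignored
    return [
        sum(abs(v) for v in x[j * frame_size:(j + 1) * frame_size])
        for j in range(n_frames)
    ]


if __name__ == "__main__":
    # Tiny frames of 2 samples to keep the arithmetic visible.
    print(frame_pseudo_gains([1, -1, 2, -2, 3, -3], frame_size=2))  # [2, 4, 6]
```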

The gain average generator 240 computes the average over four consecutive frames, as in Equation 3. Four consecutive frames are used because that is the span over which the characteristics of the voice are best preserved.

$$nG(j) = \frac{s(j-3) + s(j-2) + s(j-1) + s(j)}{4} \tag{3}$$

The per-frame gain of the pseudo signal fluctuates widely over time, so detecting the voice directly from the result of Equation 2 raises the probability of error. The output of Equation 3 varies more steadily than that of Equation 2, which makes it easier to separate the voice portion from the noise portion.
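A minimal sketch of the four-frame average follows. The text does not say how the first three frames are handled; here frames before the start are assumed to contribute zero, and the function name is hypothetical.

```python
from typing import List, Sequence


def averaged_gains(s: Sequence[float], window: int = 4) -> List[float]:
    """nG(j) = (s(j-3) + s(j-2) + s(j-1) + s(j)) / 4 for window = 4.

    Assumption: s(j) is taken as 0 for j < 0, so the first outputs ramp up.
    """
    out: List[float] = []
    running = 0.0
    for j, value in enumerate(s):
        running += value
        if j >= window:
            running -= s[j - window]  # drop the frame that left the window
        out.append(running / window)
    return out


if __name__ == "__main__":
    print(averaged_gains([4, 4, 4, 4, 8]))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

With integer gains and a window of four, the division could be replaced by a two-bit right shift, which fits the stated goal of avoiding costly arithmetic; that substitution is a suggestion, not something the source describes.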

The frame state determiner 250 uses the averaged gain of the pseudo signal and the results of past state decisions to determine whether the current frame is voice, a silent interval between voiced portions, or background noise. This first-pass voice section decision yields start and end positions that include the silent intervals between voiced portions.
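The decision rule itself is not given in the text. The sketch below is one plausible reading: a fixed gain threshold marks voice, and a below-threshold frame that follows voice counts as an in-utterance pause for a limited number of frames. The threshold, the `max_pause_frames` limit, and all names are assumptions.

```python
from enum import Enum
from typing import List, Optional, Sequence, Tuple


class FrameState(Enum):
    NOISE = 0   # background noise
    SPEECH = 1  # voice
    PAUSE = 2   # silence between voiced portions


def classify_frames(avg_gains: Sequence[float],
                    speech_threshold: float,
                    max_pause_frames: int = 15) -> List[FrameState]:
    """Label each frame using its averaged gain and the previous state."""
    states: List[FrameState] = []
    prev = FrameState.NOISE
    pause_run = 0
    for g in avg_gains:
        if g >= speech_threshold:
            state, pause_run = FrameState.SPEECH, 0
        elif prev in (FrameState.SPEECH, FrameState.PAUSE) and pause_run < max_pause_frames:
            state, pause_run = FrameState.PAUSE, pause_run + 1
        else:
            state, pause_run = FrameState.NOISE, 0
        states.append(state)
        prev = state
    return states


def first_pass_section(states: Sequence[FrameState]) -> Optional[Tuple[int, int]]:
    """Start and end frame of the voice section, pauses in between included."""
    voiced = [k for k, s in enumerate(states) if s is FrameState.SPEECH]
    return (voiced[0], voiced[-1]) if voiced else None


if __name__ == "__main__":
    gains = [0, 0, 5, 6, 1, 1, 7, 0, 0, 0, 0]
    labels = classify_frames(gains, speech_threshold=3, max_pause_frames=2)
    print(first_pass_section(labels))  # (2, 6)
```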

The post-processor 260 uses the first-pass voice section result to correct the start and end positions more precisely. It re-runs the state determination stage within 10 frames before and after the first-pass start position and within 10 frames before and after the first-pass end position to correct the voice detection result.
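How the state determination is re-run is left open. As a rough sketch, assume the refinement re-examines the unaveraged per-frame gains inside a window of 10 frames on either side of each boundary and moves the boundary to the outermost frame that still meets a threshold. The function name and the threshold are hypothetical.

```python
from typing import Sequence, Tuple


def refine_boundaries(frame_gains: Sequence[float],
                      start: int,
                      end: int,
                      threshold: float,
                      margin: int = 10) -> Tuple[int, int]:
    """Re-check frames within `margin` of each first-pass boundary.

    Returns the earliest frame near `start` and the latest frame near `end`
    whose gain reaches `threshold`; a boundary is left unchanged when no
    frame in its window qualifies.
    """
    last = len(frame_gains) - 1

    lo, hi = max(0, start - margin), min(last, start + margin)
    new_start = next((k for k in range(lo, hi + 1)
                      if frame_gains[k] >= threshold), start)

    lo, hi = max(0, end - margin), min(last, end + margin)
    new_end = next((k for k in range(hi, lo - 1, -1)
                    if frame_gains[k] >= threshold), end)

    return new_start, new_end


if __name__ == "__main__":
    gains = [0] * 5 + [9] * 10 + [0] * 5
    # A first pass that clipped both ends is widened to the true extent.
    print(refine_boundaries(gains, start=7, end=12, threshold=5))  # (5, 14)
```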

The voice section detected by the voice section detector 120 is then used by the feature extractor 130 to extract the feature of the voice signal for recognition. The feature extracted by the feature extractor 130 is passed to the voice register 140 in registration mode and to the voice recognizer 150 in recognition mode.

The voice register 140 stores the feature of the voice signal in memory (flash ROM) in association with each telephone number. It also extracts voice features from the result of interpreting the vocoder packets.

FIG. 3 is an exemplary block diagram of the voice register 140 shown in FIG. 1. It comprises a registration database comparator 310, an utterance comparator 320, and a database storage unit 330.

The registration database comparator 310 shown in FIG. 3 compares the input utterance with the voices registered in the database 170 for similarity. If the utterance is not similar to any voice registered in the database, the utterance comparator 320 compares the features of the utterance with those of a repeated utterance. If the features compared by the utterance comparator 320 match, the database storage unit 330 stores the features of the utterance. The utterance comparator 320 may further include an additional utterance comparator (not shown) that, when the features of the utterance and the repeated utterance differ, takes one more input utterance and compares the features again.

Voice registration requires two or three utterances. In the registration process, when utterance 1 is input, its similarity to the entries already stored in the voice database is examined; if it is judged similar, utterance 1 is stored in the voice register 140 and the process returns to the beginning.

When utterance 2 is input, its similarity to utterance 1 is examined; if the two are judged similar, the features of utterance 1 and utterance 2 are stored in the voice register 140. If utterance 1 and utterance 2 are not similar, utterance 3 is input and compared with utterance 1 and utterance 2; if it is similar to either, the features of utterance 3 and of the matching utterance (utterance 1 or utterance 2) are stored in the voice register 140.

Here, the additional utterance comparator (not shown) corresponds to the step of comparing utterance 3 with utterance 1 or utterance 2. The registration database is kept in flash ROM so that it can be read and written at any time.
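The two-or-three-utterance agreement step can be sketched as follows. The similarity measure is not specified in the text; this sketch assumes fixed-length feature vectors compared by L1 distance against a threshold, and every name in it is hypothetical. The check against already-registered entries is left out.

```python
from typing import Callable, Optional, Sequence, Tuple

Feature = Sequence[float]


def l1_distance(a: Feature, b: Feature) -> float:
    """Sum of absolute differences; vectors of unequal length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(abs(p - q) for p, q in zip(a, b))


def choose_enrollment_pair(first: Feature,
                           second: Feature,
                           get_third: Callable[[], Feature],
                           threshold: float) -> Optional[Tuple[Feature, Feature]]:
    """Pick the pair of utterance features to store for one phone number.

    A third utterance is requested only when the first two disagree.
    Returns None when no two utterances agree.
    """
    if l1_distance(first, second) <= threshold:
        return first, second
    third = get_third()
    if l1_distance(first, third) <= threshold:
        return first, third
    if l1_distance(second, third) <= threshold:
        return second, third
    return None


if __name__ == "__main__":
    pair = choose_enrollment_pair([1, 2], [9, 9], lambda: [1, 3], threshold=1.5)
    print(pair)  # ([1, 2], [1, 3])
```

Passing the third utterance as a callable mirrors the flow in the text, where the user is prompted again only if the first two utterances do not match.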

The voice recognizer 150 computes how similar the current utterance is to the entries stored in the registration database. Recognition is performed with one utterance or, if necessary, two.

FIG. 4 is a block diagram of an embodiment, according to the present invention, of the voice recognizer 150 shown in FIG. 1. It comprises a registration database comparator 410 and a recognition confirmation unit 420.

The registration database comparator 410 shown in FIG. 4 compares the input utterance with the voices registered in the database for similarity. It may further include an additional utterance database comparator (not shown) that, when the input utterance is not similar to any registered voice, takes one more input utterance and compares it with the registered voices again. The recognition confirmation unit 420 decides whether to place the call for the recognition candidate chosen by the registration database comparator 410. Even the top-ranked candidate is not dialed if the result is not reliable enough. To make this decision, the unit uses the current noise level, the difference in recognition score between the first and second candidates, and the similarity values among the candidates.
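The confirmation logic can be sketched as below. The scoring function and thresholds are not given in the text, so this sketch assumes L1 distance between feature vectors, an absolute distance limit, and a minimum margin between the best and second-best candidates. A caller could raise `min_margin` when the measured noise level is high. The phone numbers are placeholders and all names are hypothetical.

```python
from typing import Dict, Optional, Sequence

Feature = Sequence[float]


def l1_distance(a: Feature, b: Feature) -> float:
    """Sum of absolute differences; vectors of unequal length never match."""
    if len(a) != len(b):
        return float("inf")
    return sum(abs(p - q) for p, q in zip(a, b))


def recognize(query: Feature,
              database: Dict[str, Feature],
              max_distance: float,
              min_margin: float) -> Optional[str]:
    """Return the phone number to dial, or None if the result is unreliable."""
    ranked = sorted((l1_distance(query, feat), number)
                    for number, feat in database.items())
    if not ranked:
        return None
    best_distance, best_number = ranked[0]
    if best_distance > max_distance:
        return None  # even the best candidate is too far from the utterance
    if len(ranked) > 1 and ranked[1][0] - best_distance < min_margin:
        return None  # first and second candidates are too close to call
    return best_number


if __name__ == "__main__":
    db = {"010-1111": [3, 10, 12, 9], "010-2222": [5, 4, 4, 4]}
    print(recognize([3, 10, 11, 9], db, max_distance=5, min_margin=3))  # 010-1111
    print(recognize([4, 7, 8, 7], db, max_distance=20, min_margin=3))   # None
```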

The voice register 140 and the voice recognizer 150 extract features from the result of interpreting the vocoder packets.

For example, a CDMA terminal uses quantized LSP (Line Spectrum Pair) coefficients to build a recognition feature called a pseudo-cepstrum, and a GSM terminal uses LAR (Log Area Ratio) coefficients to build its recognition feature.

Before placing the call, the result output unit 160 plays back to the user the voice corresponding to the recognition result so that the user can confirm it. For this purpose, the voice register must store only the uttered portion in memory along with the recognition feature.

The voice dialing apparatus and method using features of the voice according to the present invention can add a voice dialing function to an existing mobile phone in software alone, without additional hardware, and can perform speech recognition even in environments with varied noise such as highways, so voice dialing is possible not only on the handset but also with a hands-free kit. A terminal can therefore be upgraded merely by adding memory for storing voice features, which makes the product more competitive. In addition, the voice can be recognized relatively easily using voice features such as the number of peaks and/or the peak level of the voice in the voice section.

Claims

A voice dialing apparatus using features of the voice, for dialing a mobile communication terminal by voice without pressing a button, the apparatus comprising:

An input interface for extracting voice feature parameters from the frame-by-frame packet data output from the vocoder;

A speech section detector for detecting a speech section by using the speech feature parameter output from the input interface;

A feature extraction unit for extracting at least one of the number of peaks of the voice signal and the level of the peak in the detected voice section as a feature of the voice signal for voice recognition;

A voice register that registers features of the voice recognition voice signal extracted from the feature extractor into a database so as to correspond to each telephone number;

A voice recognition unit which, using the feature of the voice recognition voice signal extracted from the feature extractor, compares the voice uttered for voice dialing with the voice signals registered in the database and selects the closest voice signal; And

A result output unit for outputting a voice signal and a phone number corresponding to the voice recognition result.

A voice dialing method using features of the voice, for dialing a mobile communication terminal by voice, the method comprising:

Converting the packet stream signal output from the vocoder into an unpacking parameter stream signal;

Detecting a section of the unpacking parameter stream signal;

Detecting at least one of the number of peaks of the voice signal present in the detected signal and the level of the peak as a feature of the voice signal;

A voice registration step of storing the feature of the voice signal in a memory so as to correspond to each telephone number;

A voice recognition step of selecting, from the registered voice signals, the voice signal closest to the uttered voice; And

Outputting a voice signal and a phone number corresponding to the voice recognition result.