KR19980045013A

KR19980045013A - Method for Improving Speaker Recognizer Performance Through Password Input

Info

Publication number: KR19980045013A
Application number: KR1019960063170A
Authority: KR
Inventors: 안영목
Original assignee: 양승택; 한국전자통신연구원
Priority date: 1996-12-09
Filing date: 1996-12-09
Publication date: 1998-09-15

Abstract

The present invention relates to speaker recognition technology, which determines whose voice an utterance belongs to. It discloses a method for improving the performance of a speaker recognizer through password input: a password verification step reinforces the prior art, which approves or rejects a user as the identified speaker using only the threshold value for a registered speaker and is therefore vulnerable to impersonators.

Description

Method for Improving Speaker Recognizer Performance Through Password Input

The present invention relates to speaker recognition technology, which determines whose voice an utterance belongs to. More specifically, it relates to a method for improving the performance of a speaker recognizer through password input, in which a password verification step reinforces the prior art, which approves or rejects a user as the identified speaker using only the threshold value for a registered speaker and is therefore vulnerable to impersonators. That is, the present invention relates to improving both the security performance and the speaker verification performance of a speaker recognition system.

Speaker recognition is the technology of analyzing a user's utterance to determine who that person is. It is divided into speaker identification and speaker verification. Speaker identification analyzes the speech input to the speaker recognition system and finds, among the speakers registered in the system, the one whose voice is closest to the current user's. Speaker verification analyzes the input speech and determines whether the current user is a speaker registered in the system. Depending on the characteristics of the system, the input speech is either text-dependent or text-independent. Text-dependent means that the training speech and the recognition speech have the same content: the sentence, word, or phrase uttered for training is uttered again at the recognition stage. Text-independent means that the training speech and the recognition speech differ: the sentence, word, or phrase used for training need not be uttered again at the recognition stage.

A conventional speaker recognition system based on a hidden Markov model, vector quantization, or dynamic time warping algorithm compares the user's utterance with the reference patterns of the registered speakers and calculates an occurrence value (a similarity score) of the input speech for each speaker. Based on these occurrence values it determines the speaker closest to the input speech, and it approves the user as that speaker only when the occurrence value is higher than the threshold value stored for that speaker. If only the threshold value is used for the final approval or rejection, an intruder who records or imitates the user's voice is likely to succeed in breaking into the system. That is, because the prior art uses only the threshold value of the registered user, it has no effective way to stop an intruder or impersonator who approaches with a recording or imitation of the user's voice.

Moreover, to identify and verify whose voice the input speech is, a speaker recognizer based on a hidden Markov model, vector quantization, or dynamic time warping first extracts speech features and then compares them with the reference patterns to find which registered speaker the input speech is closest to. It then compares the occurrence value calculated from the input speech with the threshold value for the speaker found; if the occurrence value is higher than the threshold, it approves the current user as that speaker, and otherwise it rejects the user. Consequently, when an intruder or impersonator approaches the system with a recording or imitation of the user's voice, the occurrence value of the input speech against the reference pattern comes close to the threshold used for approval and rejection, and the impersonator cannot be reliably stopped. A speaker recognition system must not only verify the current user well but also keep out people who try to access it covertly. In a conventional speaker recognition system, speaker identification of the input speech precedes speaker verification. Because identification relies solely on the acoustic similarity between each speaker's reference pattern and the input speech, errors can arise depending on the performance of the speaker identifier, so another means of ensuring the identifier's accuracy is required. Likewise, because the verification stage approves or rejects using only the threshold value for the speaker, a measure for ensuring the performance of the speaker verifier is also needed.

Accordingly, an object of the present invention is to provide a method for improving the performance of a speaker recognizer through password input, in which a password verification step reinforces the prior art, which approves or rejects a user as the identified speaker using only the threshold value for a registered speaker and is therefore vulnerable to impersonators.

To achieve the above object, the present invention is characterized by comprising, in a speaker recognition system: a step of sequentially performing speech input, A/D conversion, speech feature extraction, reference pattern comparison, and speaker identification on the utterance of the system user; a step of verifying, in a password verifier, the password that the user registered in advance in a password register; and a step of determining, in a speaker verifier, whether to approve or reject the current user by comparison with the threshold value for the identified speaker.

FIG. 1 is a block diagram of hardware to which the present invention is applied.

FIG. 2 is a process flow diagram of a conventional speaker recognizer.

FIG. 3 is a process flow diagram of a speaker recognizer according to the present invention.

* Explanation of reference numerals for the main parts of the drawings *

11: speech input device

12: A/D conversion device

13: storage device

14: central processing unit

15: recognition result output device

The present invention is characterized in that, in a speaker recognition system, a password verifier is placed between the speaker identifier and the speaker verifier, both to ensure identification performance by rechecking the speaker identifier's result and to block access by an impersonator using a recording or the like. The password verifier improves the identification performance of the system by determining whether the current user's utterance matches the stored password, because not only the acoustic similarity to the speaker but also the content of the utterance, that is, the password, must match. Because the user can freely register the password held by the password register after being approved by the system, access by an impersonator who has recorded the password can be prevented: even if an intruder secretly records the user's current utterance, the recording cannot be used at the next use of the system.
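The replay argument in the preceding paragraph can be illustrated with a small sketch. It assumes the spoken password has already been turned into text by some upstream recognizer and it ignores the acoustic check entirely (a recording would pass that check anyway). The `PasswordRegister` class and the sample passwords are hypothetical and are not taken from the patent.

```python
class PasswordRegister:
    """Hypothetical helper: holds one current password per registered speaker."""

    def __init__(self):
        self._current = {}

    def register(self, speaker, password):
        # Overwrites any earlier password, so only the latest one is valid.
        self._current[speaker] = password

    def verify(self, speaker, spoken_text):
        # An unknown speaker never verifies.
        return speaker in self._current and self._current[speaker] == spoken_text


register = PasswordRegister()
register.register("alice", "open sesame")

# Legitimate session: the password matches, then the user registers a new one.
session_ok = register.verify("alice", "open sesame")
register.register("alice", "blue river")

# Replay of a recording of that session: the old password no longer matches.
replay_ok = register.verify("alice", "open sesame")
print(session_ok, replay_ok)  # True False
```

The protection in this sketch comes only from rotating the password after each approved session; a recording of the newest password would still have to be obtained before the next use.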

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of hardware to which the present invention is applied. When a user speaks to the computer to request speaker recognition, the speech passes through the speech input device (11) and the A/D conversion device (12) inside the computer, where it is converted from an analog signal into a digital signal. The central processing unit (14) extracts speech feature vectors from the digital speech data, compares them with the reference patterns of the speakers stored in the storage device (13) to find the most similar speaker, performs the threshold-based verification step, and outputs the approval or rejection result through the output device (15).

FIG. 2 is a process flow diagram of a conventional speaker recognizer. The flow is divided into two main stages: a training stage and a recognition stage (29).

The training stage is described first. When the user inputs training speech for forming a reference pattern (20), the speech is A/D converted (21) and speech features are extracted from it (22). From the extracted speech feature vectors, the reference pattern builder (23) creates a reference pattern (24) for each speaker, and the threshold builder (25) uses the reference pattern to set a threshold value (26) for each speaker.
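The training stage can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the patent names hidden Markov models, vector quantization, and dynamic time warping but fixes no particular pattern type or threshold rule. Here the reference pattern is assumed to be the mean feature vector, the occurrence value is assumed to be the negative Euclidean distance, and the threshold is assumed to be the worst training score minus a margin. All function names and the sample data are hypothetical.

```python
import math


def mean_vector(vectors):
    """Reference pattern builder (assumed): average the training feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def occurrence_value(features, pattern):
    """Assumed score: negative Euclidean distance, so higher means closer."""
    return -math.dist(features, pattern)


def build_threshold(vectors, pattern, margin=0.5):
    """Threshold builder (assumed rule): worst training score minus a margin."""
    return min(occurrence_value(v, pattern) for v in vectors) - margin


def train(training_data):
    """training_data maps a speaker name to a list of feature vectors."""
    model = {}
    for speaker, vectors in training_data.items():
        pattern = mean_vector(vectors)
        model[speaker] = {
            "pattern": pattern,
            "threshold": build_threshold(vectors, pattern),
        }
    return model


model = train({
    "alice": [[1.0, 2.0], [1.0, 4.0]],
    "bob": [[5.0, 5.0], [7.0, 5.0]],
})
print(model["alice"])  # {'pattern': [1.0, 3.0], 'threshold': -1.5}
```

A larger margin makes the system more tolerant of variation in the true speaker's voice and, correspondingly, easier for an impersonator to pass.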

The recognition stage proceeds as follows. Speech features are first extracted through the same process as in the training stage. The reference pattern comparator (27) compares the extracted speech feature vectors with the reference pattern of each speaker. Using the resulting occurrence values for the speakers, the speaker identifier (28) selects the speaker with the maximum occurrence value. To approve or reject the speaker that the identifier found closest to the input speech, the speaker verifier (30) compares the current occurrence value with that speaker's threshold value; it outputs an approval signal if the occurrence value is greater than the threshold and a rejection signal if it is smaller (33).
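The conventional recognition stage described above reduces to "pick the best-scoring speaker, then compare that score with the speaker's threshold". A minimal sketch follows, assuming a negative Euclidean distance as the occurrence value and a hand-written model dictionary; both are illustrative choices, and the function names are hypothetical.

```python
import math


def occurrence_value(features, pattern):
    """Assumed score: negative Euclidean distance, so higher means closer."""
    return -math.dist(features, pattern)


def identify(features, model):
    """Speaker identifier: the registered speaker with the maximum score."""
    scores = {name: occurrence_value(features, entry["pattern"])
              for name, entry in model.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def recognize_conventional(features, model):
    """Identification followed by threshold-only verification."""
    speaker, score = identify(features, model)
    # Speaker verifier: approve only if the score exceeds the stored threshold.
    return speaker, score > model[speaker]["threshold"]


# Hypothetical model: one reference pattern and one threshold per speaker.
model = {
    "alice": {"pattern": [1.0, 3.0], "threshold": -1.5},
    "bob": {"pattern": [6.0, 5.0], "threshold": -1.5},
}

print(recognize_conventional([1.2, 3.1], model))  # ('alice', True)
print(recognize_conventional([3.0, 4.0], model))  # ('alice', False)
```

Any input whose features land close enough to a stored pattern is approved, which is exactly why a recording or a good imitation of the registered voice passes this scheme.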

FIG. 3 is a process flow diagram of a speaker recognizer according to the present invention, to which a password verifier and a password register are added. The flow is described in two stages: a training stage and a recognition stage (44).

The training stage is described first. It consists of speech input (30), A/D conversion (31), speech feature extraction (32), and creation of a reference pattern (34) by the reference pattern builder (33); the threshold builder (35) then uses the reference pattern to set a threshold value (36) for each speaker.

The recognition stage proceeds as follows. Speech features are first extracted through the same process as in the training stage. The reference pattern comparator (37) compares the extracted speech feature vectors with the reference pattern of each speaker. Using the resulting occurrence values for the speakers, the speaker identifier (38) selects the speaker with the maximum occurrence value. Using the reference pattern of the speaker chosen by the speaker identifier, the currently registered password (42) is compared with the content of the input speech to check whether they are the same (39). To approve or reject that speaker (45), the speaker verifier (40) compares the current occurrence value with the speaker's threshold value. If the occurrence value is smaller than the threshold or the password does not match, a rejection signal is output (43); if the password matches and the occurrence value is greater than the threshold, the password register (41) receives and registers the password to be used next time, and an approval signal is then output.
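The decision logic of this recognition stage can be sketched as below. The sketch makes several assumptions that the patent does not state: the content of the utterance is assumed to arrive as text (`spoken_text`) from some upstream recognizer, the occurrence value is assumed to be a negative Euclidean distance, and the next password is passed in up front rather than requested after approval. The function name, dictionary layout, and sample passwords are hypothetical.

```python
import math


def occurrence_value(features, pattern):
    """Assumed score: negative Euclidean distance, so higher means closer."""
    return -math.dist(features, pattern)


def recognize_with_password(features, spoken_text, next_password, model):
    """Return (speaker, approved); on approval the stored password is replaced."""
    # Speaker identifier: maximum occurrence value over the registered speakers.
    scores = {name: occurrence_value(features, entry["pattern"])
              for name, entry in model.items()}
    speaker = max(scores, key=scores.get)
    entry = model[speaker]

    # Password verifier: the utterance content must equal the stored password.
    password_ok = spoken_text == entry["password"]
    # Speaker verifier: the score must exceed the threshold (a tie is rejected here).
    score_ok = scores[speaker] > entry["threshold"]
    if not (password_ok and score_ok):
        return speaker, False

    # Password register: store the password to be used next time, then approve.
    entry["password"] = next_password
    return speaker, True


model = {
    "alice": {"pattern": [1.0, 3.0], "threshold": -1.5, "password": "open sesame"},
    "bob": {"pattern": [6.0, 5.0], "threshold": -1.5, "password": "red kite"},
}

wrong_password = recognize_with_password([1.2, 3.1], "wrong words", "blue river", model)
poor_score = recognize_with_password([3.0, 4.0], "open sesame", "blue river", model)
both_ok = recognize_with_password([1.2, 3.1], "open sesame", "blue river", model)
print(wrong_password, poor_score, both_ok)
# ('alice', False) ('alice', False) ('alice', True)
```

Only the third call changes the stored password, so a rejected attempt leaves the registered password intact for the legitimate user.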

The present invention, configured and operating as described above, has the following effects.

First, providing a password verification step in speaker recognition prevents errors by the speaker identifier.

Second, adding password-based speaker verification to the approval and rejection of the current system user improves verification performance.

Third, because the user can freely choose the password to be used next, access by an impersonator through eavesdropping, recording, and the like is prevented, which improves the security performance of the system.

Claims

In a speaker recognition system, sequentially performing speech input, A/D conversion, speech feature extraction, reference pattern comparison, and speaker identification on the utterance of a system user;

Verifying, in a password verifier, a password that the user registered in advance in a password register;

And determining, in a speaker verifier, whether to approve or reject the current user by comparison with a threshold value for the identified speaker.