KR20200100332A

KR20200100332A - Speech recognition device, method and computer program for updating speaker model

Info

Publication number: KR20200100332A
Application number: KR1020190018463A
Authority: KR
Inventors: 이가희
Original assignee: 주식회사 케이티
Priority date: 2019-02-18
Filing date: 2019-02-18
Publication date: 2020-08-26

Abstract

A device for updating a speaker model comprises: a voice input unit receiving a voice uttered from a user; an extraction unit extracting a feature vector value from the received voice; a similarity calculation unit calculating similarity by comparing the extracted feature vector value and a plurality of previously registered speaker models; a speaker model selection unit selecting one of the plurality of previously registered speaker models based on the calculated similarity; and a speaker model update unit updating the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model. Therefore, the speaker model may be automatically updated.

Description

Speech recognition device, method, and computer program for updating speaker models {SPEECH RECOGNITION DEVICE, METHOD AND COMPUTER PROGRAM FOR UPDATING SPEAKER MODEL}

The present invention relates to a speech recognition apparatus, a method, and a computer program for updating a speaker model.

An intelligent personal assistant is a software agent that handles tasks requested by a user and provides services specialized for that user. Based on an artificial intelligence (AI) engine and speech recognition, an intelligent personal assistant collects and provides information customized for the user and performs functions such as schedule management, sending email, and restaurant reservation in response to the user's voice commands, which improves user convenience.

Such intelligent personal assistants are mainly provided as personalized services on smartphones; representative examples include Apple's Siri, Google's Now, and Samsung's Bixby. In this regard, Korean Patent Application Publication No. 2016-0071111, which is prior art, discloses a method of providing a personal assistant service in an electronic device.

Recently, it has become possible for each user to register as a speaker with an intelligent personal assistant and thereby receive services customized for that user. However, the process of registering the user's voice with the intelligent personal assistant takes only a very short time. As a result, speaker recognition performance degrades as time passes after the voice was registered.

An object of the present invention is to provide a speech recognition apparatus, a method, and a computer program for updating a speaker model that, when a speaker is not recognized, automatically update the speaker model without requiring the previously registered speaker model to be re-registered.

Another object is to provide a speech recognition apparatus, a method, and a computer program for updating a speaker model that update the speaker model flexibly by using weights to adjust the relative importance of the speaker model and the uttered voice input from the user.

Another object is to provide a speech recognition apparatus, a method, and a computer program for updating a speaker model that, when the speaker model has been built from a small amount of data, supplement the speaker model by additionally using the voice input from the user.

However, the technical problems to be solved by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above technical objects, an embodiment of the present invention may provide a speech recognition apparatus including: a voice input unit that receives a voice uttered by a user; an extraction unit that extracts a feature vector value from the received voice; a similarity calculation unit that calculates a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models; a speaker model selection unit that selects one of the plurality of previously registered speaker models based on the calculated similarity; and a speaker model update unit that updates the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

Another embodiment of the present invention may provide a speaker model update method including: receiving a voice uttered by a user; extracting a feature vector value from the received voice; calculating a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models; selecting one of the plurality of previously registered speaker models based on the calculated similarity; and updating the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

Yet another embodiment of the present invention may provide a computer program stored on a medium, the computer program including a sequence of instructions that, when executed by a computing device, cause the computing device to: receive a voice uttered by a user; extract a feature vector value from the received voice; calculate a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models; select one of the plurality of previously registered speaker models based on the calculated similarity; and update the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

The above means for solving the problems are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, additional embodiments may be described in the drawings and the detailed description of the invention.

According to any one of the above means for solving the problems, it is possible to provide a speech recognition apparatus, a method, and a computer program for updating a speaker model that, when a speaker is not recognized, automatically update the speaker model without requiring the previously registered speaker model to be re-registered.

It is also possible to provide a speech recognition apparatus, a method, and a computer program for updating a speaker model that update the speaker model flexibly by using weights to adjust the relative importance of the speaker model and the uttered voice input from the user.

It is also possible to provide a speech recognition apparatus, a method, and a computer program for updating a speaker model that, when the speaker model has been built from a small amount of data, supplement the speaker model by additionally using the voice input from the user.

FIG. 1 is a block diagram of a speech recognition apparatus for updating a speaker model according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram illustrating an environmental factor according to an embodiment of the present invention.
FIG. 3 is a flowchart of a method of updating a speaker model, based on whether the update is to be performed, in a speech recognition apparatus for updating a speaker model according to an embodiment of the present invention.
FIG. 4 is a flowchart of a method of updating a speaker model in a speech recognition apparatus for updating a speaker model according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. However, the present invention may be implemented in various different forms and is not limited to the embodiments described herein. In the drawings, parts unrelated to the description are omitted in order to describe the present invention clearly, and similar reference numerals are assigned to similar parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only being "directly connected" but also being "electrically connected" with another element in between. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that this does not preclude the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In this specification, a "unit" includes a unit implemented by hardware, a unit implemented by software, and a unit implemented using both. One unit may be implemented using two or more pieces of hardware, and two or more units may be implemented by one piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a speech recognition apparatus for updating a speaker model according to an embodiment of the present invention. Referring to FIG. 1, the speech recognition apparatus 100 for updating a speaker model may include a voice input unit 110, a vector extraction unit 120, a similarity calculation unit 130, a speaker model selection unit 140, an update determination unit 150, and a speaker model update unit 160.

The voice input unit 110 may receive a voice uttered by a user.

The extraction unit 120 may extract a feature vector value from the received voice. For example, the extraction unit 120 may generate a plurality of frames from the received voice and extract feature vector values from the generated frames.

The extraction unit 120 may select valid voice frames based on the extracted feature vector values and extract an i-vector from the selected valid voice frames. An i-vector is a feature vector that expresses, as a small vector, the various kinds of variability in a Gaussian mixture model (GMM) that represents the distribution of frame-level speech features, such as the frequency characteristics at a specific time or Mel-frequency cepstral coefficients (MFCCs), in order to recognize a speaker from the input voice. The i-vector is widely used in speaker recognition: it expresses the variability caused by the speaker, the recording conditions, noise, and so on present in the input voice as a low-dimensional vector, and it shows high performance in the field of speaker recognition.
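The framing and valid-frame selection described above might be sketched as follows. The description does not say how valid frames are chosen, so the energy-based rule here is an assumption, as are the function names, the frame length and hop (25 ms and 10 ms at an assumed 16 kHz sampling rate), and the `margin` value. Extracting the i-vector itself requires a trained model and is not shown.

```python
import math


def split_into_frames(samples, frame_len=400, hop=160):
    """Split a sample list into overlapping frames (sizes are assumed values)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames


def log_energy(frame):
    """Log of the mean squared amplitude; the constant avoids log(0)."""
    return math.log(sum(x * x for x in frame) / len(frame) + 1e-12)


def select_valid_frames(frames, margin=5.0):
    """Keep frames whose log energy is within `margin` of the loudest frame.

    This energy rule is a hypothetical stand-in for the selection step.
    """
    energies = [log_energy(f) for f in frames]
    if not energies:
        return []
    cutoff = max(energies) - margin
    return [f for f, e in zip(frames, energies) if e >= cutoff]


# Toy signal: silence followed by a constant-amplitude segment.
samples = [0.0] * 800 + [1.0] * 800
frames = split_into_frames(samples)
valid = select_valid_frames(frames)
```

In this toy input the frames that lie entirely in the silent segment are discarded, and only frames that overlap the louder segment would be passed on to i-vector extraction.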

The similarity calculation unit 130 may calculate a similarity by comparing the extracted feature vector value with the plurality of previously registered speaker models. Here, the similarity calculation unit 130 may calculate the similarity by comparing the extracted i-vector with the i-vectors of the plurality of previously registered speaker models. For example, the similarity calculation unit 130 may compare the extracted i-vector with the i-vectors of the previously registered speaker models using probabilistic linear discriminant analysis (PLDA), within-class covariance normalization (WCCN), a deep neural network (DNN), or a combination of these, together with a score normalization scheme such as t-norm or z-norm.
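As a rough illustration of the score normalization mentioned above, z-norm and t-norm both shift and scale a raw score by the mean and standard deviation of a set of impostor scores. They differ in where those scores come from: z-norm scores the speaker model against impostor utterances, while t-norm scores the test utterance against a cohort of impostor models. The function name and the choice of population standard deviation are assumptions of this sketch.

```python
import statistics


def normalize_score(raw_score, impostor_scores):
    """Shift and scale a raw similarity score by impostor score statistics.

    For z-norm, pass scores of the speaker model against impostor utterances.
    For t-norm, pass scores of the test utterance against impostor models.
    """
    mu = statistics.mean(impostor_scores)
    sigma = statistics.pstdev(impostor_scores)
    if sigma == 0:
        # Degenerate cohort: only the shift can be applied.
        return raw_score - mu
    return (raw_score - mu) / sigma


normalized = normalize_score(3.0, [1.0, 2.0, 3.0])
```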

Using Equation 1 below, the similarity calculation unit 130 may calculate the similarity based on the likelihood that the extracted i-vector and the i-vector of each of the previously registered speaker models belong to the same speaker and the likelihood that they belong to different speakers.

Referring to Equation 1, λ denotes the similarity. It is computed from two terms: the likelihood that the extracted i-vector and the i-vector of a previously registered speaker model belong to the same speaker, and the likelihood that they belong to different speakers. The similarity (λ) is derived by applying a logarithm to these two likelihoods (probabilities), and a positive value may be obtained when the extracted i-vector and the i-vector of the previously registered speaker model are judged to belong to the same speaker.
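Equation 1 is not reproduced in this text. One form consistent with the description (a logarithm of the two likelihoods that is positive when the same-speaker hypothesis is more likely) is the log-likelihood ratio commonly used for PLDA scoring. The symbols below are assumptions introduced for illustration: ω_input for the extracted i-vector, ω_spk for the i-vector of a registered speaker model, and H_same and H_diff for the two hypotheses.

```latex
\lambda = \log \frac{p(\omega_{\mathrm{input}}, \omega_{\mathrm{spk}} \mid H_{\mathrm{same}})}
                    {p(\omega_{\mathrm{input}}, \omega_{\mathrm{spk}} \mid H_{\mathrm{diff}})}
```

Under this form, λ > 0 exactly when the same-speaker likelihood exceeds the different-speaker likelihood.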

The speaker model selection unit 140 may select one of the plurality of previously registered speaker models based on the calculated similarities. For example, the speaker model selection unit 140 may rank the calculated similarities and select, from among the previously registered speaker models, the speaker model with the highest similarity.
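Selecting the model with the highest similarity amounts to taking an argmax over the per-model scores. A minimal sketch, in which the dictionary layout and speaker identifiers are hypothetical:

```python
def select_speaker_model(similarities):
    """Return (speaker_id, similarity) for the highest-scoring registered model.

    `similarities` maps a speaker id to the similarity computed for that model.
    """
    if not similarities:
        raise ValueError("no registered speaker models")
    best_id = max(similarities, key=similarities.get)
    return best_id, similarities[best_id]


selected = select_speaker_model({"speaker_a": -1.2, "speaker_b": 3.4, "speaker_c": 0.7})
```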

The update determination unit 150 may determine, in two stages, whether to update the selected speaker model. In the first stage, the update determination unit 150 may use Equation 2 to determine whether to update the selected speaker model.

Referring to Equation 2, the update determination unit 150 may determine whether to update the selected speaker model based on whether the similarity (λ) between the extracted i-vector and the selected speaker model exceeds a speaker threshold (TH_spk). If the similarity (λ) exceeds the speaker threshold (TH_spk), the determination proceeds to the second stage; if the similarity (λ) does not exceed the speaker threshold (TH_spk), the selected speaker model may not be updated.

In the second stage, the update determination unit 150 may use Equation 3 to determine whether to update the selected speaker model.

The update determination unit 150 may determine whether to update the selected speaker model based on whether the likelihood that the extracted i-vector and the i-vector of the selected speaker model belong to the same speaker exceeds an update threshold (TH_dec). In this way, whether to update the speaker model may be decided by determining whether the extracted i-vector is generated from the selected speaker model.

The same-speaker likelihood already used in calculating the similarity is reused here for the following reason. When the similarity is calculated, even if the different-speaker likelihood is very small (for example, 0.0001) and the same-speaker likelihood is small (for example, 0.01), the similarity is still calculated as high (λ = 100), so the speaker threshold (TH_spk) may be passed even though the likelihood of being the same speaker is low.
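The two-stage decision can be sketched as below, using the likelihood values from the example above. The function names, the natural logarithm, and the threshold values are assumptions; the description gives no concrete thresholds. With 0.01 and 0.0001 the ratio of the likelihoods is 100 (about 4.6 after a natural logarithm), so the first stage passes, while the second stage rejects the update because the same-speaker likelihood itself is low.

```python
import math


def log_likelihood_ratio(p_same, p_diff):
    """Similarity as the log of the ratio of the two likelihoods (assumed form)."""
    return math.log(p_same / p_diff)


def should_update(p_same, p_diff, th_spk, th_dec):
    """Two-stage gate: relative similarity first, then absolute likelihood."""
    similarity = log_likelihood_ratio(p_same, p_diff)
    # Stage 1: the similarity must exceed the speaker threshold.
    if similarity <= th_spk:
        return False
    # Stage 2: the same-speaker likelihood must exceed the update threshold.
    return p_same > th_dec


# Hypothetical thresholds chosen only for illustration.
TH_SPK = 0.0
TH_DEC = 0.5

weak_match = should_update(0.01, 0.0001, TH_SPK, TH_DEC)
strong_match = should_update(0.9, 0.001, TH_SPK, TH_DEC)
```

The second stage exists precisely because a ratio can be large when both likelihoods are small; checking the same-speaker likelihood on its own filters out such cases.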

The speaker model update unit 160 may use Equation 4 to update the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model. Here, the speaker model update unit 160 may update the selected speaker model by setting, as the weights (α, β), at least one of the likelihood that the extracted i-vector and the i-vector of the selected speaker model belong to the same speaker and an environmental factor.

Referring to Equation 4, the equation involves two quantities: the speaker model to be updated and the i-vector extracted from the input voice.

The speaker model update unit 160 may update the selected speaker model using, as the weight (α), the likelihood that the extracted i-vector and the i-vector of the selected speaker model belong to the same speaker. In addition, the speaker model update unit 160 may update the selected speaker model using, as the weight (β), an environmental factor that controls variables other than the voice. The environmental factor is described in detail with reference to FIG. 2.
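Equation 4 is not reproduced in this text, so its exact form is unknown. Purely as an illustration of how two weights could blend a stored model with a new i-vector, the sketch below assumes a convex interpolation whose step size is the product α·β. This is a guess at a plausible form, not the formula of the source, and the function name is hypothetical.

```python
def update_speaker_model(model_vec, input_vec, alpha, beta):
    """Blend the stored model vector toward the input vector.

    ASSUMPTION: step = alpha * beta, clipped to [0, 1]. The actual Equation 4
    may combine the weights differently.
    """
    if len(model_vec) != len(input_vec):
        raise ValueError("vectors must have the same dimension")
    step = max(0.0, min(1.0, alpha * beta))
    return [(1.0 - step) * m + step * x for m, x in zip(model_vec, input_vec)]


# alpha: same-speaker likelihood; beta: environmental weight.
half_step = update_speaker_model([1.0, 0.0], [0.0, 1.0], alpha=0.5, beta=1.0)
no_step = update_speaker_model([1.0, 0.0], [0.0, 1.0], alpha=0.9, beta=0.0)
```

Under this assumed form, a β of 0 leaves the stored model unchanged regardless of α, which matches the stated intent of suppressing updates from noisy input.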

FIG. 2 is an exemplary diagram illustrating an environmental factor set as a weight according to an embodiment of the present invention. Referring to FIG. 2, the environmental factor may consist of a signal-to-noise ratio (SNR) 200 and a weight (β) 210. Here, the weight (β) 210 may indicate the reliability of the input voice, that is, whether the voice was input in a quiet environment.

The speaker model update unit 160 may estimate the SNR 200 from the input voice and the noise contained in it, and may apply a different weight (β) 210 depending on the range in which the SNR 200 falls. The noisier the environment, the closer to 0 the value used, so that the speaker model is not updated on the basis of that input voice.
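A range-based mapping from SNR to β might look like the sketch below. The SNR ranges and β values shown in FIG. 2 are not available in this text, so the breakpoints (5 dB and 15 dB) and the weights are hypothetical placeholders, as are the function names.

```python
import math


def estimate_snr_db(signal_power, noise_power):
    """SNR in decibels from signal and noise power estimates."""
    if signal_power <= 0 or noise_power <= 0:
        raise ValueError("powers must be positive")
    return 10.0 * math.log10(signal_power / noise_power)


def beta_from_snr(snr_db):
    """Map an SNR range to the weight beta (ranges and values are hypothetical)."""
    if snr_db < 5.0:
        return 0.0   # noisy input: effectively block the update
    if snr_db < 15.0:
        return 0.5   # moderate noise: partial trust
    return 1.0       # quiet environment: full trust


beta = beta_from_snr(estimate_snr_db(100.0, 1.0))
```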

The speech recognition apparatus 100 for updating a speaker model may be implemented by a computer program that is stored on a medium and includes a sequence of instructions for updating a speaker model. The computer program may include a sequence of instructions that, when executed by a computing device, cause the computing device to receive a voice uttered by a user, extract a feature vector value from the received voice, calculate a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models, select one of the plurality of previously registered speaker models based on the calculated similarity, and update the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

FIG. 3 is a flowchart of a method of updating a speaker model, based on whether the update is to be performed, in a speech recognition apparatus for updating a speaker model according to an embodiment of the present invention. The method shown in FIG. 3 includes steps that are processed in time series by the speech recognition apparatus 100 according to the embodiments shown in FIGS. 1 and 2. Therefore, even where it is omitted below, the description already given of the speech recognition apparatus 100 according to the embodiments shown in FIGS. 1 and 2 also applies to this method.

In step S310, the speech recognition apparatus 100 may receive a voice uttered by a user.

In step S320, the speech recognition apparatus 100 may generate a plurality of frames from the received voice and extract feature vector values from the generated frames.

In step S330, the speech recognition apparatus 100 may select valid voice frames based on the extracted feature vector values and extract an i-vector from the selected valid voice frames.

In step S340, the speech recognition apparatus 100 may calculate similarities by comparing the extracted i-vector with the i-vectors of the plurality of previously registered speaker models, and may select one of the previously registered speaker models based on the calculated similarities.

In step S350, the speech recognition apparatus 100 may determine whether the similarity between the extracted i-vector and the i-vector of the selected speaker model exceeds the speaker threshold.

For example, if the similarity between the extracted i-vector and the i-vector of the selected speaker model does not exceed the speaker threshold (S351), the speech recognition apparatus 100 may not update the speaker model. As another example, if the similarity exceeds the speaker threshold (S352), the speech recognition apparatus 100 may determine in step S360 whether the update threshold is exceeded.

In step S360, the speech recognition apparatus 100 may determine whether the likelihood that the extracted i-vector and the i-vector of the selected speaker model belong to the same speaker is equal to or greater than the update threshold.

For example, if the likelihood that the extracted i-vector and the i-vector of the selected speaker model belong to the same speaker does not exceed the update threshold (S361), the speech recognition apparatus 100 may not update the speaker model. As another example, if that likelihood exceeds the update threshold (S362), the speech recognition apparatus 100 may update the selected speaker model in step S370.

In the above description, steps S310 to S370 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
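The branch structure of steps S340 to S370 can be traced with a small driver that reports the step label at which the flow ends. Everything here is a sketch under assumptions: the scorer and the updater are passed in as callables, and `toy_likelihood` and `toy_update` are toy stand-ins (not a trained PLDA scorer and not Equation 4) included only so that the example runs.

```python
import math


def run_update_flow(input_vec, models, likelihood_fn, update_fn, th_spk, th_dec):
    """Walk steps S340 to S370 for one utterance.

    models        : dict mapping speaker id -> stored vector (mutated on update)
    likelihood_fn : hypothetical scorer returning (p_same, p_diff)
    update_fn     : hypothetical updater returning the new stored vector
    Returns (selected speaker id, label of the step where the flow ended).
    """
    if not models:
        raise ValueError("no registered speaker models")
    # S340: score every registered model and select the most similar one.
    scored = {}
    for spk_id, model_vec in models.items():
        p_same, p_diff = likelihood_fn(input_vec, model_vec)
        scored[spk_id] = (math.log(p_same / p_diff), p_same)
    best_id = max(scored, key=lambda k: scored[k][0])
    similarity, p_same = scored[best_id]
    # S350: the similarity must exceed the speaker threshold.
    if similarity <= th_spk:
        return best_id, "S351"
    # S360: the same-speaker likelihood must exceed the update threshold.
    if p_same <= th_dec:
        return best_id, "S361"
    # S370: update the selected model.
    models[best_id] = update_fn(models[best_id], input_vec, p_same)
    return best_id, "S370"


def toy_likelihood(x, m):
    """Toy scorer: closer vectors get a higher same-speaker likelihood."""
    dist_sq = sum((a - b) ** 2 for a, b in zip(x, m))
    return math.exp(-dist_sq), 0.1


def toy_update(model_vec, x, weight):
    """Toy updater: move the stored vector toward the input by `weight`."""
    return [(1.0 - weight) * m + weight * xi for m, xi in zip(model_vec, x)]


models = {"a": [0.0, 0.0], "b": [1.0, 1.0]}
far = run_update_flow([5.0, 5.0], models, toy_likelihood, toy_update, 0.0, 0.5)
middling = run_update_flow([0.9, 0.0], models, toy_likelihood, toy_update, 0.0, 0.5)
close = run_update_flow([0.1, 0.0], models, toy_likelihood, toy_update, 0.0, 0.5)
```

Passing the scorer and updater as parameters keeps the control flow separate from any particular similarity measure or update rule.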

FIG. 4 is a flowchart of a method of updating a speaker model in a speech recognition apparatus for updating a speaker model according to an embodiment of the present invention. The method shown in FIG. 4 includes steps that are processed in time series by the speech recognition apparatus 100 according to the embodiments shown in FIGS. 1 to 3. Therefore, even where it is omitted below, the description already given of the speech recognition apparatus 100 according to the embodiments shown in FIGS. 1 to 3 also applies to this method.

In step S410, the speech recognition apparatus 100 may receive a voice uttered by a user.

In step S420, the speech recognition apparatus 100 may extract a feature vector value from the received voice.

In step S430, the speech recognition apparatus 100 may calculate a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models.

In step S440, the speech recognition apparatus 100 may select one of the plurality of previously registered speaker models based on the calculated similarity.

In step S450, the speech recognition apparatus 100 may update the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

In the above description, steps S410 to S450 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as necessary, and the order of the steps may be changed.
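For illustration only, the flow of steps S410 to S450 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: the feature extractor `extract_feature` is a toy stand-in (mean absolute amplitude per slice, not an I-vector), cosine similarity is an assumed similarity measure, and the names `process_utterance`, `threshold`, and `rate` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]


def extract_feature(samples: List[float], dim: int = 4) -> Vector:
    """S420 stand-in: mean absolute amplitude over `dim` slices (not an I-vector)."""
    step = max(1, len(samples) // dim)
    return [sum(abs(x) for x in samples[i * step:(i + 1) * step]) / step
            for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity, used here as an assumed similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def process_utterance(samples: List[float], models: Dict[str, Vector],
                      threshold: float = 0.8, rate: float = 0.1) -> str:
    """Run S420 to S450 on one received utterance (S410) and return the selected speaker."""
    feature = extract_feature(samples)                                # S420
    scores = {sid: cosine(feature, m) for sid, m in models.items()}   # S430
    best = max(scores, key=scores.get)                                # S440
    if scores[best] > threshold:                                      # S450
        w = rate * scores[best]  # hypothetical similarity-scaled step size
        models[best] = [(1 - w) * m + w * f
                        for m, f in zip(models[best], feature)]
    return best
```

In this sketch the selected model is moved toward the new feature vector only when the similarity exceeds `threshold`; all other registered models are left unchanged.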

The method of updating a speaker model in the speech recognition apparatus described with reference to FIGS. 1 to 4 may also be implemented in the form of a computer program stored in a medium and executed by a computer, or in the form of a recording medium containing instructions executable by a computer.

A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. In addition, the computer-readable medium may include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The above description of the present invention is provided for illustrative purposes, and those of ordinary skill in the art to which the present invention pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present invention. Therefore, the embodiments described above should be understood as illustrative and non-limiting in all respects. For example, each component described as a single unit may be implemented in a distributed manner, and likewise, components described as distributed may be implemented in a combined form.

The scope of the present invention is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be interpreted as falling within the scope of the present invention.

100: speech recognition device
110: voice input unit
120: extraction unit
130: similarity calculation unit
140: speaker model selection unit
150: update determination unit
160: speaker model update unit

Claims

A speech recognition device for updating a speaker model, comprising:
a voice signal input unit that receives a voice uttered by a user;
a vector extraction unit that extracts a feature vector value from the input voice;
a similarity calculation unit that calculates a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models;
a speaker model selection unit that selects one speaker model from among the plurality of previously registered speaker models based on the calculated similarity; and
a speaker model update unit that updates the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

The speech recognition device of claim 1,
wherein the vector extraction unit generates a plurality of frames from the input voice, extracts feature vector values from the generated plurality of frames, and selects a valid speech frame based on the extracted feature vector values.
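As a non-limiting illustration of frame generation and valid speech frame selection, the following Python sketch splits a signal into overlapping frames and keeps the frames whose log energy exceeds a floor. The use of log energy as the per-frame feature, the frame length and hop (400 and 160 samples), the threshold `min_log_energy`, and all function names are assumptions made for this sketch; the claim does not specify them.

```python
import math
from typing import List


def make_frames(samples: List[float], frame_len: int = 400,
                hop: int = 160) -> List[List[float]]:
    """Split the input into overlapping frames of `frame_len` samples."""
    return [samples[s:s + frame_len]
            for s in range(0, len(samples) - frame_len + 1, hop)]


def log_energy(frame: List[float]) -> float:
    """Per-frame feature (assumed): log of the mean squared amplitude."""
    return math.log(sum(x * x for x in frame) / len(frame) + 1e-10)


def select_valid_frames(frames: List[List[float]],
                        min_log_energy: float = -6.0) -> List[int]:
    """Return the indices of frames treated as valid speech."""
    return [i for i, f in enumerate(frames) if log_energy(f) > min_log_energy]
```

With 0.05 s of silence followed by 0.05 s of a 200 Hz tone at a 16 kHz sampling rate, only the frames that overlap the tone pass the energy floor.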

The speech recognition device of claim 2,
wherein the vector extraction unit extracts an I-vector from the selected valid speech frame.

The speech recognition device of claim 3,
wherein the similarity calculation unit calculates the similarity by comparing the extracted I-vector with the I-vectors of the plurality of previously registered speaker models.

The speech recognition device of claim 4,
wherein the similarity calculation unit calculates the similarity based on the likelihoods that the extracted I-vector and the I-vector of each of the plurality of previously registered speaker models belong to the same speaker and to different speakers, and
wherein the speaker model selection unit selects one speaker model from among the plurality of previously registered speaker models based on a priority of the calculated similarity.
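One way to picture a similarity built from same-speaker and different-speaker likelihoods is a log-likelihood ratio. The Python sketch below is an assumption-laden stand-in, not the scoring of the claimed device: it models the cosine score between two I-vectors as Gaussian under each hypothesis, with made-up parameters `same` and `diff`, and ranks the registered models by the resulting ratio. A full system would more likely use a trained scoring model; the function names here are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def log_gauss(x: float, mean: float, std: float) -> float:
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi * std * std) - (x - mean) ** 2 / (2 * std * std)


def llr(test: Vector, enrolled: Vector,
        same: Tuple[float, float] = (0.8, 0.15),
        diff: Tuple[float, float] = (0.1, 0.2)) -> float:
    """log p(score | same speaker) - log p(score | different speakers)."""
    s = cosine(test, enrolled)
    return log_gauss(s, *same) - log_gauss(s, *diff)


def rank_speakers(test: Vector, models: Dict[str, Vector]) -> List[Tuple[str, float]]:
    """Order registered models by similarity, highest first; the first entry is selected."""
    scored = [(sid, llr(test, vec)) for sid, vec in models.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

A positive ratio favors the same-speaker hypothesis and a negative ratio favors the different-speaker hypothesis, so sorting by the ratio gives a natural priority order for selection.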

The speech recognition device of claim 5, further comprising
an update determination unit that determines whether to update the selected speaker model based on whether the similarity between the extracted I-vector and the selected speaker model exceeds a speaker threshold.

The speech recognition device of claim 6,
wherein the update determination unit determines whether to update the selected speaker model based on whether a probability that the extracted I-vector and the I-vector of the selected speaker model belong to the same speaker exceeds an update threshold.
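The two-stage decision (a speaker threshold on the similarity, then an update threshold on the same-speaker probability) can be sketched as follows. This is only an illustrative sketch: it assumes the similarity is a log-likelihood ratio and converts it to a probability with a logistic function and a prior, and the threshold values and function names are hypothetical.

```python
import math


def same_speaker_probability(llr: float, prior: float = 0.5) -> float:
    """Posterior probability of the same-speaker hypothesis from a log-likelihood ratio."""
    x = llr + math.log(prior / (1.0 - prior))
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # numerically stable branch for negative x
    return e / (1.0 + e)


def should_update(similarity: float, speaker_threshold: float = 0.0,
                  update_threshold: float = 0.9, prior: float = 0.5) -> bool:
    """Update only if both the speaker threshold and the update threshold are exceeded."""
    if similarity <= speaker_threshold:
        return False  # not accepted as the selected speaker
    return same_speaker_probability(similarity, prior) > update_threshold
```

Using a stricter update threshold than the speaker threshold means an utterance can be accepted as the selected speaker without being trusted enough to modify that speaker's model.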

The speech recognition device of claim 7,
wherein the speaker model update unit updates the selected speaker model by using, as a weight, at least one of a probability that the extracted I-vector and the I-vector of the selected speaker model belong to the same speaker, and environment factors.
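A weighted update of this kind can be pictured as a convex combination of the stored I-vector and the newly extracted one. The sketch below is an assumption, not the update rule of the claimed device: the weight is taken as the product of a base rate, the same-speaker probability, and a single environment weight in the range 0 to 1 (for example, derived from signal quality), and the names `weighted_update`, `env_weight`, and `rate` are hypothetical.

```python
from typing import List

Vector = List[float]


def weighted_update(model: Vector, extracted: Vector, p_same: float,
                    env_weight: float = 1.0, rate: float = 0.5) -> Vector:
    """Move the stored vector toward the extracted one by a weight in [0, rate]."""
    if not (0.0 <= p_same <= 1.0 and 0.0 <= env_weight <= 1.0):
        raise ValueError("p_same and env_weight must lie in [0, 1]")
    w = rate * p_same * env_weight
    return [(1.0 - w) * m + w * x for m, x in zip(model, extracted)]
```

With this form, an utterance that is confidently the same speaker and recorded in a clean environment moves the model the most, while a doubtful or noisy utterance barely changes it.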

A method of updating a speaker model in a speech recognition device, the method comprising:
receiving a voice uttered by a user;
extracting a feature vector value from the received voice;
calculating a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models;
selecting one speaker model from among the plurality of previously registered speaker models based on the calculated similarity; and
updating the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.

The method of claim 9,
wherein the extracting of the feature vector value comprises:
generating a plurality of frames from the received voice;
extracting feature vector values from the generated plurality of frames; and
selecting a valid speech frame based on the extracted feature vector values.

The method of claim 10,
wherein the extracting of the feature vector value further comprises
extracting an I-vector from the selected valid speech frame.

The method of claim 11,
wherein the calculating of the similarity comprises
calculating the similarity by comparing the extracted I-vector with the I-vectors of the plurality of previously registered speaker models.

The method of claim 12,
wherein the calculating of the similarity comprises calculating the similarity based on the likelihoods that the extracted I-vector and the I-vector of each of the plurality of previously registered speaker models belong to the same speaker and to different speakers, and
wherein the selecting of the speaker model comprises selecting one speaker model from among the plurality of previously registered speaker models based on a priority of the calculated similarity.

The method of claim 13,
further comprising determining whether to update the selected speaker model based on whether the similarity between the extracted I-vector and the selected speaker model exceeds a speaker threshold.

The method of claim 14,
wherein the determining of whether to update the selected speaker model comprises
determining whether to update the selected speaker model based on whether a probability that the extracted I-vector and the I-vector of the selected speaker model belong to the same speaker exceeds an update threshold.

The method of claim 15,
wherein the updating of the selected speaker model comprises
updating the selected speaker model by using, as a weight, at least one of a probability that the extracted I-vector and the I-vector of the selected speaker model belong to the same speaker, and environment factors.

A computer program stored in a computer-readable medium and comprising a sequence of instructions for updating a speaker model,
wherein the computer program, when executed by a computing device, causes the computing device to:
receive a voice uttered by a user;
extract a feature vector value from the received voice;
calculate a similarity by comparing the extracted feature vector value with a plurality of previously registered speaker models;
select one speaker model from among the plurality of previously registered speaker models based on the calculated similarity; and
update the selected speaker model based on the similarity between the extracted feature vector value and the selected speaker model.