KR20190119521A

KR20190119521A - Electronic apparatus and operation method thereof

Info

Publication number: KR20190119521A
Application number: KR1020190035388A
Authority: KR
Inventors: 후이 시에
Original assignee: 삼성전자주식회사
Priority date: 2018-04-12
Filing date: 2019-03-27
Publication date: 2019-10-22
Also published as: CN108510981A; CN108510981B

Abstract

An electronic apparatus and an operation method thereof are provided. The operation method of the electronic apparatus may include: receiving a first voice signal through a first channel and a second voice signal through a second channel different from the first channel; acquiring first voice data from the first voice signal and second voice data from the second voice signal; training, based on the second voice data, a background speaker model acquired based on voice signals of a plurality of speakers; determining whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic apparatus; and, if the first voice data is voice data corresponding to the first registered speaker, training, based on the first voice data and the background speaker model, a first registered speaker model for recognizing the voice of the first registered speaker.

Description

ELECTRONIC APPARATUS AND OPERATION METHOD THEREOF

The present disclosure relates to an electronic device and a method of operating the same, and more particularly, to an electronic device that acquires voice data for training a speech recognition model and a method of operating the same.

Speech recognition on mobile intelligent terminals falls into two categories: semantic recognition and speaker recognition. Speaker recognition, also commonly called voiceprint recognition, can be divided into two types: text-dependent and text-independent.

Text-dependent speech recognition requires the user to read a fixed sentence two or three times in order to record the relevant function information that the user wants to register. At recognition time, the user must read the same fixed sentence.

Text-independent speech recognition, by contrast, does not require the user to read a fixed sentence. The user inputs a large amount of voice data for machine learning, and training on that data allows the user's feature information to be highly refined. At prediction time, the user does not need to read the same fixed sentence.

A mobile intelligent terminal according to the prior art cannot identify users or distinguish the voice characteristics of different users. The same mobile intelligent terminal may therefore provide the same voice command service to different users, which weakens the protection of user privacy.

Taking a voice assistant as an example, an existing mobile intelligent terminal requires a fixed wake-up process when starting the voice assistant service. This process is a shortcoming of text-dependent speech recognition: the user must read a fixed sentence to wake the voice assistant, and the voice assistant cannot respond promptly to all of the user's voice commands.

That is, all voice commands can be used once the wake-up process of the voice assistant has completed. When a registered user wakes the voice assistant with the fixed sentence, an unregistered user can then issue voice commands. The way existing voice assistants distinguish speakers is thus limited to distinguishing them through a fixed sentence, based on text-dependent speech recognition.

To obtain highly refined user feature information and model parameters by training a complete model on a large amount of input voice data, text-independent speech recognition may use machine learning techniques.

Based on the trained speaker model, high-accuracy speaker recognition can be achieved from arbitrary voice input, without the user being tied to a fixed sentence.

However, implementing text-independent speech recognition on a mobile intelligent terminal requires voice data from both registered and unregistered persons. The user must invest considerable time and effort to input voice data, and obtaining voice data from unregistered users is also a difficult problem for the user.

Because it is difficult to achieve high recognition accuracy without sufficient training data, existing mobile intelligent terminals generally do not use text-independent speech recognition.

An object of the present disclosure is to provide an electronic device, and a method of operating the same, that use voice data acquired through a user's voice calls for training, thereby reducing the user's burden of inputting voice data and improving the user experience.

The objects of the present disclosure are not limited to the object mentioned above. Other objects and advantages of the present disclosure that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present disclosure. It will also be readily appreciated that the objects and advantages of the present disclosure can be realized by the means indicated in the claims and combinations thereof.

As a technical means for achieving the above technical object, a first aspect of the present disclosure may provide a method of operating an electronic device, the method including: receiving a first voice signal through a first channel and receiving a second voice signal through a second channel different from the first channel; obtaining first voice data from the first voice signal and obtaining second voice data from the second voice signal; training, based on the second voice data, a background speaker model obtained based on voice signals of a plurality of speakers; determining whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic device; and, when the first voice data is voice data corresponding to the first registered speaker, training, based on the first voice data and the background speaker model, a first registered speaker model that recognizes the voice of the first registered speaker.

In addition, a second aspect of the present disclosure may provide an electronic device including: at least one processor configured to receive a first voice signal through a first channel, receive a second voice signal through a second channel different from the first channel, obtain first voice data from the first voice signal, obtain second voice data from the second voice signal, train, based on the second voice data, a background speaker model obtained based on voice signals of a plurality of speakers, determine whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic device, and, when the first voice data is voice data corresponding to the first registered speaker, train, based on the first voice data and the background speaker model, a first registered speaker model that recognizes the voice of the first registered speaker; and a memory that stores the first voice data and the second voice data.

In addition, a third aspect of the present disclosure may provide a computer program product including a recording medium storing a program for performing the method of the first aspect.

According to the electronic device and the operation method of the present disclosure, voice data acquired through the user's voice calls is used for training, which reduces the user's burden of inputting voice data and improves the user experience.

FIG. 1 is a diagram illustrating an example of a method of generating a registered speaker model, according to some embodiments.
FIG. 2 is a flowchart of a method of operating an electronic device, according to some embodiments.
FIG. 3 is a flowchart illustrating a method of training a first registered speaker model of an electronic device, according to some embodiments.
FIG. 4 is a diagram schematically illustrating training, registration, and recognition methods of an electronic device, according to some embodiments.
FIG. 5A is a diagram illustrating the distribution of a GMM and the voice feature distribution of a speaker to be learned, according to some embodiments.
FIG. 5B is a diagram illustrating the distribution of a GMM model after MAP adaptation, according to some embodiments.
FIG. 6 is a flowchart illustrating a method of training a second registered speaker model of an electronic device, according to some embodiments.
FIG. 7 is a diagram illustrating a process in which an electronic device performs a voice command, according to some embodiments.
FIG. 8 is a diagram illustrating a process in which an electronic device performs a voice command, according to some embodiments.
FIG. 9 is a diagram illustrating a voice processing method of a host electronic device, according to some embodiments.
FIG. 10 is a block diagram schematically illustrating the configuration of an electronic device, according to some embodiments.
FIG. 11 is a diagram schematically illustrating the configuration of a server, according to some embodiments.

The above objects, features, and advantages are described in detail below with reference to the accompanying drawings, so that those of ordinary skill in the art to which the present invention pertains can readily implement its technical idea. In describing the present invention, detailed descriptions of known technology related to the present invention are omitted where they would unnecessarily obscure its gist. Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same or similar components.

FIG. 1 is a diagram illustrating an example of a method of generating a registered speaker model, according to some embodiments.

Referring to FIG. 1, an electronic device 1000 according to some embodiments may receive a voice signal from a user 10.

The electronic device 1000 may be a terminal device that can be used by the user 10. The electronic device 1000 may be, for example, a device that can directly receive the voice signal of the user 10, such as a mobile phone, a smartphone, a smart TV, smart audio, a smart speaker, an artificial intelligence speaker, a personal computer (PC), a notebook computer, or a tablet PC, but is not necessarily limited thereto.

Based on an input voice signal 16 received from the user 10, the electronic device 1000 may determine whether the user 10 is a registered speaker or a non-registered speaker.

For example, when the user 10 is determined to be a non-registered speaker (161), the electronic device 1000 may train (18) a background speaker model corresponding to a plurality of speakers, using voice features obtained from the input voice signal 16.

A widely known method of identifying a speaker uses a Gaussian mixture model (GMM). For a GMM-based speaker recognition system to be highly accurate, a large number of utterances are needed for training, and the GMM-UBM method is used: a background speaker model is built in advance from the utterances of a plurality of speakers, and adaptive training is then performed using a small number of utterances from the user.

When the user 10 is determined to be a non-registered speaker (161), the electronic device 1000 may train the background speaker model corresponding to the plurality of speakers based on the input voice signal 16, and may later use the trained background speaker model for adaptive training with a small number of utterances from the user 10. Through adaptive training, a registered speaker model for the user 10 may be generated and stored (19).

For example, when the user 10 is determined to be a registered speaker (163), the electronic device 1000 may train (18) a voice model for the user 10, using the input voice signal 16 received from the user 10 and the previously obtained background speaker model.

Meanwhile, the electronic device 1000 according to some embodiments may receive a voice signal from an external device. A received voice signal 14 from the external device may be, for example, a voice signal generated from the call voice of a counterpart 12 with whom the user 10 is talking through the electronic device 1000, but is not necessarily limited thereto.

Based on the received voice signal 14 from the external device, the electronic device 1000 may train (18) the background speaker model corresponding to the plurality of speakers. That is, unlike the input voice signal 16, the received voice signal 14 may be used to train the background speaker model without the electronic device 1000 determining whether it was input by a registered speaker.

In this way, the electronic device 1000 according to some embodiments applies different voice processing to voice signals that arrive through different channels. It can thereby effectively separate the voice of the user 10 from the voices of others, such as the call counterpart of the user 10, and train voice models based on the separated voices.
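The channel-dependent handling described for FIG. 1 can be sketched as a small routing function. This is only an illustration under stated assumptions: the names `Channel`, `TrainingPools`, and `route`, and the `is_registered` callback, are hypothetical and do not come from the disclosure, and how the registered-speaker check is actually performed is left open.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List

Features = List[float]


class Channel(Enum):
    MICROPHONE = auto()    # input voice signal 16 from the local user
    CALL_RECEIVE = auto()  # received voice signal 14 from an external device


@dataclass
class TrainingPools:
    # Hypothetical containers for data awaiting model training.
    background: List[Features] = field(default_factory=list)
    registered: List[Features] = field(default_factory=list)


def route(channel: Channel,
          features: Features,
          is_registered: Callable[[Features], bool],
          pools: TrainingPools) -> str:
    """Assign one utterance's features to a training pool."""
    if channel is Channel.CALL_RECEIVE:
        # Received signals go to the background model without any speaker check.
        pools.background.append(features)
        return "background"
    if is_registered(features):
        # Registered speaker: data for the registered speaker model.
        pools.registered.append(features)
        return "registered"
    # Non-registered speaker at the microphone: background model data.
    pools.background.append(features)
    return "background"
```

The speaker check is passed in as a callback so that the routing logic stays independent of whichever verification method a device uses.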

FIG. 2 is a flowchart of a method of operating an electronic device, according to some embodiments.

In step S201, the electronic device 1000 may receive a first voice signal through a first channel. In step S202, the electronic device 1000 may receive a second voice signal through a second channel different from the first channel.

In some embodiments, the user 10 may conduct a voice call with the counterpart 12 through the electronic device 1000. Here, a voice call may include not only a call in which only voice is exchanged but also a video call such as VoIP or VoLTE. For example, when the user initiates a voice call through the electronic device 1000, the operation method of FIG. 2 may be triggered.

In step S202, the electronic device 1000 may obtain first voice data from the first voice signal and obtain second voice data from the second voice signal.

The user's voice, that is, the voice waveform, may be input to the electronic device 1000 as an input voice signal, on the assumption that it carries all of the information produced by the user's vocalization mechanism. Of all the information contained in the voice signal, the electronic device 1000 may use only the part that serves the purpose of speech recognition. This part may be extracted from the voice signal by statistical methods.

Information extracted from a voice signal in this way, for use in a speech recognition system, may be referred to as voice features. Voice features may be extracted from the voice signal so as to include, for example, a plurality of components whose spectral distributions over frequency differ from one another.

In the voice feature extraction process, the electronic device 1000 may extract feature vectors as information that removes unnecessarily redundant voice information, increases consistency among identical voice signals, and at the same time increases discrimination from other voice signals.

Such feature vectors may be extracted from the voice signal using, for example, linear predictive coefficients, the cepstrum, mel-frequency cepstral coefficients (MFCC), or filter bank energy (energy per frequency band), but the extraction is not necessarily limited to these methods.
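As a rough illustration of one of the listed options, energy per frequency band, the sketch below frames a signal and computes log energies over equal-width bands. It is a simplification under assumptions: the bands are linear rather than mel-spaced, the DFT is a naive one to keep the example free of dependencies, frames must hold at least two samples, and the function names are hypothetical.

```python
import cmath
import math
from typing import List, Sequence


def frame_signal(samples: Sequence[float], frame_len: int, hop: int) -> List[Sequence[float]]:
    """Split a signal into overlapping frames of frame_len samples."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Power spectrum of a Hamming-windowed frame via a naive DFT."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def band_log_energies(frame: Sequence[float], num_bands: int) -> List[float]:
    """Log energy in num_bands equal-width frequency bands."""
    spec = power_spectrum(frame)
    energies = []
    for b in range(num_bands):
        lo = b * len(spec) // num_bands
        hi = (b + 1) * len(spec) // num_bands
        # Small constant avoids log(0) for empty or silent bands.
        energies.append(math.log(sum(spec[lo:hi]) + 1e-10))
    return energies
```

A production front end would normally use an FFT and mel-spaced triangular filters; the equal-width bands here only show the shape of the computation.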

The voice features extracted by the electronic device 1000 may be stored, for example, in the form of voice data, and may be used for generating and registering a speaker model and for recognizing a speaker's voice.

In step S202, the electronic device 1000 may determine, for example, whether the first voice data and the second voice data satisfy a preset speaker model training condition.

The speaker model training condition may include, for example, a condition on whether each piece of voice data was produced by a user's vocalization mechanism. That is, the electronic device 1000 may determine, for example, whether the first voice data and the second voice data include non-mute features.

When the first voice data and the second voice data do not include non-mute features, that is, when it is determined that no speech by a speaker is recorded in the first voice data and the second voice data, the electronic device 1000 may determine that the first voice data and the second voice data do not satisfy the speaker model training condition.
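One simple way to approximate a check for non-mute content is a frame-energy test, sketched below. The disclosure does not say how non-mute features are detected, so the RMS criterion, the function names, and every default value (`frame_len`, `rms_threshold`, `min_active_frames`) are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def frame_rms(frame: Sequence[float]) -> float:
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def has_non_mute_frames(samples: Sequence[float],
                        frame_len: int = 160,
                        rms_threshold: float = 0.01,
                        min_active_frames: int = 3) -> bool:
    """Return True if enough non-overlapping frames exceed the RMS threshold."""
    active = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        if frame_rms(samples[start:start + frame_len]) >= rms_threshold:
            active += 1
            if active >= min_active_frames:
                return True
    return False
```

Under this sketch, each voice data stream would be tested separately, and a stream that fails the test would be treated as not meeting the training condition.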

If the determination shows that the first voice data and the second voice data satisfy the preset speaker model training condition, the voice data acquisition apparatus may perform step S203.

In some embodiments, voice cleaning may optionally be performed on the first voice data and the second voice data after step S202 and before step S203. Because voice cleaning may include denoising and noise reduction, it can improve the quality of the first voice data and the second voice data and thus also the effectiveness of speaker model training.

Meanwhile, when the electronic device 1000 according to some embodiments is a mobile intelligent terminal, step S202 is a step of storing voice data, and a hardware device operation layer may be set up in the operating system of the electronic device 1000.

When the user starts a voice call through an application of the electronic device 1000, the user may input and store the input voice data in the hardware device operation layer of the system in real time. Here, the input voice signal at the microphone of the electronic device 1000 corresponds to the voice data of the user of the mobile intelligent terminal, that is, of the electronic device 1000. The voice signal received at the communication unit of the electronic device 1000 corresponds to the voice data of the other party to the voice call.

In step S203, the electronic device 1000 may train, based on the second voice data, a background speaker model obtained based on voice signals of a plurality of speakers.

For example, the electronic device 1000 may classify the second voice data used for training the background speaker model as background speaker model training data and store it in the memory of the electronic device 1000.

The background speaker model is a Gaussian mixture model generated by collecting voice signals from as wide a variety of speakers as possible and then applying the expectation maximization algorithm (hereinafter, the EM algorithm) to the voice features corresponding to the collected voice signals.
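The EM procedure named above can be illustrated on one-dimensional features. This is a minimal sketch, not the disclosure's implementation: real systems use multi-dimensional feature vectors and many more components, and the function names, the variance floor, and the fixed iteration count are assumptions.

```python
import math
from typing import List, Sequence, Tuple

# One (weight, mean, variance) triple per mixture component.
Gmm = List[Tuple[float, float, float]]


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def log_likelihood(data: Sequence[float], gmm: Gmm) -> float:
    return sum(math.log(max(sum(w * gaussian_pdf(x, m, v) for w, m, v in gmm), 1e-300))
               for x in data)


def em_step(data: Sequence[float], gmm: Gmm, var_floor: float = 1e-3) -> Gmm:
    # E-step: posterior probability of each component for each sample.
    resp = []
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in gmm]
        total = max(sum(joint), 1e-300)  # guard against underflow
        resp.append([j / total for j in joint])
    # M-step: re-estimate weight, mean, and variance of each component.
    updated = []
    for k in range(len(gmm)):
        n_k = max(sum(r[k] for r in resp), 1e-12)
        mean = sum(r[k] * x for r, x in zip(resp, data)) / n_k
        var = sum(r[k] * (x - mean) ** 2 for r, x in zip(resp, data)) / n_k
        updated.append((n_k / len(data), mean, max(var, var_floor)))
    return updated


def train_background_model(data: Sequence[float], init: Gmm, iterations: int = 20) -> Gmm:
    """Run a fixed number of EM iterations from an initial model."""
    gmm = init
    for _ in range(iterations):
        gmm = em_step(data, gmm)
    return gmm
```

In this sketch, additional second voice data would simply be added to `data` before EM is run again; whether the device retrains from scratch or updates incrementally is not specified in the text above.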

In step S204, the electronic device 1000 may determine whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic device 1000.

In step S205, when the first voice data is voice data corresponding to the first registered speaker, the electronic device 1000 may train, based on the first voice data and the background speaker model, a first registered speaker model that recognizes the voice of the first registered speaker.

For example, the electronic device 1000 may classify the first voice data used for training the first registered speaker model as first registered speaker model training data and store it in the memory of the electronic device 1000.

To train the first registered speaker model, the electronic device 1000 may, for example, extract feature vectors from the first voice data and use a maximum a posteriori (MAP) adaptation algorithm based on the extracted feature vectors and the background speaker model. A specific method of training a registered speaker model based on voice data and the background speaker model is described in detail later with reference to FIGS. 4, 5A, and 5B.
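For orientation, the sketch below shows a commonly used form of MAP adaptation in GMM-UBM systems, in which only the component means are shifted toward the speaker's data and a relevance factor controls how far. Whether the disclosure adapts only the means, and what relevance value it uses, is not stated in this section, so both are assumptions, as are the function names and the one-dimensional features.

```python
import math
from typing import List, Sequence, Tuple

# One (weight, mean, variance) triple per mixture component.
Gmm = List[Tuple[float, float, float]]


def gaussian_pdf(x: float, mean: float, var: float) -> float:
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def map_adapt_means(data: Sequence[float], ubm: Gmm, relevance: float = 16.0) -> Gmm:
    """Derive a speaker model from the background model by mean-only MAP adaptation."""
    counts = [0.0] * len(ubm)  # soft count n_k of samples per component
    sums = [0.0] * len(ubm)    # posterior-weighted sum of samples per component
    for x in data:
        joint = [w * gaussian_pdf(x, m, v) for w, m, v in ubm]
        total = max(sum(joint), 1e-300)  # guard against underflow
        for k, j in enumerate(joint):
            gamma = j / total
            counts[k] += gamma
            sums[k] += gamma * x
    adapted = []
    for (w, m, v), n_k, s_k in zip(ubm, counts, sums):
        # Equivalent to alpha * (s_k / n_k) + (1 - alpha) * m with
        # alpha = n_k / (n_k + relevance), written to avoid dividing by n_k.
        new_mean = (s_k + relevance * m) / (n_k + relevance)
        adapted.append((w, new_mean, v))
    return adapted
```

Components that see little speaker data keep means close to the background model, which is why a small number of utterances can be enough for adaptation.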

The voice feature extraction technique of a neural network speech recognition model requires a large amount of personal speech to be input in advance in order to obtain the speaker's voice features. The existing approach has the user input training voice data through a dedicated program, so the user needs additional time to carry out personal voice function training.

In some embodiments of the present disclosure, the user collects, through the electronic device 1000, the instant voice data produced during everyday work and life, and the collected voice data is used to train the speech recognition model, so that the recognition accuracy of the model can be improved continuously.

Compared with the prior art, a user of the training method of the present disclosure does not need to perform tedious and complicated input of training voice data, which reduces the user's burden and improves the user experience.

At the same time, the training method of the present disclosure allows text-independent speech recognition to be built up on a mobile intelligent terminal, going beyond the limits of existing text-dependent speech recognition and allowing the mobile intelligent terminal to understand the characteristics of each user more intelligently.

In the case of a voice assistant, adding a text-independent speech recognition function makes it possible to identify the user and process only the voice commands of speakers covered by the speech recognition model, which strengthens security.

Meanwhile, to avoid wasting resources and leaking user privacy, the electronic device 1000 may delete the first voice data and the second voice data immediately after training ends, and the method of FIG. 1 may end. Alternatively, when training of the speech recognition model ends, the related voice data may be deleted and the method of FIG. 2 may end.
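The delete-after-training behavior can be sketched as a small wrapper. The function name `train_then_discard` and the `train` callback are hypothetical, and clearing the buffers even when training raises an error is a design choice of this sketch rather than something the disclosure specifies.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def train_then_discard(first_voice_data: List[float],
                       second_voice_data: List[float],
                       train: Callable[[List[float], List[float]], T]) -> T:
    """Run training, then empty both voice data buffers."""
    try:
        return train(first_voice_data, second_voice_data)
    finally:
        # Remove the stored voice data so it does not outlive training.
        first_voice_data.clear()
        second_voice_data.clear()
```

Clearing in a `finally` block favors privacy over retry convenience: if training fails, the data has to be collected again.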

FIG. 3 is a flowchart illustrating a method by which an electronic device trains a first registered speaker model, according to some embodiments.

Referring to FIG. 3, the electronic device 1000 according to some embodiments may receive a first voice signal 301 from the user 10 and obtain first voice data 303 from the received first voice signal 301.

Meanwhile, the electronic device 1000 may receive, from an external device, a second voice signal 302 based on the call voice of the other party 12, and obtain second voice data 304 from the received second voice signal 302.

The electronic device 1000 may use the obtained second voice data 304 to train a background speaker model 306 that was obtained based on voice signals of a plurality of speakers.

The electronic device 1000 may determine (305) whether the obtained first voice data 303 is voice data corresponding to a first registered speaker registered in the electronic device 1000.

For example, the electronic device 1000 may receive separate authentication data from the user and, based on that data, determine (305) whether the first voice data 303 is voice data corresponding to the first registered speaker registered in the electronic device 1000.

For example, the electronic device 1000 may determine (305) whether the first voice data is voice data corresponding to the first registered speaker based on at least one of face recognition based on an image captured by a camera and fingerprint recognition using a fingerprint recognition sensor.

For the determination based on face recognition, the electronic device 1000 may obtain and store in advance a face image captured of the first registered speaker.

For example, the user may input his or her face image through a camera included in the electronic device 1000 at the same time as inputting the first voice signal, thereby authenticating that he or she is the user who input the first voice signal corresponding to the first voice data.

As another example, after inputting the first voice signal to the electronic device 1000, the user may input his or her face image through the camera included in the electronic device 1000 in response to a face image input request displayed on the screen of the electronic device 1000, thereby authenticating that he or she is the user who input the first voice signal corresponding to the first voice data.

For the determination based on fingerprint recognition, the electronic device 1000 may store in advance fingerprint data obtained for the first registered speaker.

For example, the user may input his or her fingerprint data through a fingerprint recognition sensor included in the electronic device 1000 at the same time as inputting the first voice signal, thereby authenticating that he or she is the user who input the first voice signal corresponding to the first voice data.

As another example, after inputting the first voice signal to the electronic device 1000, the user may input his or her fingerprint image through the fingerprint recognition sensor included in the electronic device 1000 in response to a fingerprint data request displayed on the screen of the electronic device 1000, thereby authenticating that he or she is the user who input the first voice signal corresponding to the first voice data.

If the determination (305) finds that the first voice data 303 is voice data corresponding to the first registered speaker registered in the electronic device 1000, the electronic device 1000 may train a first registered speaker model 307 using the first voice data 303 and the background speaker model 306.

Conversely, if the determination (305) finds that the first voice data 303 is not voice data corresponding to the first registered speaker registered in the electronic device 1000, the electronic device 1000 may, just as with the second voice data 304, use the first voice data 303 to train the background speaker model 306 obtained based on voice signals of a plurality of speakers.

As another example, the electronic device 1000 may perform voice recognition based on a preset first registered speaker model to determine (305) whether the first voice data is voice data corresponding to the first registered speaker. That is, the electronic device 1000 may use the voice recognition model itself for user authentication, without user authentication by face recognition or fingerprint recognition. Because such authentication is prone to large determination errors while the voice recognition model is still in its initial training, manual identity authentication by the user may accompany it as a supplement.

For example, to perform voice recognition based on the preset first registered speaker model, the electronic device 1000 may receive a speaker registration voice signal generated by the utterance of a preset sentence and use the received speaker registration voice signal to register the first registered speaker model in advance. In this case, if the determination (305) finds that the first voice data 303 is voice data corresponding to the first registered speaker registered in the electronic device 1000, the electronic device 1000 may use the first voice data 303 and the background speaker model 306 to update the preset first registered speaker model and thereby obtain a new first registered speaker model 307.

For example, the electronic device 1000 may perform user authentication through at least one of the user authentication scheme based on face recognition or fingerprint recognition described above and the user authentication scheme that uses the user's voice recognition model itself.

For example, the electronic device 1000 may perform user authentication by using the user authentication scheme based on face recognition or fingerprint recognition described above and the user authentication scheme that uses the user's voice recognition model itself, either simultaneously or sequentially. However, the disclosure is not limited thereto, and any authentication scheme capable of determining whether the first voice data is voice data corresponding to the first registered speaker may be used.
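One way to use the two authentication schemes sequentially is sketched below. It is only an illustration under stated assumptions: the function name, the score scale, the `margin` band, and the `biometric_check` callback (standing in for face or fingerprint recognition) are all hypothetical, and the disclosure does not prescribe this particular ordering.

```python
def authenticate_speaker(voice_score, accept_threshold,
                         biometric_check=None, margin=0.5):
    """Sequential authentication sketch: voice model first, biometrics second.

    voice_score and accept_threshold are on an arbitrary, assumed scale.
    """
    if voice_score >= accept_threshold + margin:
        return True   # the voice model alone is confident enough to accept
    if voice_score < accept_threshold - margin:
        return False  # the voice model alone is confident enough to reject
    # Uncertain band: an early-stage voice model may be unreliable, so fall
    # back to the supplementary check (for example, face or fingerprint).
    if biometric_check is None:
        return False
    return bool(biometric_check())
```

Restricting the fallback to an uncertain band keeps the extra user interaction rare once the voice model has been trained on enough data.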

In this way, when even voice data obtained through the first channel is determined to be voice data of a user who is not the registered speaker, the electronic device 1000 according to some embodiments uses the obtained voice data to train the background speaker model, which has the advantage that voice data for training the background speaker model can be collected more efficiently.
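The overall flow of FIG. 3 can be summarized in a short sketch. The function and parameter names are hypothetical, and `is_registered_speaker` stands in for whichever determination (305) the device uses; the sketch only decides which model each piece of voice data should train.

```python
def route_voice_data(first_voice_data, second_voice_data, is_registered_speaker):
    """Return (registered_batch, background_batch) following the FIG. 3 flow."""
    # Voice data from the second channel always trains the background model.
    background_batch = list(second_voice_data)
    registered_batch = []
    if is_registered_speaker(first_voice_data):
        # Determination (305) succeeded: train the registered speaker model.
        registered_batch.extend(first_voice_data)
    else:
        # Determination (305) failed: treat the data like second-channel data.
        background_batch.extend(first_voice_data)
    return registered_batch, background_batch
```

For example, `route_voice_data(frames_a, frames_b, verifier)` would send `frames_a` to the background batch whenever `verifier` rejects it.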

FIG. 4 is a diagram schematically illustrating training, registration, and recognition methods of an electronic device according to some embodiments.

Referring to FIG. 4, in the training stage, the registration stage, and the recognition stage, the electronic device 1000 may extract voice features 402, 432, and 452 from the input or received voice signals 401, 431, and 451, respectively.

In the training stage, the electronic device 1000 may generate a background speaker model 404 by applying an expectation-maximization (EM) algorithm 403 to the voice features 402 extracted from the input voice signal 401.
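To make the training stage concrete, the following is a minimal sketch of one EM iteration for a one-dimensional Gaussian mixture in plain Python. It is an illustration under stated assumptions, not the disclosed implementation: real voice features are multi-dimensional vectors, and the function names, the scalar features, and the variance floor `var_floor` are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_step(data, weights, means, variances, var_floor=1e-3):
    """One EM iteration for a 1-D Gaussian mixture; returns new parameters."""
    k = len(weights)
    # E-step: responsibility of each component for each sample.
    resp = []
    for x in data:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        resp.append([p / total for p in joint])
    # M-step: re-estimate weight, mean, and variance of each component.
    new_w, new_m, new_v = [], [], []
    for j in range(k):
        n_j = sum(r[j] for r in resp)
        if n_j == 0.0:
            # Component received no mass: keep its previous parameters.
            new_w.append(0.0)
            new_m.append(means[j])
            new_v.append(variances[j])
            continue
        mean_j = sum(r[j] * x for r, x in zip(resp, data)) / n_j
        var_j = sum(r[j] * (x - mean_j) ** 2 for r, x in zip(resp, data)) / n_j
        new_w.append(n_j / len(data))
        new_m.append(mean_j)
        new_v.append(max(var_j, var_floor))  # floor avoids collapsing variances
    return new_w, new_m, new_v


def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture."""
    return sum(
        math.log(sum(w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)))
        for x in data
    )
```

Calling `em_step` repeatedly on features pooled from many speakers, until `log_likelihood` stops improving, would yield the mixture that plays the role of the background speaker model in this sketch.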

Meanwhile, in the registration stage, the electronic device 1000 may apply a preset maximum a posteriori (MAP) adaptation algorithm to the voice features 432 extracted from the input voice signal 431 and to the background speaker model 404 generated in the training stage, thereby generating a Gaussian mixture model (GMM) adapted to the user who spoke, that is, a speaker model 434.

Meanwhile, in the recognition stage, the electronic device 1000 may calculate (453) a similarity score using the voice features 452 extracted from the input voice signal 451 and the speaker model 434, and may recognize (454) the user who spoke based on the calculated score.
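The recognition stage can be sketched as a score followed by a threshold test. The text above only states that a similarity score is computed from the voice features and the speaker model; scoring with an average log-likelihood ratio against the background speaker model is an assumption made here, as are the function names, the one-dimensional features, and the default threshold.

```python
import math


def mixture_log_density(x, model):
    """Log density of x under a 1-D mixture given as (weights, means, variances)."""
    weights, means, variances = model
    density = sum(
        w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for w, m, v in zip(weights, means, variances)
    )
    return math.log(density)


def similarity_score(features, speaker_model, background_model):
    """Average log-likelihood ratio of the speaker model to the background model."""
    total = sum(
        mixture_log_density(x, speaker_model) - mixture_log_density(x, background_model)
        for x in features
    )
    return total / len(features)


def is_registered_speaker(features, speaker_model, background_model, threshold=0.0):
    """Accept the utterance when the score exceeds the (hypothetical) threshold."""
    return similarity_score(features, speaker_model, background_model) > threshold
```

A positive score means the features are better explained by the adapted speaker model than by the general population model; in practice the threshold would be tuned on held-out data.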

FIG. 5A is a diagram illustrating the distribution of a GMM and the distribution of the voice features of a speaker to be learned, according to some embodiments.

Referring to FIG. 5A, the figure shows a distribution 501 of the background speaker model, that is, a Gaussian mixture model obtained by applying the EM algorithm to a plurality of speakers, and a distribution 511 of the voice features of the user to be learned.

FIG. 5B is a diagram illustrating the distribution of the GMM after MAP adaptation, according to some embodiments.

Referring to FIG. 5B, for the generated background speaker model, the electronic device 1000 may use the MAP adaptation algorithm to shift the parameters of the background speaker model so that the a posteriori probability is maximized for the voice data of the user to be learned. The shifting yields a distribution 502 of the Gaussian mixture model for that user, and this way of generating a Gaussian mixture model may be referred to as the GMM-UBM scheme.

The GMM-UBM scheme has the advantage that, when Gaussian mixture models are obtained for several users, the parameters of every one of them are obtained by shifting the same background speaker model, so the Gaussian mixture models are easy to compare with one another.
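The parameter shifting described for FIG. 5B can be illustrated with a mean-only MAP adaptation of a one-dimensional mixture. This is a sketch under assumptions rather than the disclosed algorithm: only the means are adapted, features are scalars, and the `relevance` factor (which controls how strongly the background means are retained) and all function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def map_adapt_means(data, weights, means, variances, relevance=16.0):
    """Shift background-model means toward one speaker's data (mean-only MAP)."""
    k = len(weights)
    counts = [0.0] * k  # soft count of samples per component
    sums = [0.0] * k    # responsibility-weighted sum of samples
    for x in data:
        joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(joint)
        for j in range(k):
            r = joint[j] / total
            counts[j] += r
            sums[j] += r * x
    adapted = []
    for j in range(k):
        # alpha grows with the amount of speaker data seen by component j.
        alpha = counts[j] / (counts[j] + relevance)
        observed_mean = sums[j] / counts[j] if counts[j] > 0 else means[j]
        adapted.append(alpha * observed_mean + (1.0 - alpha) * means[j])
    return adapted
```

Components that see little of the speaker's data stay close to the background model, which is what keeps models adapted for different users directly comparable.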

FIG. 6 is a flowchart illustrating a method by which an electronic device trains a second registered speaker model, according to some embodiments.

Referring to FIG. 6, the electronic device 1000 according to some embodiments may receive a first voice signal 601 from the user 10 and obtain first voice data 603 from the received first voice signal 601.

Meanwhile, the electronic device 1000 may receive, from an external device, a second voice signal 602 based on the call voice of the other party 12, and obtain second voice data 604 from the received second voice signal 602.

The electronic device 1000 may use the obtained second voice data 604 to train a background speaker model 607 that was obtained based on voice signals of a plurality of speakers.

The electronic device 1000 may determine (605) whether the obtained first voice data 603 is voice data corresponding to a first registered speaker registered in the electronic device 1000.

For example, the electronic device 1000 may receive separate authentication data from the user and, based on that data, determine (605) whether the first voice data 603 is voice data corresponding to the first registered speaker registered in the electronic device 1000.

As another example, the electronic device 1000 may perform voice recognition based on a preset first registered speaker model to determine (605) whether the first voice data is voice data corresponding to the first registered speaker.

If the determination (605) finds that the first voice data 603 is voice data corresponding to the first registered speaker registered in the electronic device 1000, the electronic device 1000 may train a first registered speaker model 608 using the first voice data 603 and the background speaker model 607.

Conversely, if the determination (605) finds that the first voice data 603 is not voice data corresponding to the first registered speaker registered in the electronic device 1000, the electronic device 1000 may, just as with the second voice data 604, use the first voice data 603 to train the background speaker model 607 obtained based on voice signals of a plurality of speakers.

Meanwhile, the electronic device 1000 may determine (606) whether the obtained first voice data 603 is voice data corresponding to a second registered speaker registered in the electronic device 1000.

For example, the electronic device 1000 may receive separate authentication data from the user and, based on that data, determine (606) whether the first voice data 603 is voice data corresponding to the second registered speaker registered in the electronic device 1000.

As another example, the electronic device 1000 may perform voice recognition based on a preset second registered speaker model to determine (606) whether the first voice data is voice data corresponding to the second registered speaker.

If the determination (606) finds that the first voice data 603 is voice data corresponding to the second registered speaker registered in the electronic device 1000, the electronic device 1000 may train a second registered speaker model 609 using the first voice data 603 and the background speaker model 607.

Conversely, if the determination (606) finds that the first voice data 603 is not voice data corresponding to the second registered speaker registered in the electronic device 1000, the electronic device 1000 may, just as with the second voice data 604, use the first voice data 603 to train the background speaker model 607 obtained based on voice signals of a plurality of speakers.

That is, even when the electronic device 1000 is owned by a first user but is borrowed and used by a second user who is not the first user, the electronic device 1000 may determine that the first voice data extracted from the first voice signal input by the second user is voice data corresponding to the second registered speaker.

Although FIG. 6 shows in parallel the determination (605) of whether the first voice data 603 corresponds to the first registered speaker and the determination (606) of whether the first voice data 603 corresponds to the second registered speaker, the determinations (605, 606) may be performed simultaneously or sequentially, and those skilled in the art will readily understand that the method of FIG. 6 applies equally when there are n registered speakers.
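The extension to n registered speakers can be sketched as a loop over per-speaker determinations. The names are hypothetical, and the sketch adopts one possible reading of FIG. 6 as an assumption: the first-channel data is used for the background speaker model only when no registered speaker matches.

```python
def route_first_channel(voice_data, verifiers):
    """Decide which models the first-channel voice data should train.

    verifiers maps a registered-speaker id to a callable that returns True
    when the data is judged to belong to that speaker (assumed interface).
    Returns (matched_speaker_ids, use_for_background).
    """
    matched = [
        speaker_id
        for speaker_id, verify in verifiers.items()
        if verify(voice_data)
    ]
    # An empty match list means the data came from a non-registered speaker.
    return matched, not matched
```

Because each verifier is independent, the determinations could equally be evaluated one after another, as here, or concurrently.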

In this way, when voice data obtained through the first channel is determined to be voice data corresponding to any one of a plurality of registered speakers, the electronic device 1000 according to some embodiments uses the obtained voice data to train the background speaker model, which has the advantage that voice data for training the background speaker model can be collected more efficiently for the plurality of users of the electronic device 1000.

In particular, the method according to some embodiments of FIG. 6 applies not only when the user of the electronic device 1000 is not its owner but equally to a multi-party voice call in which the voices of a plurality of users are input, simultaneously or sequentially, through the microphone of the electronic device 1000. The electronic device 1000 according to some embodiments therefore has the advantage of collecting voice data for training the background speaker model more efficiently for a plurality of users.

FIG. 7 is a diagram illustrating a process in which an electronic device carries out a voice command, according to some embodiments.

Referring to FIG. 7, the electronic device 1000 may use registered speaker models for a plurality of users including a father 701, a son 703, and a mother 705. For example, the father 701, the son 703, and the mother 705 may each input to the electronic device 1000 an input voice signal corresponding to the same command phrase 72, such as "Call Mom." The input voice signal may be processed (71) by the electronic device 1000 and/or a server and used for user authentication and for performing the operation corresponding to the voice command.

Here, for the input voice signals corresponding to the same command phrase 72 input by each user, the electronic device 1000 may detect the registered speaker model corresponding to each user and analyze the information about that user included in the detected registered speaker model, thereby performing the specific function appropriate to each user.

For example, the electronic device 1000 may obtain voice data from the input voice signal input by the father 701 and compare the obtained voice data with the registered speaker model corresponding to the father 701. If the comparison shows that the input voice signal was input by the father 701, the electronic device 1000 may determine that authentication of the father 701 has succeeded and obtain information about the father's mother 731 from the registered speaker model corresponding to the father 701. Having obtained the information about the father's mother 731, the electronic device 1000 may respond to the input voice signal of the father 701 by performing a voice call request function directed to the father's mother 731.

As another example, the electronic device 1000 may obtain voice data from the input voice signal input by the son 703 and compare the obtained voice data with the registered speaker model corresponding to the son 703. If the comparison shows that the input voice signal was input by the son 703, the electronic device 1000 may determine that authentication of the son 703 has succeeded and obtain information about the (son's) mother 705 from the registered speaker model corresponding to the son 703. Having obtained the information about the mother 705, the electronic device 1000 may respond to the input voice signal of the son 703 by performing a voice call request function directed to the mother 705.

As yet another example, the electronic device 1000 may obtain voice data from the input voice signal input by the mother 705 and compare the obtained voice data with the registered speaker model corresponding to the mother 705. If the comparison shows that the input voice signal was input by the mother 705, the electronic device 1000 may determine that authentication of the mother 705 has succeeded and obtain information about the mother's mother 735 from the registered speaker model corresponding to the mother 705. Having obtained the information about the mother's mother 735, the electronic device 1000 may respond to the input voice signal of the mother 705 by performing a voice call request function directed to the mother's mother 735.
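The three examples of FIG. 7 amount to resolving the same command phrase against information tied to the authenticated speaker. A minimal sketch follows; the `PROFILES` table, its keys, and the function name are hypothetical stand-ins for the per-user information associated with each registered speaker model.

```python
# Hypothetical per-speaker information linked to each registered speaker model.
PROFILES = {
    "father": {"mom": "father's mother (731)"},
    "son": {"mom": "mother (705)"},
    "mother": {"mom": "mother's mother (735)"},
}


def resolve_call_target(speaker_id, relation, profiles=PROFILES):
    """Return the contact that `relation` denotes for the authenticated speaker."""
    profile = profiles.get(speaker_id)
    if profile is None:
        # Unknown or unauthenticated speaker: no call target is resolved.
        return None
    return profile.get(relation)
```

With this table, the phrase "Call Mom" from the son resolves to the mother 705, while the same phrase from the father resolves to the father's mother 731.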

FIG. 8 is a diagram illustrating a process in which an electronic device carries out a voice command, according to some embodiments.

Referring to FIG. 8, as in FIG. 7, the electronic device 1000 may use registered speaker models for a plurality of users including the father 701, the son 703, and the mother 705. Unlike FIG. 7, however, the electronic device 1000 according to some embodiments of FIG. 8 is a smart TV, and in this case the electronic device 1000 may receive the registered speaker models for the plurality of users from external devices.

That is, when the electronic device 1000 is not a device that uses the voice signals of a user and a call counterpart, as is the case for a smart TV, the electronic device 1000 may receive, from external devices 711, 713, and 715 that do use the voice signals of a user and a call counterpart, the registered speaker models for the plurality of users generated by those external devices 711, 713, and 715. The electronic device 1000 can thereby perform user authentication and the operation corresponding to a voice command without training the registered speaker models itself.

For example, the father 701, the son 703, and the mother 705 may each input to the electronic device 1000 an input voice signal corresponding to the same command phrase 82, such as "Turn on my favorite channel." The input voice signal may be processed (81) by the electronic device 1000 and/or a server and used for user authentication and for performing the operation corresponding to the voice command.

Here, for the input voice signals corresponding to the same command phrase 82 input by each user, the electronic device 1000 may detect the registered speaker model corresponding to each user and analyze the information about that user included in the detected registered speaker model, thereby performing the specific function appropriate to each user.

For example, the electronic device 1000 may obtain voice data from the input voice signal input by the father 701 and compare the obtained voice data with the registered speaker model corresponding to the father 701. The registered speaker model corresponding to the father 701 may be, for example, a registered speaker model received from a first external device 711 owned by the father 701. If the comparison shows that the input voice signal was input by the father 701, the electronic device 1000 may determine that authentication of the father 701 has succeeded and obtain information about the father's preferred channel 831 from the registered speaker model corresponding to the father 701. Having obtained the information about the father's preferred channel 831, the electronic device 1000 may respond to the input voice signal of the father 701 by switching its channel to the sports channel that is the father's preferred channel 831.

As another example, the electronic device 1000 may obtain voice data from the input voice signal input by the son 703 and compare the obtained voice data with the registered speaker model corresponding to the son 703. The registered speaker model corresponding to the son 703 may be, for example, a registered speaker model received from a second external device 713 owned by the son 703. If the comparison shows that the input voice signal was input by the son 703, the electronic device 1000 may determine that authentication of the son 703 has succeeded and obtain information about the son's preferred channel 833 from the registered speaker model corresponding to the son 703. Having obtained the information about the son's preferred channel 833, the electronic device 1000 may respond to the input voice signal of the son 703 by switching its channel to the animation channel that is the son's preferred channel 833.

As yet another example, the electronic device 1000 may obtain voice data from a voice signal input by the mother 705 and compare the obtained voice data with the registered speaker model corresponding to the mother 705. This registered speaker model may be, for example, one received from a third external device 715 owned by the mother 705. If the comparison indicates that the voice signal was input by the mother 705, the electronic device 1000 may determine that authentication of the mother 705 has succeeded and obtain information about the mother's preferred channel 835 from the registered speaker model corresponding to the mother 705. Having obtained this information, the electronic device 1000 may, in response to the mother's voice signal, switch its channel to the drama channel that is the mother's preferred channel 835.
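The three examples above share one flow: compare the input voice data with the registered speaker models, then read the preferred channel from the model that matches. The following Python sketch illustrates that flow under stated assumptions. The `RegisteredSpeakerModel` class, the cosine-similarity score, the three-element feature vectors, and the 0.8 threshold are all hypothetical; the disclosure does not specify how the comparison is scored.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class RegisteredSpeakerModel:
    """Hypothetical stand-in for a registered speaker model."""
    speaker: str
    reference: Sequence[float]   # assumed summary of the speaker's voice features
    preferred_channel: str       # preference information carried with the model


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, used here only as an illustrative match score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_channel(voice_features: Sequence[float],
                   models: List[RegisteredSpeakerModel],
                   threshold: float = 0.8) -> Optional[str]:
    """Return the preferred channel of the best-matching speaker, or None."""
    best = max(models, key=lambda m: cosine(voice_features, m.reference), default=None)
    if best is None or cosine(voice_features, best.reference) < threshold:
        return None  # no registered speaker matched: authentication fails
    return best.preferred_channel


# Hypothetical models for the father 701, the son 703, and the mother 705.
FAMILY_MODELS = [
    RegisteredSpeakerModel("father", [1.0, 0.0, 0.0], "sports"),
    RegisteredSpeakerModel("son", [0.0, 1.0, 0.0], "animation"),
    RegisteredSpeakerModel("mother", [0.0, 0.0, 1.0], "drama"),
]
```

Under these assumptions, features close to the father's reference vector select the sports channel, while features that match no model closely enough yield `None`, which a caller could treat as failed authentication.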

FIG. 9 is a diagram illustrating a voice processing method of a host electronic device according to some embodiments.

Referring to FIG. 9, the method of operating an electronic device according to an embodiment of the present disclosure may be performed by a host electronic device 901. The host electronic device 901 is an electronic device that can, for voice signals input while a plurality of speakers hold a conference, generate a recording file and train a speaker model corresponding to each of the speakers.

Before the conference, the conference attendees 971 may each share (903) their own registered speaker model with the host electronic device 901 and with the non-attendees 972. The host electronic device 901, having received the registered speaker models from the attendees 971, may train and update each registered speaker model using the GMM-UBM approach, based on the voices input during the conference. The trained and updated registered speaker models may then be shared again among the host electronic device 901, the attendees 971, and the non-attendees 972.

Meanwhile, the host electronic device 901 may generate a recording file 905 of the voice signals input during the conference and share the recording file with the non-attendees 972. After the conference ends, the non-attendees 972 who received the recording file may likewise train and update each registered speaker model using the GMM-UBM approach, based on the voice signals contained in the recording file.

In this way, by sharing registered speaker models for a plurality of users, the electronic device according to some embodiments can train the registered speaker models more efficiently, without constraints on when and where a user's utterance occurs.

FIG. 10 is a block diagram schematically illustrating the configuration of an electronic device according to some embodiments.

Referring to FIG. 10, the electronic device 1000 according to some embodiments may include a processor 1001, a memory 1003, a microphone 1005, a communication unit 1007, and a camera 1009.

The electronic device 1000 may include, for example, at least one processor 1001.

The processor 1001 typically controls the overall operation of the electronic device 1000. For example, by executing programs stored in the memory 1003, the processor 1001 may control the memory 1003, the microphone 1005, the communication unit 1007, and the camera 1009.

The processor 1001 may receive a first voice signal through a first channel and receive a second voice signal through a second channel different from the first channel.

For example, the processor 1001 may receive the first voice signal through the microphone 1005 and receive the second voice signal from an external device through the communication unit 1007. The processor 1001 may obtain first voice data from the first voice signal and second voice data from the second voice signal.

The processor 1001 may train, based on the second voice data, a background speaker model that was obtained from voice signals of a plurality of speakers.

The processor 1001 may determine whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic device 1000.

When the first voice data is voice data corresponding to the first registered speaker, the processor 1001 may train, based on the first voice data and the background speaker model, a first registered speaker model that recognizes the voice of the first registered speaker.

For example, the processor 1001 may extract a feature vector corresponding to the first voice data and train the first registered speaker model using a maximum a posteriori (MAP) adaptation algorithm that takes the extracted feature vector and the background speaker model as inputs.
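One common way to realize MAP adaptation in a GMM-UBM system is to shift only the component means of the background model toward the speaker's feature vectors. The sketch below follows that mean-only variant for a diagonal-covariance Gaussian mixture. The function name `map_adapt_means`, the relevance factor of 16, and the choice to leave weights and variances unchanged are assumptions made for illustration, not details taken from the disclosure.

```python
import math
from typing import List, Sequence


def map_adapt_means(frames: Sequence[Sequence[float]],
                    weights: Sequence[float],
                    means: Sequence[Sequence[float]],
                    variances: Sequence[Sequence[float]],
                    relevance: float = 16.0) -> List[List[float]]:
    """Mean-only MAP adaptation of a diagonal-covariance GMM (illustrative).

    frames:    feature vectors extracted from the speaker's voice data
    weights, means, variances: parameters of the background speaker model
    relevance: assumed relevance factor controlling how fast means move
    """
    k = len(weights)
    dim = len(means[0])
    counts = [0.0] * k                        # soft frame count per component
    sums = [[0.0] * dim for _ in range(k)]    # posterior-weighted feature sums

    for x in frames:
        # Log of (weight * Gaussian density) for each component.
        log_p = []
        for w, mu, var in zip(weights, means, variances):
            ll = math.log(w)
            for xd, md, vd in zip(x, mu, var):
                ll -= 0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
            log_p.append(ll)
        # Normalize to posteriors; subtracting the peak avoids underflow.
        peak = max(log_p)
        unnorm = [math.exp(v - peak) for v in log_p]
        total = sum(unnorm)
        for c in range(k):
            gamma = unnorm[c] / total
            counts[c] += gamma
            for d in range(dim):
                sums[c][d] += gamma * x[d]

    adapted = []
    for c in range(k):
        # Components that saw more speaker data move further from the prior mean.
        alpha = counts[c] / (counts[c] + relevance)
        if counts[c] > 0.0:
            data_mean = [s / counts[c] for s in sums[c]]
        else:
            data_mean = list(means[c])
        adapted.append([alpha * dm + (1.0 - alpha) * m
                        for dm, m in zip(data_mean, means[c])])
    return adapted
```

In this sketch, a component that accounts for 16 frames with the default relevance factor moves halfway from the background mean to the speaker's data mean, while a component that accounts for almost no frames stays essentially at the background mean.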

Conversely, when the first voice data is not voice data corresponding to the first registered speaker, the processor 1001 may train the background speaker model based on the first voice data.

Meanwhile, the processor 1001 may determine whether the first voice data is voice data corresponding to a second registered speaker registered in the electronic device 1000.

For example, when the first voice data is voice data corresponding to the second registered speaker, the processor 1001 may train, based on the first voice data and the background speaker model, a second registered speaker model that recognizes the voice of the second registered speaker.

Conversely, when the first voice data is not voice data corresponding to the second registered speaker, the processor 1001 may train the background speaker model based on the first voice data.
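The processor operations described above amount to a routing decision: second-channel voice data always feeds the background speaker model, and first-channel voice data feeds either a registered speaker model or the background speaker model. The Python sketch below shows one reading of that decision, in which first voice data that matches no registered speaker is used for the background speaker model. The `TrainingQueues` class, the `route_voice_data` function, and the `identify` callback are hypothetical names introduced for illustration.

```python
from typing import Callable, Dict, Iterable, List, Optional


class TrainingQueues:
    """Hypothetical holder for voice data awaiting use in model training."""

    def __init__(self, registered_speakers: Iterable[str]) -> None:
        self.background_data: List[object] = []
        self.speaker_data: Dict[str, List[object]] = {
            name: [] for name in registered_speakers
        }


def route_voice_data(queues: TrainingQueues,
                     first_voice_data: object,
                     second_voice_data: object,
                     identify: Callable[[object], Optional[str]]) -> Optional[str]:
    """Queue both channels' voice data for the model each should train.

    identify: assumed callback that returns a registered speaker's name, or
              None when the data matches no registered speaker.
    """
    # Second-channel data (e.g., a call counterpart) trains the background model.
    queues.background_data.append(second_voice_data)

    speaker = identify(first_voice_data)
    if speaker in queues.speaker_data:
        # Matches a registered speaker: train that speaker's registered model.
        queues.speaker_data[speaker].append(first_voice_data)
        return speaker

    # Matches no registered speaker: use it for the background model instead.
    queues.background_data.append(first_voice_data)
    return None
```

The `identify` callback could be backed by the registered speaker models themselves or by another modality such as face or fingerprint recognition, both of which the claims mention as ways of deciding whether voice data belongs to a registered speaker.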

The memory 1003 may store a program for controlling the operation of the electronic device 1000, and may include at least one instruction for controlling that operation. The memory 1003 may store, for example, a voice signal input by a user, a registered speaker model containing information about a user feature vector corresponding to a specific user, an acoustic model for extracting voice data of the user's utterance from a voice signal, voice data extracted from a voice signal, and a background speaker model. The programs stored in the memory 1003 may be classified into a plurality of modules according to their functions.

The memory 1003 may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card-type memory (for example, SD or XD memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, and an optical disk.

The microphone 1005 may receive an input voice signal from the user. While the electronic device 1000 is operating, the microphone 1005 may receive the user's voice from the surroundings of the electronic device 1000 in real time.

The communication unit 1007 may receive a voice signal from an external device. While the electronic device 1000 is operating, the communication unit 1007 may receive from the external device, in real time, a voice signal generated by the user's call counterpart.

The communication unit 1007 may include one or more communication modules for communicating with a server. For example, the communication unit 1007 may include a short-range communication unit and a mobile communication unit. The short-range wireless communication unit may include, but is not limited to, a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an Infrared Data Association (IrDA) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra-wideband (UWB) communication unit, and an Ant+ communication unit.

The camera 1009 may photograph the face of the user of the electronic device 1000 to obtain a user face image. For example, while the electronic device 1000 is operating, the camera 1009 may obtain the user face image in real time for user authentication.

FIG. 11 is a diagram schematically illustrating the configuration of a server according to some embodiments.

The user authentication method according to some embodiments of the present disclosure may also be performed by the electronic device 1000 together with a server 1100 connected to the electronic device 1000 through wired or wireless communication.

Referring to FIG. 11, the server 1100 according to some embodiments may include a processor 1101, a communication unit 1105, and a memory 1103. Rather than receiving the data needed for user authentication directly from the user, the server 1100 may receive a registered speaker model, an obtained voice signal, voice data, and the like from the electronic device 1000, and perform user authentication and/or training of the registered speaker model.

The memory 1103 may store a program for controlling the operation of the server 1100, and may include at least one instruction for controlling that operation. The memory 1103 may store, for example, a voice signal input by a user, a registered speaker model containing information about a user feature vector corresponding to a specific user, an acoustic model for extracting voice data of the user's utterance from a voice signal, voice data extracted from a voice signal, and a background speaker model. The programs stored in the memory 1103 may be classified into a plurality of modules according to their functions.

Meanwhile, the memory 1103 may include a plurality of databases so that registered speaker models for multiple users, data from multiple electronic devices, and the like can be managed together.

The communication unit 1105 may include one or more communication modules for communicating with the electronic device 1000. For example, the communication unit 1105 may include a short-range communication unit and a mobile communication unit. The short-range wireless communication unit may include, but is not limited to, a Bluetooth communication unit, a Bluetooth Low Energy (BLE) communication unit, a near field communication (NFC) unit, a WLAN (Wi-Fi) communication unit, a Zigbee communication unit, an Infrared Data Association (IrDA) communication unit, a Wi-Fi Direct (WFD) communication unit, an ultra-wideband (UWB) communication unit, and an Ant+ communication unit. The mobile communication unit transmits and receives wireless signals to and from at least one of a base station, an external terminal, and a server over a mobile communication network. Here, the wireless signals may include a voice call signal, a video call signal, or various types of data associated with sending and receiving text or multimedia messages.

The server 1100 may include, for example, at least one processor 1101.

The processor 1101 typically controls the overall operation of the server 1100. For example, by executing programs stored in the memory 1103, the processor 1101 may control the memory 1103 and the communication unit 1105.

The processor 1101 of the server 1100 may perform the same operations as the processor 1001 of the electronic device 1000 in the embodiment of FIG. 10, so a description of those operations is omitted here. Those skilled in the art will nevertheless readily understand that each step of the method of operating the electronic device of the present disclosure may be performed, on the same voice signal, voice data, background speaker model, and registered speaker model, by at least one of the processor 1001 of the electronic device 1000 and the processor 1101 of the server 1100.

Some embodiments may also be implemented in the form of a recording medium containing computer-executable instructions, such as program modules executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. A computer-readable medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

In this specification, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The foregoing description of the present disclosure is provided for illustration, and those of ordinary skill in the art to which the present disclosure pertains will understand that it can readily be modified into other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in combined form.

The scope of the present disclosure is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present disclosure.

Claims

A method of operating an electronic device, the method comprising:
receiving a first voice signal through a first channel and receiving a second voice signal through a second channel different from the first channel;
obtaining first voice data from the first voice signal and obtaining second voice data from the second voice signal;
training, based on the second voice data, a background speaker model obtained based on voice signals of a plurality of speakers;
determining whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic device; and
when the first voice data is voice data corresponding to the first registered speaker, training, based on the first voice data and the background speaker model, a first registered speaker model for recognizing the voice of the first registered speaker.

The method of claim 1,
wherein receiving the first voice signal through the first channel and receiving the second voice signal through the second channel different from the first channel comprises:
receiving the first voice signal through a microphone included in the electronic device; and
receiving the second voice signal from an external device through a communication unit.

The method of claim 1,
wherein determining whether the first voice data is voice data corresponding to the first registered speaker registered in the electronic device comprises:
determining whether the first voice data is voice data corresponding to the first registered speaker based on at least one of face recognition based on an image captured through a camera and fingerprint recognition using a fingerprint recognition sensor.

The method of claim 1,
further comprising, when the first voice data is not voice data corresponding to the first registered speaker, training the background speaker model based on the first voice data.

The method of claim 1, further comprising:
receiving a speaker registration voice signal generated by utterance of a preset sentence; and
registering the first registered speaker in the electronic device in advance by using the speaker registration voice signal.

The method of claim 5,
wherein determining whether the first voice data is voice data corresponding to the first registered speaker registered in the electronic device comprises:
determining whether the first voice data is voice data corresponding to the first registered speaker based on voice recognition using the first registered speaker model.

The method of claim 1,
wherein training the first registered speaker model for recognizing the voice of the first registered speaker comprises:
extracting a feature vector corresponding to the first voice data; and
training the first registered speaker model based on a maximum a posteriori (MAP) adaptation algorithm using the feature vector and the background speaker model.

The method of claim 1, further comprising:
determining whether the first voice data is voice data corresponding to a second registered speaker registered in the electronic device; and
when the first voice data is voice data corresponding to the second registered speaker, training, based on the first voice data and the background speaker model, a second registered speaker model for recognizing the voice of the second registered speaker.

The method of claim 1, further comprising:
receiving, from an external device, a third registered speaker model for recognizing the voice of the first registered speaker; and
performing voice recognition of the first registered speaker using the first registered speaker model and the third registered speaker model.

An electronic device comprising:
at least one processor configured to receive a first voice signal through a first channel, receive a second voice signal through a second channel different from the first channel, obtain first voice data from the first voice signal, obtain second voice data from the second voice signal, train, based on the second voice data, a background speaker model obtained based on voice signals of a plurality of speakers, determine whether the first voice data is voice data corresponding to a first registered speaker registered in the electronic device, and, when the first voice data is voice data corresponding to the first registered speaker, train, based on the first voice data and the background speaker model, a first registered speaker model for recognizing the voice of the first registered speaker; and
a memory configured to store the first voice data and the second voice data.

The electronic device of claim 10,
wherein the at least one processor is configured to
receive the first voice signal through a microphone included in the electronic device and receive the second voice signal from an external device through a communication unit.

The electronic device of claim 10,
wherein the at least one processor is configured to
determine whether the first voice data is voice data corresponding to the first registered speaker based on at least one of face recognition based on an image captured through a camera and fingerprint recognition using a fingerprint recognition sensor.

The electronic device of claim 10,
wherein the at least one processor is configured to,
when the first voice data is not voice data corresponding to the first registered speaker, train the background speaker model based on the first voice data.

The electronic device of claim 10,
wherein the at least one processor is configured to
receive a speaker registration voice signal generated by utterance of a preset sentence, and register the first registered speaker in advance by using the speaker registration voice signal.

The electronic device of claim 14,
wherein the at least one processor is configured to
determine whether the first voice data is voice data corresponding to the first registered speaker based on voice recognition using the first registered speaker model.

The electronic device of claim 10,
wherein the at least one processor is configured to
extract a feature vector corresponding to the first voice data and train the first registered speaker model based on a maximum a posteriori (MAP) adaptation algorithm using the feature vector and the background speaker model.

The electronic device of claim 10,
wherein the at least one processor is configured to
determine whether the first voice data is voice data corresponding to a second registered speaker registered in the electronic device and, when the first voice data is voice data corresponding to the second registered speaker, train, based on the first voice data and the background speaker model, a second registered speaker model for recognizing the voice of the second registered speaker.

The electronic device of claim 10,
wherein the at least one processor is configured to
receive, from an external device, a third registered speaker model for recognizing the voice of the first registered speaker, and perform voice recognition of the first registered speaker using the first registered speaker model and the third registered speaker model.

A recording medium storing a program for executing a method, the method comprising:
receiving a first voice signal through a first channel and receiving a second voice signal through a second channel different from the first channel;
obtaining first voice data from the first voice signal and obtaining second voice data from the second voice signal;
training, based on the second voice data, a background speaker model obtained based on voice signals of a plurality of speakers;
determining whether the first voice data is voice data corresponding to a first registered speaker registered in an electronic device; and
when the first voice data is voice data corresponding to the first registered speaker, training, based on the first voice data and the background speaker model, a first registered speaker model for recognizing the voice of the first registered speaker.

The recording medium of claim 19,
wherein receiving the first voice signal through the first channel and receiving the second voice signal through the second channel different from the first channel comprises:
receiving the first voice signal through a microphone; and
receiving the second voice signal through a communication unit.