KR20200040425A

KR20200040425A - Speaker recognition apparatus and operation method thereof

Info

Publication number: KR20200040425A
Application number: KR1020180120294A
Authority: KR
Inventors: 박재한; 이가희; 황원준
Original assignee: 주식회사 케이티
Priority date: 2018-10-10
Filing date: 2018-10-10
Publication date: 2020-04-20
Also published as: KR102621897B1

Abstract

The present invention relates to a speaker recognition method that may be applied to an AI home assistant service. The speaker recognition method comprises the steps of: receiving a voice signal of a home user who requests voice registration; measuring the similarity between the voice of the user requesting voice registration and a previously registered user voice; comparing the measured similarity score with a reference score for determining whether the user belongs to a special group; and setting a reference score for speaker identification based on the comparison result.

Description

SPEAKER RECOGNITION APPARATUS AND OPERATION METHOD THEREOF

The present invention relates to a user-customized speaker recognition apparatus and an operation method thereof, and more specifically to a speaker recognition apparatus, and an operation method thereof, capable of automatically updating the reference score for speaker identification according to the voice similarity between users in a home.

With the recent rapid development of information and communication technology, interest in and demand for Internet of Things (IoT) technology are increasing sharply. IoT can be defined in various ways depending on the point of view. In essence, however, IoT is an intelligent information and communication technology or service that connects various things to a communication network over the Internet and thereby enables communication between people and things and between things themselves.

IoT technology is applied in various technical fields such as the smart home, smart health, and the smart car. In particular, smart home services that combine IoT technology with home network systems are being actively researched.

A smart home service refers to the overall system that seeks to improve living standards through IoT devices in a residential environment where a communication network has been built. Home appliances such as TVs, refrigerators, and air conditioners, energy-consuming equipment such as electricity and water systems, and security services are connected to a communication network so that a user can remotely check and control the situation in the home in real time through a smartphone or a voice controller (or AI speaker). In particular, smart home services that remotely control the IoT devices in a home through an AI speaker have recently been increasing.

An AI speaker can provide an interactive AI home assistant service by using voice recognition technology and artificial intelligence technology. Here, an interactive AI home assistant service means, literally, a service in which an artificial intelligence in the home identifies the speaker and acts as that speaker's home assistant. Such an AI speaker can provide various services such as a personal schedule management service, an SNS management service, an app execution service, an Internet shopping service, an email management service, a messenger management service, a multimedia playback service, a weather/traffic/travel information service, and an IoT device control service.

A speaker recognition apparatus measures the similarity between a user voice input through the AI speaker and previously registered user voices, and determines that the user is a registered user when the measured similarity score exceeds a reference score (threshold).
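The decision rule described above can be sketched as follows. This is a minimal illustration, not the apparatus's actual implementation: the function name `identify_speaker`, the dictionary of per-user scores, and the choice to consider only the best-scoring registered user are assumptions made for the example. Only the strict "score exceeds the reference score" comparison comes from the text.

```python
from typing import Mapping, Optional


def identify_speaker(scores: Mapping[str, float], threshold: float) -> Optional[str]:
    """Return the best-matching registered user, or None if no score exceeds the threshold.

    `scores` maps a registered user ID (hypothetical naming) to the similarity
    score between the input voice and that user's registered voice.
    """
    if not scores:
        return None
    # Pick the registered user whose voice is most similar to the input.
    best_user = max(scores, key=lambda user: scores[user])
    # The text requires the score to exceed (not merely equal) the reference score.
    return best_user if scores[best_user] > threshold else None
```

With a single fixed threshold, two registered users whose voices score highly against each other can both pass this check, which is the weakness the following paragraph describes.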

However, because a conventional speaker recognition apparatus sets and uses only one reference score for speaker identification, the speaker recognition rate drops for special groups, such as brothers, sisters, and twins, whose voices are more similar to one another than those of the general population. Therefore, a method is needed for accurately identifying the speaker even when siblings or twins with similar voices live in the same home.

The present invention aims to solve the above and other problems. Another object is to provide a speaker recognition apparatus, and an operation method thereof, that can automatically update the reference score for speaker identification based on the similarity between the voice of a user attempting voice registration and previously registered user voices.

Yet another object is to provide a speaker recognition apparatus, and an operation method thereof, that can identify the speaker by using the automatically updated reference score when the previously registered home users belong to a special group.

According to one aspect of the present invention, to achieve the above or other objects, there is provided a speaker recognition method comprising: receiving a voice signal of a home user requesting voice registration; measuring the similarity between the voice of the user requesting voice registration and a previously registered user voice; comparing the measured similarity score with a reference score for determining whether a special group is present; and setting a reference score for speaker identification based on the comparison result.

More preferably, in the setting step, the reference score for speaker identification is maintained when the measured similarity score is less than or equal to the reference score for determining whether a special group is present, and is updated when the measured similarity score is greater than that reference score. The reference score for speaker identification is updated based on the measured similarity score.
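The keep-or-update branch of the setting step can be sketched as below. The branch condition follows the text. The update formula does not: this passage says only that the new reference score is "based on the measured similarity score", so the rule used here (measured similarity plus a small `margin`, never lowering the current value) is a hypothetical stand-in, and all names are illustrative.

```python
def set_identification_threshold(
    similarity: float,
    group_threshold: float,
    current_threshold: float,
    margin: float = 0.05,  # hypothetical safety margin, not from the source
) -> float:
    """Return the reference score to use for speaker identification.

    similarity:        score between the registering voice and a registered voice
    group_threshold:   reference score for deciding whether a special group exists
    current_threshold: reference score currently used for speaker identification
    """
    if similarity <= group_threshold:
        # Not a special group: keep the existing reference score.
        return current_threshold
    # Special group: update based on the measured similarity (assumed rule).
    return max(current_threshold, similarity + margin)
```

Under this assumed rule, raising the reference score above the similarity measured between two similar-sounding family members keeps one of them from being accepted as the other.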

More preferably, in the receiving step, the voice signal of the home user is received from an AI speaker. The speaker recognition method further comprises extracting feature values of the voice signal received from the AI speaker and generating a speaker model corresponding to the home user based on the extracted feature values.

More preferably, the speaker recognition method further comprises receiving a voice signal of a home user requesting voice recognition, and measuring the similarity between the voice of the user requesting voice recognition and a previously registered user voice. The speaker recognition method further comprises identifying the speaker by comparing the similarity score between the voice of the user requesting voice recognition and the previously registered user voice with the set reference score.

According to another aspect of the present invention, there is provided a program stored in a computer-readable recording medium so that the following processes are executed on a computer: receiving a voice signal of a home user requesting voice registration; measuring the similarity between the voice of the user requesting voice registration and a previously registered user voice; comparing the measured similarity score with a reference score for determining whether a special group is present; and setting a reference score for speaker identification based on the comparison result.

According to yet another aspect of the present invention, there is provided a speaker recognition apparatus comprising: a voice receiving unit that receives a voice signal of a home user requesting voice registration; a similarity measuring unit that measures the similarity between the voice of the user requesting voice registration and a previously registered user voice; and a reference score setting unit that sets a reference score for speaker identification by comparing the measured similarity score with a reference score for determining whether a special group is present.

More preferably, the speaker recognition apparatus further comprises a voice feature extraction unit that extracts feature values of the voice signal using a predetermined voice feature extraction algorithm. The speaker recognition apparatus further comprises a speaker model generation unit that generates a speaker model corresponding to the home user based on the feature values of the voice signal. The speaker recognition apparatus further comprises a speaker identification unit that identifies the speaker by comparing the similarity score between the voice of a user attempting voice recognition and the previously registered user voices with the reference score.

The effects of the speaker recognition apparatus and the operation method thereof according to embodiments of the present invention are as follows.

According to at least one of the embodiments of the present invention, the reference score for speaker identification is automatically updated according to the similarity between the voice of a user attempting voice registration and previously registered user voices, so that the speaker can be identified accurately even for special groups, such as brothers, sisters, and twins, whose voices are very similar.

However, the effects achievable by the speaker recognition apparatus and the operation method thereof according to embodiments of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

FIG. 1 is a block diagram of a speaker recognition system according to an embodiment of the present invention;
FIG. 2 is a block diagram of an AI speaker according to an embodiment of the present invention;
FIG. 3 is a block diagram of a speaker recognition apparatus according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a voice registration method according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating a method of updating the reference score for speaker identification;
FIG. 6 is a flowchart illustrating a speaker recognition method according to an embodiment of the present invention.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. The same or similar components are given the same reference numerals regardless of the figure, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification and do not in themselves have distinct meanings or roles. That is, the term "unit" as used in the present invention refers to software or to a hardware component such as an FPGA or an ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuitry, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and "units" may be combined into a smaller number of components and "units" or further divided among additional components and "units".

In addition, in describing the embodiments disclosed in this specification, detailed descriptions of related known technologies are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to aid understanding of the embodiments disclosed in this specification. They do not limit the technical idea disclosed herein, which should be understood to cover all modifications, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

The present invention proposes a speaker recognition apparatus, and an operation method thereof, that can automatically update the reference score for speaker identification based on the similarity between the voice of a user attempting voice registration and previously registered user voices. The present invention also provides a speaker recognition apparatus, and an operation method thereof, that can identify the speaker by using the automatically updated reference score when the previously registered home users belong to a special group.

Hereinafter, various embodiments of the present invention are described in detail with reference to the drawings.

FIG. 1 is a block diagram of a speaker recognition system according to an embodiment of the present invention.

Referring to FIG. 1, a speaker recognition system 100 according to an embodiment of the present invention may include a router 110 and an AI speaker 120 located inside the home, and an Internet network 130 and an Internet service providing server 140 located outside the home. The Internet service providing server 140 may include a speaker recognition server (or speaker recognition apparatus) for identifying the voices of users in the home.

The router (access point) 110 is a device that provides wired/wireless connectivity so that the AI speaker 120 and a plurality of IoT devices (not shown) located in the home can access the Internet network 130. The router 110 may share the Internet address (IP address) provided by the Internet service provider so that multiple terminals in the home can use it together.

The AI speaker (or voice controller) 120 can be connected to the router 110 or a communication modem (not shown) through a wired/wireless communication interface, and can communicate with the externally located Internet service providing server 140 through the router 110 or the communication modem.

The AI speaker 120 may be directly connected to a plurality of IoT devices through a wired/wireless communication interface and communicate with those IoT devices. In another embodiment, although not shown in the drawing, the AI speaker 120 may be connected to a home hub (not shown) through a wired/wireless communication interface and communicate with the plurality of IoT devices through the home hub.

The AI speaker 120 may provide an AI home assistant service that responds to the voice commands of users in the home. For example, in response to a user's voice command, the AI speaker 120 may check the status of the IoT devices in the home or control their operation.

When a plurality of user accounts are registered with the AI home assistant service, the AI speaker 120 may provide the user voice input through its microphone to the Internet service providing server 140 so that the voices of the users in the home can be identified (distinguished).

The Internet service providing server 140 may connect to the router 110 in the home through the Internet network 130 and communicate with the AI speaker 120 through the router 110.

The Internet service providing server 140 may identify the speaker by analyzing, through the speaker recognition server, the user's voice signal received from the AI speaker 120. If the speaker identification shows that the user requesting voice recognition is a registered home user, the Internet service providing server 140 may work with the AI speaker 120 to provide an AI home assistant service that responds to the voice commands of the home users.

FIG. 2 is a block diagram of an AI speaker according to an embodiment of the present invention.

Referring to FIG. 2, an AI speaker 200 according to an embodiment of the present invention may include a communication unit 210, an input unit 220, an output unit 230, a memory 240, and a control unit 250. The components shown in FIG. 2 are not all essential to implementing the AI speaker 200, so the AI speaker described in this specification may have more or fewer components than those listed above.

The communication unit 210 may include a wired communication module for supporting wired communication and a short-range communication module for supporting short-range wireless communication. The wired communication module transmits and receives wired signals to and from at least one of a communication modem, a router, and an IoT device over a wired communication network built according to technical standards or communication schemes for wired communication (for example, Ethernet, PLC (Power Line Communication), Home PNA, and IEEE 1394). The short-range communication module may support short-range wireless communication using at least one of Bluetooth, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), UWB (Ultra-Wideband), ZigBee, NFC (Near Field Communication), Wi-Fi (Wireless Fidelity), Wi-Fi Direct, and Wireless USB (Wireless Universal Serial Bus) technologies.

The input unit 220 may include a microphone for audio signal input. The microphone processes external sound signals into electrical voice data. The processed voice data can be used in various ways according to the function being performed (or the application program being executed) by the AI speaker 200. Various noise removal algorithms may be implemented in the microphone to remove noise generated while receiving the external sound signal.

The output unit 230 generates output related to sight, hearing, or touch, and may include at least one of a display unit, a sound output unit, a haptic module, and a light output unit. The sound output unit may output audio data stored in the memory 240. The sound output unit may also output sound signals related to functions performed by the AI speaker 200. The sound output unit may include a speaker, a buzzer, and the like.

The memory 240 stores data supporting the various functions of the AI speaker 200. The memory 240 may store application programs (applications) run on the AI speaker 200 as well as data and instructions for the operation of the AI speaker 200.

The control unit 250 controls operations related to the application programs stored in the memory 240 and, typically, the overall operation of the AI speaker 200. Furthermore, to implement the various embodiments described below on the AI speaker 200 according to the present invention, the control unit 250 may control any one of the components described above or a combination of them.

The control unit 250 may further include a voice recognition module 255. The voice recognition module 255 runs a voice recognition engine to which a voice recognition algorithm is applied and recognizes external voice input through the microphone. That is, the voice recognition module 255 converts the external voice input through the microphone into digital data, amplifies the converted digital data (pre-emphasis), and then detects the start point and end point of the digitized voice. The voice recognition module 255 then extracts voice feature values for the voice between the detected start and end points to recognize a unique voice or timbre.
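The pre-emphasis and start/end point detection steps can be sketched as follows. This is an illustrative sketch under stated assumptions, not the module's actual algorithm: the first-order pre-emphasis filter with coefficient 0.97 is a common convention, and the energy-threshold endpoint detector, its frame length, and its threshold are hypothetical choices, since the text does not say how the endpoints are detected.

```python
from typing import List, Optional, Tuple


def pre_emphasize(samples: List[float], coeff: float = 0.97) -> List[float]:
    """Boost high frequencies with the first-order filter y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor and is passed through
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - coeff * prev)
    return out


def detect_endpoints(
    samples: List[float], frame_len: int = 4, energy_threshold: float = 0.01
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the voiced region, end exclusive.

    A frame counts as speech when its mean squared amplitude exceeds
    `energy_threshold` (assumed criterion). Returns None if no frame qualifies.
    """
    active_starts = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(s * s for s in frame) / len(frame)
        if energy > energy_threshold:
            active_starts.append(start)
    if not active_starts:
        return None
    return active_starts[0], min(active_starts[-1] + frame_len, len(samples))
```

Feature extraction would then run only on the samples between the returned start and end indices.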

Although this embodiment illustrates the voice recognition module 255 as implemented within the control unit 250, this is not limiting, and the module may be configured independently of the control unit 250. Furthermore, the voice recognition module may be implemented on the Internet service providing server 140 that works with the AI speaker 200.

FIG. 3 is a block diagram of a speaker recognition apparatus according to an embodiment of the present invention.

Referring to FIG. 3, a speaker recognition apparatus (or speaker recognition server) 300 according to an embodiment of the present invention may include a voice receiving unit 310, a voice feature extraction unit 320, a speaker model generation unit 330, a similarity measuring unit 340, a reference score setting unit 350, a speaker identification unit 360, and a database 370. The components shown in FIG. 3 are not all essential to implementing the speaker recognition apparatus 300, so the speaker recognition apparatus described in this specification may have more or fewer components than those listed above.

The voice receiving unit 310 may receive the voice signal of a home user from the AI speaker 200. The voice receiving unit 310 may store the voice signal received from the AI speaker 200 in the database (or memory) 370. The voice receiving unit 310 may also provide the voice signal received from the AI speaker 200 to the voice feature extraction unit 320.

The voice feature extraction unit 320 may extract feature values (or feature vectors) of the received voice signal using a predetermined voice feature extraction algorithm. Any one of the MFCC (Mel-Frequency Cepstral Coefficients) algorithm, the PLP (Perceptual Linear Prediction) algorithm, the GTCC (Gammatone Cepstral Coefficients) algorithm, and the ZCPA (Zero-Crossings with Peak Amplitudes) algorithm may be used as the voice feature extraction algorithm.

More preferably, the MFCC algorithm, which offers high accuracy relative to its computational cost, may be used. The MFCC algorithm can extract the feature vector of a voice signal using MFC coefficients. Here, the MFC (mel-frequency cepstrum) is the value obtained by taking the power spectrum computed for the voice data within a unit voice frame, mapping it to the mel-scale frequency domain, which models the frequency response of the human auditory system, and applying the discrete cosine transform (DCT).
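The last stage of this computation can be sketched as below, assuming the mel-band energies for one frame have already been computed (framing, the FFT, and the mel filterbank are omitted). Two details are assumptions beyond the text: the mel mapping uses the widely used formula mel = 2595 * log10(1 + f / 700), and a logarithm is applied to the band energies before the DCT, as in the conventional MFCC formulation. The function names are illustrative.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Map a frequency in Hz to the mel scale (common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def dct_ii(values: List[float]) -> List[float]:
    """Unnormalized DCT-II: X[k] = sum_i x[i] * cos(pi * k * (i + 0.5) / N)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n)
    ]


def cepstrum_from_mel_energies(mel_energies: List[float], num_coeffs: int) -> List[float]:
    """Return the first `num_coeffs` cepstral coefficients for one frame.

    `mel_energies` must be strictly positive mel-band power values.
    """
    log_energies = [math.log(e) for e in mel_energies]  # log step: conventional, assumed
    return dct_ii(log_energies)[:num_coeffs]
```

Keeping only the first few coefficients is the usual way to obtain a compact feature vector per frame; how many to keep is a design choice the text leaves open.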

The speaker model generation unit 330 may generate a speaker model by using, as training data, the voice features of a home user who wishes to register his or her voice. The speaker model generation unit 330 may store the speaker model corresponding to the home user in the database 370. The speaker models stored in the database 370 may be used to identify the voices of the users in the home.

The speaker model generation unit 330 may generate the speaker model using a predetermined statistical modeling technique. The statistical modeling technique may be, but is not limited to, either a Gaussian mixture model (GMM) technique or a support vector machine (SVM) technique.

The similarity measurement unit 340 may measure the similarity between the voice of a user attempting voice registration or voice recognition and the speaker models stored in the database 370. For robust speaker identification, the similarity measurement unit 340 may measure the similarity on a per-voice-frame basis.

The similarity measurement unit 340 may use a predetermined similarity measurement method to measure the similarity between the voice of a user attempting voice registration or voice recognition and the previously registered user voices. The similarity measurement method may be, but is not limited to, either a vector quantization (VQ)-based method or a dynamic time warping (DTW)-based method.
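As an illustration of the DTW-based option, the sketch below aligns two sequences of feature vectors with classic dynamic time warping and maps the resulting distance to a score. The text does not specify how a distance becomes a similarity score, so the 0 to 100 mapping in `similarity_score` is purely hypothetical, as are the function names. Both sequences are assumed to be non-empty.

```python
import math
from typing import Sequence


def dtw_distance(seq_a: Sequence[Sequence[float]],
                 seq_b: Sequence[Sequence[float]]) -> float:
    """Classic O(len_a * len_b) DTW with Euclidean distance between frames."""
    inf = float("inf")
    rows, cols = len(seq_a), len(seq_b)
    # cost[i][j] = minimal accumulated cost of aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            frame_dist = math.dist(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = frame_dist + min(
                cost[i - 1][j],      # advance in seq_a only
                cost[i][j - 1],      # advance in seq_b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[rows][cols]


def similarity_score(seq_a: Sequence[Sequence[float]],
                     seq_b: Sequence[Sequence[float]]) -> float:
    """Hypothetical mapping from DTW distance to a score in (0, 100]."""
    avg_dist = dtw_distance(seq_a, seq_b) / (len(seq_a) + len(seq_b))
    return 100.0 / (1.0 + avg_dist)
```

Identical sequences score 100, and the score falls toward 0 as the length-normalized alignment cost grows.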

The reference score setting unit 350 may set a reference score (or threshold) for general speaker identification. The reference score is set to a value that optimizes the speaker recognition rate, and it is compared with a similarity score to identify the speaker.

The reference score setting unit 350 may determine, based on the voice similarity between previously registered in-home users, whether those users form a normal group or a special group. That is, when the voice similarity between in-home users exceeds the reference score for determining whether a special group exists (hereinafter referred to as the 'first reference score' for convenience of description), the reference score setting unit 350 may classify (recognize) those users as a 'special group'. Conversely, when the voice similarity between the in-home users is less than or equal to the first reference score, the reference score setting unit 350 may classify (recognize) those users as a 'normal group'.

The reference score setting unit 350 may either maintain or update the reference score for speaker identification (hereinafter referred to as the 'second reference score' for convenience of description), depending on whether the previously registered in-home users form a normal group or a special group. That is, the reference score setting unit 350 maintains the second reference score when the previously registered in-home users form a normal group, and updates it when they form a special group.

The updated reference score may be set based on the similarity score between the in-home users belonging to the special group. For example, the updated reference score may be set to, but is not limited to, 110% of the similarity score between the voices of the users belonging to the special group.
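The maintain-or-update rule for the second reference score can be sketched as below. The function and parameter names are hypothetical, and the 1.10 factor is only the 110% example given above, not a fixed part of the method.

```python
def classify_group(similarity: float, first_reference_score: float) -> str:
    """Return 'special' when the similarity exceeds the first reference score."""
    return "special" if similarity > first_reference_score else "normal"


def next_second_reference_score(current_score: float,
                                similarity: float,
                                first_reference_score: float,
                                margin: float = 1.10) -> float:
    """Keep the current threshold for a normal group; otherwise derive a new one.

    `margin` = 1.10 mirrors the 110% example and is an illustrative default.
    """
    if classify_group(similarity, first_reference_score) == "normal":
        return current_score
    return similarity * margin
```

For instance, with a first reference score of 70, a similarity of 80 yields an updated threshold of about 88, while a similarity of 20 leaves the current threshold unchanged.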

The speaker identification unit 360 may identify the speaker by comparing the second reference score with the similarity scores between the voice of a user attempting voice recognition and the previously registered speaker models. For example, when the similarity score between the voice of the user attempting voice recognition and a specific speaker model exceeds the second reference score, the speaker identification unit 360 may recognize that user as the user corresponding to the specific speaker model.
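A minimal sketch of this comparison follows. The text only states that a score exceeding the second reference score leads to recognition; choosing the highest-scoring model when several are available is an added assumption, and the function name and the dictionary layout are hypothetical.

```python
from typing import Dict, Optional


def identify_speaker(scores: Dict[str, float],
                     second_reference_score: float) -> Optional[str]:
    """Return the best-matching enrolled speaker, or None if no score exceeds the threshold.

    `scores` maps an enrolled speaker's identifier to the similarity score
    between the incoming voice and that speaker's model.
    """
    if not scores:
        return None
    best_id = max(scores, key=scores.get)  # assumption: pick the highest score
    return best_id if scores[best_id] > second_reference_score else None
```

Returning `None` models the case in which the voice matches no registered user closely enough to be accepted.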

The database (or memory) 370 stores data supporting various functions of the speaker recognition device 300. The database 370 may store a plurality of application programs (or applications) run on the speaker recognition device 300, as well as data and instructions for the operation of the speaker recognition device 300.

The database 370 may store information on the voice signals of in-home users acquired through the voice receiving unit 310, information on the reference scores set by the reference score setting unit 350, information on the speaker models generated by the speaker model generation unit 330, and the like.

As described above, the speaker recognition device according to the present invention automatically updates the reference score for speaker identification according to the similarity between the voice of the user attempting voice registration and the previously registered user voices, and can therefore identify the speaker accurately even in a special group whose members have very similar voices, such as brothers, sisters, or twins.

FIG. 4 is a flowchart illustrating a voice registration method according to an embodiment of the present invention.

Referring to FIG. 4, an in-home user may request registration of his or her voice with the AI home assistant service (S410). At this time, the in-home user may input his or her voice signal through the microphone of the AI speaker 200.

The speaker recognition device 300 may receive the voice signal of the in-home user from the AI speaker 200 and store data on the received voice signal in the database (S420).

The speaker recognition device 300 may analyze the voice signal received from the AI speaker 200 and extract feature values (or feature vectors) of the voice signal (S430). At this time, the speaker recognition device 300 may extract the voice feature values of the in-home user using a predetermined voice feature extraction algorithm.

The speaker recognition device 300 may generate a speaker model of the in-home user by using the user's voice feature values as training data, and may store the generated speaker model in the database (S440). At this time, the speaker recognition device 300 may generate the speaker model using either the GMM or the SVM statistical modeling technique.

The speaker recognition device 300 may measure the similarity between the voice of the user requesting voice registration and the previously registered user voices using a predetermined similarity measurement method (S450).

The speaker recognition device 300 may check whether the measured similarity score exceeds the reference score for determining whether a special group exists (that is, the first reference score) (S460).

If the check in step S460 shows that the measured similarity score is less than or equal to the first reference score, the speaker recognition device 300 may recognize the user requesting voice registration and the previously registered in-home user as a normal group and maintain the reference score for speaker identification (that is, the second reference score) (S470).

On the other hand, if the check in step S460 shows that the measured similarity score is greater than the first reference score, the speaker recognition device 300 may recognize the user requesting voice registration and the previously registered in-home user as a special group and update the reference score for speaker identification (S480). At this time, the speaker recognition device 300 may update the reference score for speaker identification using the measured similarity score.

When the previously registered in-home users form a normal group, the speaker recognition device 300 may identify the speaker using the initially set reference score. When the previously registered in-home users form a special group, the speaker recognition device 300 may update the initially set reference score and identify the speaker using the updated value.

As described above, the speaker recognition device according to the present invention extracts the voice features of a user requesting voice registration to generate a speaker model, and can automatically update the reference score for speaker identification according to the similarity between the voice of the user requesting voice registration and the previously registered user voices.

FIG. 5 is a diagram illustrating a method of updating the reference score for speaker identification.

Referring to FIG. 5, it is assumed that, among the members of the household, the voices of twin B and the father are already registered with the AI home assistant service, and that twin A is requesting voice registration with the service.

When twin A requests voice registration through the AI speaker 200, the speaker recognition device 300 may generate a speaker model corresponding to twin A based on the voice signal received from the AI speaker 200 (S510).

The speaker recognition device 300 may measure the similarity between the voice of twin A and the previously registered voice of the father (S520). In this embodiment, it is assumed that the similarity score between the voice of twin A and the voice of the father is 20, and that the reference score for determining whether a special group exists is 70.

The speaker recognition device 300 may check whether the measured similarity score exceeds the reference score for determining whether a special group exists (S530). Because the similarity score between twin A and the father (20) is smaller than the reference score for determining whether a special group exists (70), the speaker recognition device 300 may recognize twin A and the father as a normal group and keep the reference score for speaker identification unchanged (S540).

Thereafter, the speaker recognition device 300 may measure the similarity between the voice of twin A and the previously registered voice of twin B (S550). In this embodiment, it is assumed that the similarity score between the voice of twin A and the voice of twin B is 80.

The speaker recognition device 300 may check whether the measured similarity score exceeds the reference score for determining whether a special group exists (S560). Because the similarity score between twin A and twin B (80) is greater than the reference score for determining whether a special group exists (70), the speaker recognition device 300 may recognize twin A and twin B as a special group and update the reference score for speaker identification (S570). At this time, the speaker recognition device 300 may update the reference score for speaker identification based on the similarity score between twin A and twin B (80).

As described above, if even one of the previously registered in-home users is in a special group relationship with the user requesting voice registration, the reference score for speaker identification may be updated. The updated reference score may then be used as the new reference score for subsequent speaker identification.
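The FIG. 5 walk-through can be reproduced with a short loop over the enrolled users. This is a sketch under stated assumptions: the initial second reference score of 50, the 110% margin, the use of `max` to keep the largest candidate when several special pairs exist, and all identifiers are hypothetical. Only the scores 20 and 80 and the first reference score of 70 come from the example.

```python
from typing import Dict, List, Tuple

FIRST_REFERENCE_SCORE = 70.0  # special-group threshold from the FIG. 5 example
MARGIN = 1.10                 # 110% example margin (an assumption, not a fixed rule)


def register_user(similarities: Dict[str, float],
                  second_reference_score: float) -> Tuple[float, List[str]]:
    """Compare a new user against every enrolled user and update the threshold.

    `similarities` maps each enrolled user to the similarity score with the new user.
    Returns the (possibly updated) second reference score and the list of
    enrolled users who form a special group with the new user.
    """
    special_partners = []
    for enrolled_user, score in similarities.items():
        if score > FIRST_REFERENCE_SCORE:
            special_partners.append(enrolled_user)
            # Assumption: never lower the threshold; keep the largest candidate.
            second_reference_score = max(second_reference_score, score * MARGIN)
    return second_reference_score, special_partners


# Twin A registers; the father and twin B are already enrolled.
threshold, partners = register_user({"father": 20.0, "twin_b": 80.0}, 50.0)
print(round(threshold, 2), partners)
```

With these inputs the father leaves the threshold untouched, while twin B triggers an update to roughly 88.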

FIG. 6 is a flowchart illustrating a speaker recognition method according to an embodiment of the present invention.

Referring to FIG. 6, the speaker recognition device 300 may receive a voice signal of an in-home user from the AI speaker 200 (S610).

The speaker recognition device 300 may analyze the voice signal received from the AI speaker 200 and extract feature values (or feature vectors) of the voice signal (S620). At this time, the speaker recognition device 300 may extract the voice feature values of the in-home user using a predetermined voice feature extraction algorithm.

The speaker recognition device 300 may measure the similarity between the voice features of the user requesting voice recognition and the voice features of the previously registered users using a predetermined similarity measurement method (S630).

The speaker recognition device 300 may compare the measured similarity score with the reference score for speaker identification (S640). Here, when at least two of the previously registered in-home users form a special group, the reference score for speaker identification may be updated to a value greater than the initially set reference score in order to increase the accuracy of speaker recognition.

If the comparison in step S640 shows that the measured similarity score exceeds the reference score for speaker identification, the speaker recognition device 300 may recognize the in-home user attempting voice recognition as the in-home user corresponding to a specific previously registered speaker model.
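The benefit of raising the reference score for a special group can be illustrated with hypothetical numbers. In the sketch below, twin A speaks and scores 95 against twin A's own model, 80 against twin B's model, and 20 against the father's model. The initial reference score of 50 and the score of 95 are assumptions for illustration; 88 is the 110% example value derived from a twin-to-twin similarity of 80.

```python
from typing import Dict, List


def speakers_above_threshold(scores: Dict[str, float],
                             reference_score: float) -> List[str]:
    """Return every enrolled speaker whose similarity score exceeds the threshold."""
    return sorted(name for name, score in scores.items() if score > reference_score)


scores_for_twin_a = {"twin_a": 95.0, "twin_b": 80.0, "father": 20.0}

# With the initial threshold both twins pass, so the result is ambiguous.
print(speakers_above_threshold(scores_for_twin_a, 50.0))  # ['twin_a', 'twin_b']

# With the updated threshold only the true speaker passes.
print(speakers_above_threshold(scores_for_twin_a, 88.0))  # ['twin_a']
```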

As described above, when the previously registered in-home users form a special group, the speaker recognition device according to the present invention applies the updated reference score to the speaker recognition algorithm and can thereby accurately identify (distinguish) the voices of the users belonging to the special group.

The present invention described above can be implemented as computer-readable code on a medium on which a program is recorded. The computer-readable medium includes every kind of recording device that stores data readable by a computer system. Examples of the computer-readable medium include a hard disk drive (HDD), a solid state disk (SSD), a silicon disk drive (SDD), ROM, RAM, CD-ROM, magnetic tape, a floppy disk, and an optical data storage device. The computer may also include the control unit of a terminal. Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by a reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in the scope of the present invention.

100: speaker recognition system 110: router
120/200: AI speaker 130: Internet network
140: Internet service providing server 300: speaker recognition device
310: voice receiving unit 320: voice feature extraction unit
330: speaker model generation unit 340: similarity measurement unit
350: reference score setting unit 360: speaker identification unit
370: database

Claims

1. A speaker recognition method comprising:
receiving a voice signal of an in-home user requesting voice registration;
measuring a similarity between the voice of the user requesting the voice registration and a previously registered user voice;
comparing the measured similarity score with a reference score for determining whether a special group exists; and
setting a reference score for speaker identification based on a result of the comparison.

2. The method of claim 1, wherein the receiving comprises
receiving the voice signal of the in-home user from an AI speaker.

3. The method of claim 1, wherein the setting comprises
maintaining the reference score for speaker identification when the measured similarity score is less than or equal to the reference score for determining whether a special group exists.

4. The method of claim 1, wherein the setting comprises
updating the reference score for speaker identification when the measured similarity score is greater than the reference score for determining whether a special group exists.

5. The method of claim 4,
wherein the reference score for speaker identification is updated based on the measured similarity score.

6. The method of claim 1,
further comprising extracting feature values of the received voice signal and generating a speaker model corresponding to the in-home user based on the extracted feature values.

7. The method of claim 1, further comprising:
receiving a voice signal of an in-home user requesting voice recognition; and
measuring a similarity between the voice of the user requesting the voice recognition and the previously registered user voice.

8. The method of claim 7,
further comprising identifying a speaker by comparing the set reference score with the similarity score between the voice of the user requesting the voice recognition and the previously registered user voice.

9. A program stored on a computer-readable recording medium so that the method according to any one of claims 1 to 8 is executed on a computer.

10. A speaker recognition apparatus comprising:
a voice receiving unit that receives a voice signal of an in-home user requesting voice registration;
a similarity measurement unit that measures a similarity between the voice of the user requesting the voice registration and a previously registered user voice; and
a reference score setting unit that compares the measured similarity score with a reference score for determining whether a special group exists and sets a reference score for speaker identification.

11. The apparatus of claim 10,
further comprising a voice feature extraction unit that extracts feature values of the received voice signal using a predetermined voice feature extraction algorithm.

12. The apparatus of claim 11,
further comprising a speaker model generation unit that generates a speaker model corresponding to the in-home user based on the feature values of the voice signal.

13. The apparatus of claim 10,
further comprising a speaker identification unit that identifies a speaker by comparing the set reference score with the similarity scores between the voice of a user attempting voice recognition and the previously registered user voices.