KR102075670B1

KR102075670B1 - Speaker recognition method and system using age

Info

Publication number: KR102075670B1
Application number: KR1020180120691A
Authority: KR
Inventors: 허희수; 유하진; 양일호; 윤성현; 정지원; 심혜진
Original assignee: 서울시립대학교 산학협력단
Priority date: 2018-10-10
Filing date: 2018-10-10
Publication date: 2020-03-02

Abstract

According to an embodiment of the present invention, a method for recognizing a speaker capable of reducing misidentification between family members comprises the steps of: inputting a plurality of registration voices corresponding to a plurality of speakers; generating a plurality of speaker models corresponding to the speakers by applying an age estimation model to each of the registration voices; estimating age of each of the speakers using the age estimation model; registering the speaker models and estimated ages; inputting a novel voice; setting a candidate group consisting of part of the speakers using the novel voice and estimated ages; and identifying a speaker corresponding to the novel voice from the candidate group.

Description

Speaker recognition method and system using age information {SPEAKER RECOGNITION METHOD AND SYSTEM USING AGE}

The present invention relates to a speaker recognition method and a speaker recognition system.

As electronic devices that combine many functions, such as smartphones, have been developed, devices equipped with a voice recognition function have been released to improve operability. Demand for speaker recognition technology based on such voice recognition is growing as voice-based services become widespread, and the technology has the advantage that a device can be controlled easily by recognizing the speaker, without operating a separate button or touching a touch module. However, the performance of conventional speaker recognition technology may degrade when it is applied to family members. Specifically, the degradation may be caused by the similarity of voices across generations within a family.

The present invention is intended to overcome the above problem by accurately recognizing a speaker among family members.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned here will be clearly understood by those skilled in the art from the description of the present invention.

A speaker recognition method according to an embodiment includes: inputting a plurality of registration utterances corresponding to a plurality of speakers; generating a plurality of speaker models corresponding to the plurality of speakers by applying an age estimation model to each of the plurality of registration utterances; estimating the age of each of the plurality of speakers using the age estimation model; registering the plurality of speaker models and the plurality of estimated ages; inputting a new utterance; setting a candidate group consisting of some of the plurality of speakers using the new utterance and the estimated ages; and identifying, within the candidate group, the speaker corresponding to the new utterance.

In addition, the candidate group setting step of the speaker recognition method according to the embodiment includes: generating a first estimated age corresponding to the new utterance by applying the age estimation model to the new utterance; and setting, as the candidate group, the speakers whose registered estimated age differs from the first estimated age by no more than a predetermined threshold.

In addition, the step of identifying the speaker in the speaker recognition method according to the embodiment includes: generating a speaker score using the registered speaker model, the registered estimated age, and the first estimated age; and identifying the speaker with the highest speaker score among the plurality of speakers as the speaker corresponding to the new utterance.

In addition, the speaker score is calculated by comparing the registered speaker model with the new utterance to compute the similarity between the registered speaker's utterance and the new utterance, and then combining that similarity with the probability of the observed difference between the first estimated age and the age of the speaker corresponding to the new utterance.

In addition, the probability takes a higher value as the difference between the first estimated age and the age of the speaker corresponding to the new utterance becomes smaller.

In addition, the plurality of speakers included in the candidate group have different genders from one another.

In addition, the new utterance is an utterance of one of the plurality of speakers.

Further, a speaker recognition system according to another embodiment is configured to perform any one of the steps described in the speaker recognition method.

The speaker recognition method and the speaker recognition system according to the present invention have the effect of reducing misidentification between family members.

FIG. 1 is a flowchart illustrating a speaker recognition method according to an embodiment.
FIG. 2 is a diagram illustrating a speaker registration step according to an embodiment.
FIG. 3 is a diagram illustrating a candidate group setting step according to an embodiment.
FIG. 4 is a diagram illustrating an identification step according to an embodiment.

Hereinafter, the embodiments disclosed in this specification are described in detail with reference to the accompanying drawings. Identical or similar components are given identical or similar reference numerals, and redundant descriptions of them are omitted. The suffixes "module" and "unit" used for components in the following description are assigned or used interchangeably only for ease of drafting the specification, and do not in themselves have distinct meanings or roles. In describing the embodiments disclosed herein, detailed descriptions of related known technology are omitted where they could obscure the gist of the embodiments. The accompanying drawings are provided only to make the embodiments easier to understand; they do not limit the technical idea disclosed in this specification, which should be understood to include all changes, equivalents, and substitutes falling within the spirit and technical scope of the present invention.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another.

When a component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but it should be understood that other components may exist in between. In contrast, when a component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other component exists in between.

Singular expressions include plural expressions unless the context clearly indicates otherwise.

In this application, terms such as "comprise" or "have" are intended to indicate the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Hereinafter, a speaker recognition method according to an embodiment is described with reference to FIGS. 1 to 4.

FIG. 1 is a flowchart illustrating a speaker recognition method according to an embodiment.

Referring to FIG. 1, the speaker recognition method (1) according to the embodiment includes a speaker registration step (S10), an utterance input step (S20), and an identification step (S30).

Specifically, referring to FIG. 2, the speaker registration step (S10) includes a registration utterance input step (S11), a modeling step (S12), an age estimation step (S13), and a registration step (S14).

In the registration utterance input step (S11), the registration utterances (Vr₁ to Vr_i) of a plurality of speakers (user₁ to user_i) are input. A registration utterance is an utterance that is input and used when a speaker is first registered in the speaker recognition system.

In addition, the age information of each of the speakers (user₁ to user_i) may also be input in the registration utterance input step (S11).

In the modeling step (S12), the age estimation model is applied to the registration utterances (Vr₁ to Vr_i) to generate a plurality of speaker models (userm₁ to userm_i), each containing the age information of the corresponding speaker (user₁ to user_i). Specifically, the age estimation model performs speech modeling of the registration utterances (Vr₁ to Vr_i) to generate an i-vector (identity vector) corresponding to each registration utterance. The utterances of the speakers (user₁ to user_i) can each be represented by an i-vector unique to that speaker, and the i-vector contains various characteristics of the speaker's voice, such as speaker information and phoneme information. Because an i-vector extracts features of the same dimension regardless of the length of the input speech, speech signals of different lengths can be compared effectively. I-vectors extracted from speech with similar characteristics lie close together, while i-vectors extracted from speech with different characteristics lie far apart. The present invention uses i-vectors with these properties as the speaker models (userm₁ to userm_i).
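The distance property described above can be illustrated with a minimal sketch. The patent does not name a distance measure, so cosine similarity is an assumption here, and the four-dimensional vectors are toy stand-ins for real i-vectors, which typically have hundreds of dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two fixed-dimension vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors: two utterances of one speaker (of different
# lengths, but the same vector dimension) and one utterance of another speaker.
ivec_short_utt = [0.9, 0.1, -0.3, 0.4]
ivec_long_utt_same_speaker = [0.8, 0.2, -0.2, 0.5]
ivec_other_speaker = [-0.5, 0.7, 0.6, -0.1]

sim_same = cosine_similarity(ivec_short_utt, ivec_long_utt_same_speaker)
sim_other = cosine_similarity(ivec_short_utt, ivec_other_speaker)
print(sim_same > sim_other)  # True
```

Because every utterance maps to a vector of the same dimension, the comparison needs no alignment between recordings of different durations.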

In the age estimation step (S13), the age of each of the speakers (user₁ to user_i) is estimated using the age estimation model.

Specifically, the age estimation model consists of a deep neural network that takes an i-vector as input and outputs the speaker's age. The deep neural network is trained in advance to estimate a speaker's age from the various information contained in the i-vector. The age estimation model uses the trained deep neural network to estimate the age of each of the speakers (user₁ to user_i).
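As a rough illustration of the mapping from an i-vector to a single age value, the sketch below runs a forward pass through a tiny feed-forward network. The layer sizes, ReLU activation, and hand-set weights are all hypothetical; the patent specifies only that a pre-trained deep neural network is used, not its architecture.

```python
def relu(value):
    return value if value > 0 else 0.0


def dense(inputs, weights, biases):
    """One fully connected layer: each row of weights produces one output."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def estimate_age(ivector, layers):
    """Forward pass: ReLU on hidden layers, linear single-value output."""
    activations = list(ivector)
    for index, (weights, biases) in enumerate(layers):
        activations = dense(activations, weights, biases)
        if index < len(layers) - 1:
            activations = [relu(v) for v in activations]
    return activations[0]


# Hypothetical hand-set weights standing in for a trained network:
# 3 inputs -> 2 hidden units -> 1 output (age in years).
toy_layers = [
    ([[1.0, 0.0, 0.0], [0.0, 1.0, -1.0]], [0.0, 0.0]),
    ([[10.0, 5.0]], [20.0]),
]

age = estimate_age([1.5, 0.5, 1.0], toy_layers)
print(age)  # 35.0
```

In practice the weights would come from training on utterances with known speaker ages; the same function is then applied both to registration utterances and to new utterances.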

In the registration step (S14), the speaker models (userm₁ to userm_i) are registered as registered speakers, and either the estimated age of each speaker (user₁ to user_i) or the age information input for each speaker is registered as the registered age information (age_user₁ to age_user_i). That is, the registered age information (age_user₁ to age_user_i) may be estimated age information or input age information.

When the age information of each speaker (user₁ to user_i) is input in the registration utterance input step (S11), the age estimation step (S13) is omitted, and the input age information is registered as the registered age information (age_user₁ to age_user_i) in the registration step (S14).
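The choice between input age and estimated age at registration can be sketched as follows. The record layout, the `register_speaker` helper, and the stub estimator are hypothetical names introduced for illustration only; the estimator stands in for the age estimation model of step S13.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RegisteredSpeaker:
    speaker_model: List[float]  # stand-in for the i-vector (userm_i)
    registered_age: float       # age_user_i
    age_source: str             # "provided" or "estimated"


def register_speaker(
    registry: Dict[str, RegisteredSpeaker],
    name: str,
    ivector: List[float],
    estimate_age: Callable[[List[float]], float],
    provided_age: Optional[float] = None,
) -> RegisteredSpeaker:
    """Register a speaker; skip age estimation when an age was input."""
    if provided_age is not None:
        age, source = float(provided_age), "provided"
    else:
        age, source = float(estimate_age(ivector)), "estimated"
    registry[name] = RegisteredSpeaker(list(ivector), age, source)
    return registry[name]


def stub_estimator(ivector: List[float]) -> float:
    # Hypothetical placeholder for the trained age estimation model.
    return 27.0


registry: Dict[str, RegisteredSpeaker] = {}
register_speaker(registry, "user1", [0.1, 0.2], stub_estimator, provided_age=25)
register_speaker(registry, "user3", [0.3, -0.1], stub_estimator)
```

Recording where each age came from is a design choice of this sketch, not something the patent requires.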

In the utterance input step (S20), a new utterance (utt, see FIG. 3) is input from a specific speaker who needs to be recognized among the speakers (user₁ to user_i) registered in the registration step (S14).

In the identification step (S30), a candidate group is set using the input new utterance and the registered age information (age_user₁ to age_user_i) of the registered speakers, and the speaker is determined within the candidate group.

Specifically, the identification step (S30) includes a candidate group setting step (S31) and a speaker identification step (S32).

Referring to FIG. 3, in the candidate group setting step (S31), the age estimation model is applied to the new utterance (utt) to generate estimated age information (age_utt), and a candidate group is set that consists of the speakers whose registered age information (age_user₁ to age_user_i) differs from the estimated age information (age_utt) by no more than a predetermined threshold.

Here, the estimated age information (age_utt) is obtained in the same way that the ages of the speakers (user₁ to user_i) are estimated with the age estimation model described with reference to FIG. 2.

For example, assume that in the speaker registration step (S10) the registered age information (age_user₁ to age_user₄) of four speakers (user₁ to user₄) is 25, 46, 27, and 38 years respectively, that the age information (age_utt) estimated from the new utterance (utt) in the utterance input step (S20) is 45 years, and that the predetermined threshold is set to 15 years.

The difference between the registered age information (age_user₁) of speaker user₁ and the estimated age information (age_utt) is |25-45| = 20, which is greater than the predetermined threshold (15). Speaker user₁ is therefore excluded from the candidate group.

In contrast, the difference between the registered age information (age_user₂) of speaker user₂ and the estimated age information (age_utt) is |46-45| = 1, which is smaller than the predetermined threshold (15). Speaker user₂ is therefore included in the candidate group.

In the same way, speaker user₃ (|27-45| = 18) is excluded and speaker user₄ (|38-45| = 7) is included, so the candidate group setting step (S31) limits the candidate group to speakers user₂ and user₄ out of the four speakers (user₁ to user₄). The threshold is assumed to be 15 years for convenience of explanation, but the embodiment is not limited to this value. The threshold may be set between 10 and 20 years, or to any value suitable for separating the generations within a family.
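The worked example above can be reproduced with a short sketch. The function name and the dictionary layout are illustrative assumptions; the ages, the estimated age of 45, and the threshold of 15 come from the example.

```python
def select_candidates(registered_ages, age_utt, threshold):
    """Keep speakers whose registered age is within `threshold` of age_utt."""
    return [
        user
        for user, age_user in registered_ages.items()
        if abs(age_user - age_utt) <= threshold
    ]


# Registered ages from the example in the description.
registered_ages = {"user1": 25, "user2": 46, "user3": 27, "user4": 38}

candidates = select_candidates(registered_ages, age_utt=45, threshold=15)
print(candidates)  # ['user2', 'user4']
```

The comparison uses "less than or equal to", matching the description of step S31; with these ages no difference equals the threshold exactly, so a strict comparison would give the same result.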

The four registered speakers (user₁ to user₄) may be family members. For example, when one family member gives an artificial intelligence speaker the command "play the latest popular songs" and this is input as the new utterance (utt), the system can recognize which family member gave the command from the voice and recommend the latest songs to suit that member's age group or interests.

Here, limiting the candidate group to some of the family members (for example, members of different genders) lowers the likelihood that the speaker is misidentified. That is, family members of the same gender (for example, mother and daughter, or father and son) have similar voices and are likely to be confused with one another, whereas limiting the candidate group to members of different genders can lower the probability of misidentification.

Specifically, in the example above, the candidate group is limited to speakers user₂ and user₄ out of the four speakers (user₁ to user₄). Their ages are 46 and 38 respectively, and they are the father and the mother, so the possibility of misidentification caused by voice similarity between family members of the same gender is reduced.

Referring to FIG. 4, in the speaker identification step (S32), the speaker corresponding to the new utterance (utt) is identified within the candidate group.

Specifically, the speaker identification step (S32) includes a score calculation step (S321) and a speaker determination step (S322).

In the score calculation step (S321), a speaker score is calculated for each speaker by substituting into Equation 1 below: the registered speaker model (userm₁ to userm_i), which contains an i-vector that includes the speaker's age information; the registered age information (age_user₁ to age_user_i) corresponding to the speaker (user₁ to user_i); and the age information (age_utt) estimated from the new utterance (utt).

[Equation 1]

score(user_i, utt) = α * p_spk(utt | userm_i) + β * p_age(age_user_i - age_utt)

score(user_i, utt) is the function that calculates the speaker identification score, and user_i denotes a speaker registered in the registration step (S14). p_spk(utt | userm_i) is a function that compares the registered speaker model (userm_i) of the registered speaker (user_i) with the new utterance (utt) to calculate the similarity between the registered speaker's utterance and the new utterance. age_user_i and age_utt denote the registered age information and the estimated age information respectively, and α and β are constants that set the relative contributions of voice similarity and age information. p_age() denotes the probability that a given difference arises between the age information (age_utt) estimated from the new utterance (utt) and the age of the actual speaker of the new utterance. That is, p_age() produces its highest probability value when the age information (age_utt) estimated from the new utterance (utt) equals the registered speaker's age information (age_user_i). Accordingly, the larger the score calculated by this function, the more likely the new utterance (utt) is to be identified as coming from that registered speaker.
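Equation 1 can be sketched in code under some loudly stated assumptions. The patent does not give the form of p_age() or values for α and β, so the unnormalized Gaussian-shaped weight, the spread `sigma`, the constants, and the similarity values below are all hypothetical. The only property taken from the text is that p_age() peaks when the age difference is zero.

```python
import math


def p_age(age_diff, sigma=5.0):
    """Assumed Gaussian-shaped weight: 1.0 at zero difference, decaying with |age_diff|."""
    return math.exp(-(age_diff ** 2) / (2.0 * sigma ** 2))


def speaker_score(p_spk, age_user, age_utt, alpha=0.7, beta=0.3):
    """Equation 1: alpha * p_spk(utt | userm_i) + beta * p_age(age_user_i - age_utt)."""
    return alpha * p_spk + beta * p_age(age_user - age_utt)


# Candidates from the description's example; the similarity values are made up.
age_utt = 45
score_user2 = speaker_score(p_spk=0.62, age_user=46, age_utt=age_utt)
score_user4 = speaker_score(p_spk=0.60, age_user=38, age_utt=age_utt)
print(score_user2 > score_user4)  # True
```

With nearly equal voice similarity, the age term separates the two candidates: the registered age of 46 is one year from the estimate, while 38 is seven years away.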

In the speaker determination step (S322), the speaker scores are substituted into Equation 2 below to determine which speaker in the candidate group corresponds to the new utterance (utt); this speaker is denoted identified_user.

[Equation 2]

identified_user = argmax(score(user_i, utt))

Specifically, the speaker with the highest speaker score (score(user_i, utt)) is determined to be the speaker (identified_user) corresponding to the new utterance (utt).

For example, of speakers user₂ and user₄ in the candidate group described above, the one with the higher speaker score (score(user_i, utt)) is determined to be the speaker (identified_user) corresponding to the new utterance (utt).
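The argmax of Equation 2 over the candidate group can be sketched as below. The helper name and the score values are hypothetical; only the selection rule (highest score wins, restricted to the candidate group) comes from the description.

```python
def identify_speaker(candidate_scores):
    """Return the candidate with the highest score (Equation 2)."""
    if not candidate_scores:
        raise ValueError("candidate group is empty")
    return max(candidate_scores, key=candidate_scores.get)


# Hypothetical scores for the two candidates left after age filtering.
candidate_scores = {"user2": 0.728, "user4": 0.533}

identified_user = identify_speaker(candidate_scores)
print(identified_user)  # user2
```

An empty candidate group is treated as an error in this sketch; the patent does not say how that case is handled.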

Although the embodiments of the present invention have been described in detail above, the scope of the present invention is not limited to them. Various modifications and improvements made by those skilled in the art using the basic concept of the present invention, as defined in the following claims, also fall within its scope. Accordingly, the above detailed description should not be construed as limiting in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in its scope.

Claims

1. A speaker recognition method comprising:
inputting a plurality of registration utterances corresponding to a plurality of speakers;
generating a plurality of speaker models corresponding to the plurality of speakers by applying an age estimation model to each of the plurality of registration utterances;
estimating the age of each of the plurality of speakers using the age estimation model;
registering the plurality of speaker models and the plurality of estimated ages;
inputting a new utterance;
setting a candidate group consisting of some of the plurality of speakers using the new utterance and the estimated ages; and
identifying a speaker corresponding to the new utterance within the candidate group,
wherein the candidate group setting step comprises:
generating a first estimated age corresponding to the new utterance by applying the age estimation model to the new utterance; and
setting, as the candidate group, the speakers corresponding to a registered estimated age for which the difference between the first estimated age and the registered estimated age falls below a predetermined threshold.

2. (Deleted)

3. The method of claim 1,
wherein identifying the speaker comprises:
generating a speaker score using the registered speaker model, the registered estimated age, and the first estimated age; and
identifying the speaker with the highest speaker score among the plurality of speakers as the speaker corresponding to the new utterance.

4. The method of claim 3,
wherein the speaker score is calculated by comparing the registered speaker model with the new utterance to compute the similarity between the registered speaker's utterance and the new utterance, and by using the similarity together with the probability of the difference between the first estimated age and the age of the speaker corresponding to the new utterance.

5. The method of claim 4,
wherein the probability has a higher value as the difference between the first estimated age and the age of the speaker corresponding to the new utterance is smaller.

6. The method of claim 5,
wherein the plurality of speakers included in the candidate group have different genders.

7. The method of claim 6,
wherein the new utterance is an utterance of one of the plurality of speakers.

8. (Deleted)