KR20230046946A

KR20230046946A - Electronic device for identifying a target speaker and an operating method thereof

Info

Publication number: KR20230046946A
Application number: KR1020220083342A
Authority: KR
Inventors: 카이 왕; 샤오레이 장; 미야오 장
Original assignee: 삼성전자주식회사
Priority date: 2021-09-30
Filing date: 2022-07-06
Publication date: 2023-04-06
Also published as: CN113870860A

Abstract

A method of operating an electronic device according to an embodiment includes the operations of: extracting voice characteristics of a target speaker based on a voice input; determining an utterance scenario of the voice input based on voice characteristics of the target speaker; obtaining final voice characteristics of the target speaker based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the final voice characteristics of the target speaker and the final voice characteristics of the user. The utterance scenario may include a single speaker scenario and a multi-speaker scenario. In addition, various embodiments may be possible.

Description

Electronic device for identifying a target speaker and method for operating the same

Various embodiments of the present disclosure relate to an electronic device for identifying a target speaker and an operating method thereof.

With the spread of various electronic devices, the security of electronic devices is becoming more important. In particular, an electronic device should be unlocked only when it is used by a registered user, so that no one other than the user of the electronic device can use it.

For the security of an electronic device, the user's voice may be used as information unique to the user. For example, voiceprint recognition technology (voiceprint identification technology), also referred to as speaker verification technology, may be a technology that uses the user's voice. Voiceprint recognition technology may extract a speaker's voice features from a voice input and use the extracted voice features.

In general, voiceprint recognition technology is divided into two processes: voiceprint registration and voiceprint verification. In the registration process, the electronic device may register the user's information in the electronic device through the user's voice. In the verification process, the electronic device may extract voice features from a received voice and compare the extracted voice features with the voice features of the pre-registered user. For example, the electronic device may calculate a similarity score between the extracted voice features and the voice features of the pre-registered user. The electronic device may determine that the speaker recognized from the voice and the registered user are the same person if the calculated similarity score is greater than a threshold value, and may determine that they are different persons if the calculated similarity score is smaller than the threshold value.

Voiceprint recognition technology may perform voiceprint verification using two independent models (a voice extraction model and a voiceprint recognition model). It extracts information about the speaker's voice from the input voice through the voice extraction model, and then obtains the correspondence between the speaker and the user from that information through the voiceprint recognition model. This two-stage voiceprint verification process uses two independent models, and it may not be easy to tune the two models jointly. Voiceprint recognition technology that implements the two models independently may have low recognition performance when it receives complex voice. A technique may therefore be required that jointly tunes the two models, which perform different operations, so that complex voice is accurately recognized and verified.

Embodiments may provide a technique for accurately recognizing and verifying complex voice through an end-to-end model, and a technique for recognizing and verifying a target speaker even when the input voice does not include a designated keyword.

The technical problems to be solved in this document are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

An operating method of an electronic device according to an embodiment includes: extracting a voice feature of a target speaker based on a voice input; determining an utterance scenario of the voice input based on the voice feature of the target speaker; obtaining a final voice feature of the target speaker based on the determined utterance scenario; and determining whether the target speaker corresponds to a user based on the final voice feature of the target speaker and a final voice feature of the user, wherein the utterance scenario may include a single-speaker scenario and a multi-speaker scenario.

According to an embodiment, the extracting may include: obtaining an original voice feature based on the voice input; and extracting the voice feature of the target speaker by inputting the original voice feature and a middle embedding voice feature of the user to a first network.

According to an embodiment, the first network may include: a first convolution layer that receives the original voice feature and outputs a speaker extraction embedding feature for extracting the voice feature of a speaker included in the voice input; a splicing layer that receives the speaker extraction embedding feature and the middle embedding voice feature of the user and outputs a splicing feature; a second convolution layer that receives the splicing feature and outputs a mask; and a multiplier that receives the mask and the speaker extraction embedding feature and outputs the voice feature of the target speaker.
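The data flow through the four components of the first network can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the disclosed implementation: each convolution layer is reduced to a per-frame linear map (a 1x1 convolution), the mask is squashed with a sigmoid, and the function names, weight shapes, and activation are hypothetical choices that the text does not specify.

```python
import math


def pointwise_conv(frames, weights, bias):
    """Apply the same linear map to every frame (a 1x1 convolution over time)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weights, bias)]
        for frame in frames
    ]


def splice(frames, user_embedding):
    """Splicing layer: concatenate the user's middle embedding onto every frame."""
    return [frame + user_embedding for frame in frames]


def sigmoid(value):
    """Assumed mask activation; the source does not name one."""
    return 1.0 / (1.0 + math.exp(-value))


def first_network(original, user_embedding, conv1, conv2):
    """Return the target speaker's voice feature as mask * extraction embedding."""
    w1, b1 = conv1
    w2, b2 = conv2
    # First convolution layer: speaker extraction embedding feature.
    extraction = pointwise_conv(original, w1, b1)
    # Splicing layer: combine with the user's middle embedding voice feature.
    spliced = splice(extraction, user_embedding)
    # Second convolution layer: one mask value per embedding dimension.
    mask = [[sigmoid(v) for v in frame] for frame in pointwise_conv(spliced, w2, b2)]
    # Multiplier: element-wise product of mask and extraction embedding.
    return [[m * e for m, e in zip(mf, ef)] for mf, ef in zip(mask, extraction)]
```

With an identity first layer and an all-zero second layer, every mask value is 0.5, so the output is simply half of the input. A trained second layer would instead push the mask toward 1 where a frame matches the user and toward 0 elsewhere.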

According to an embodiment, the determining of the utterance scenario of the voice input may include determining the utterance scenario of the voice input by comparing the original voice feature and the voice feature of the target speaker.

According to an embodiment, the determining of the utterance scenario of the voice input by comparing the original voice feature and the voice feature of the target speaker may include: determining the utterance scenario to be the single-speaker scenario when the mean square error between the original voice feature and the voice feature of the target speaker is smaller than a threshold value; and determining the utterance scenario to be the multi-speaker scenario when the mean square error between the original voice feature and the voice feature of the target speaker is greater than or equal to the threshold value.
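The threshold rule above can be sketched in a few lines. This is a minimal illustration that assumes both features are frame-by-dimension lists of equal shape; the function names, the scenario labels, and any threshold value passed in are hypothetical.

```python
def mean_square_error(feature_a, feature_b):
    """Mean of squared element-wise differences between two equally shaped features."""
    flat_a = [v for frame in feature_a for v in frame]
    flat_b = [v for frame in feature_b for v in frame]
    if not flat_a or len(flat_a) != len(flat_b):
        raise ValueError("features must be non-empty and have the same shape")
    return sum((a - b) ** 2 for a, b in zip(flat_a, flat_b)) / len(flat_a)


def decide_scenario(original, target, threshold):
    """Return "single" if the MSE is below the threshold, otherwise "multi"."""
    if mean_square_error(original, target) < threshold:
        return "single"
    return "multi"
```

The intuition is that, when only one person speaks, extracting the target speaker changes the feature very little, so the error stays small. An error exactly equal to the threshold falls on the multi-speaker side, matching the "greater than or equal to" wording.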

According to an embodiment, the obtaining of the final voice feature of the target speaker may include: in response to the single-speaker scenario, obtaining the final voice feature of the target speaker by inputting the original voice feature to a second network; or, in response to the multi-speaker scenario, obtaining the final voice feature of the target speaker by inputting the voice feature of the target speaker to the second network.
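The scenario-dependent choice of input can be sketched as follows. The second network is passed in as a callable, and the `mean_over_frames` stand-in used here is purely hypothetical; it only makes the routing observable and is not the network described in this document.

```python
def final_voice_feature(scenario, original, target, second_network):
    """Feed the second network the input that matches the utterance scenario."""
    if scenario == "single":
        # Single speaker: the unmodified original voice feature is used.
        return second_network(original)
    if scenario == "multi":
        # Multiple speakers: the extracted target-speaker feature is used.
        return second_network(target)
    raise ValueError(f"unknown scenario: {scenario!r}")


def mean_over_frames(frames):
    """Hypothetical stand-in for the second network: average the frames."""
    count = len(frames)
    return [sum(frame[d] for frame in frames) / count for d in range(len(frames[0]))]
```

Routing the original feature in the single-speaker case avoids any distortion the extraction step might introduce when there is nothing to separate.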

According to an embodiment, the second network may include: a speaker embedding layer that receives the original voice feature or the voice feature of the target speaker and outputs a middle embedding voice feature of the target speaker; and an attentive statistics pooling layer that receives the middle embedding voice feature of the target speaker and outputs the final voice feature of the target speaker.
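Attentive statistics pooling, as the term is generally used, turns a variable number of frames into one fixed-length vector by concatenating an attention-weighted mean and an attention-weighted standard deviation. The sketch below assumes a deliberately simplified scoring function (a dot product with a single hypothetical `attention_vector`); a real layer would normally learn a small scoring network, and the text does not describe this layer's internals.

```python
import math


def attentive_statistics_pooling(frames, attention_vector):
    """Pool frame-level embeddings into [weighted mean, weighted std]."""
    # One scalar score per frame (simplified, assumed scoring function).
    scores = [sum(a * h for a, h in zip(attention_vector, frame)) for frame in frames]
    # Softmax over frames, shifted by the maximum for numerical stability.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second_moment = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp at zero so rounding error cannot produce a negative variance.
    std = [math.sqrt(max(m2 - m * m, 0.0)) for m2, m in zip(second_moment, mean)]
    return mean + std
```

The output length is twice the embedding dimension regardless of how many frames go in, which is what allows utterances of different lengths to be compared.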

According to an embodiment, the determining of whether the target speaker corresponds to the user may include: calculating a similarity value between the final voice feature of the target speaker and the final voice feature of the user; and determining whether the target speaker corresponds to the user based on the calculation result.
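The text does not say which similarity measure is used. As one common choice, the sketch below assumes cosine similarity between the two final voice features and accepts the speaker when the value exceeds a threshold; the function names and the default threshold of 0.7 are hypothetical.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


def is_registered_user(target_final, user_final, threshold=0.7):
    """Accept when the similarity is strictly greater than the threshold."""
    return cosine_similarity(target_final, user_final) > threshold
```

In practice the threshold would be tuned on held-out data to balance false accepts against false rejects.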

According to an embodiment, the middle embedding voice feature of the user may be obtained as a result of inputting, to the speaker embedding layer, a voice feature of the user obtained based on a voice input of the user, and the final voice feature of the user may be obtained as a result of inputting the middle embedding voice feature of the user to the attentive statistics pooling layer.

According to an embodiment, the first network and the second network may be jointly trained with a third network that transforms a speaker's voice feature based on the speaker's middle embedding voice feature.

An electronic device according to an embodiment includes: a memory including instructions; and a processor electrically connected to the memory and configured to execute the instructions, wherein, when the instructions are executed by the processor, the processor extracts a voice feature of a target speaker based on a voice input, determines an utterance scenario of the voice input based on the voice feature of the target speaker, obtains a final voice feature of the target speaker based on the determined utterance scenario, and determines whether the target speaker corresponds to a user based on the final voice feature of the target speaker and a final voice feature of the user, and wherein the utterance scenario may include a single-speaker scenario and a multi-speaker scenario.

According to an embodiment, the processor may obtain an original voice feature based on the voice input, and may extract the voice feature of the target speaker by inputting the original voice feature and a middle embedding voice feature of the user to a first network.

According to an embodiment, the first network may include: a first convolution layer that receives the original voice feature and outputs a speaker extraction embedding feature for extracting the voice feature of a speaker included in the voice input; a splicing layer that receives the speaker extraction embedding feature and the middle embedding voice feature of the user and outputs a splicing feature; a second convolution layer that receives the splicing feature and outputs a mask; and a multiplier that receives the mask and the speaker extraction embedding feature and outputs the voice feature of the target speaker.

According to an embodiment, the processor may determine the utterance scenario of the voice input by comparing the original voice feature and the voice feature of the target speaker.

According to an embodiment, the processor may determine the utterance scenario to be the single-speaker scenario when the mean square error between the original voice feature and the voice feature of the target speaker is smaller than a threshold value, and may determine the utterance scenario to be the multi-speaker scenario when the mean square error between the original voice feature and the voice feature of the target speaker is greater than or equal to the threshold value.

According to an embodiment, the processor may, in response to the single-speaker scenario, obtain the final voice feature of the target speaker by inputting the original voice feature to a second network, or may, in response to the multi-speaker scenario, obtain the final voice feature of the target speaker by inputting the voice feature of the target speaker to the second network.

According to an embodiment, the second network may include: a speaker embedding layer that receives the original voice feature or the voice feature of the target speaker and outputs a middle embedding voice feature of the target speaker; and an attentive statistics pooling layer that receives the middle embedding voice feature of the target speaker and outputs the final voice feature of the target speaker.

According to an embodiment, the processor may calculate a similarity value between the final voice feature of the target speaker and the final voice feature of the user, and may determine whether the target speaker corresponds to the user based on the calculation result.

According to an embodiment, the first network and the second network may be jointly trained with a third network that transforms a speaker's voice feature based on the speaker's middle embedding voice feature.

FIG. 1 is a schematic block diagram of an electronic device for identifying a target speaker, according to an embodiment.
FIG. 2 is an example of a neural network-based speaker voice extraction model, according to an embodiment.
FIG. 3 is an example of a neural network-based speaker recognition model, according to an embodiment.
FIG. 4 is a diagram for describing an operation of identifying a target speaker based on a voice input, according to an embodiment.
FIG. 5 is an example for describing the training of neural network-based models, according to an embodiment.
FIG. 6 is another example for describing the training of neural network-based models, according to an embodiment.
FIG. 7 is a flowchart of an operation of identifying a target speaker, according to an embodiment.

Hereinafter, specific embodiments are provided to give the reader a comprehensive understanding of the methods, apparatuses, and/or systems described herein. After the disclosure of this application is understood, various changes, modifications, and equivalents of the methods, apparatuses, and/or systems described herein will be apparent. The order of operations described in this specification is only an example and is not limiting; the order may be changed, except for operations that must be performed in a specific order. In addition, for clarity and conciseness, descriptions of features known in the art may be omitted.

The features described herein may be implemented in other forms and should not be construed as being limited to the examples described herein. Rather, the examples described herein are provided to illustrate only some of the many possible ways of implementing the methods, apparatuses, and/or systems described herein, which will be apparent after the disclosure of this application is understood.

As used herein, the term "and/or" includes any one of the associated listed items and any combination of two or more of them.

Although terms such as "first," "second," and "third" may be used herein to describe various members, elements, regions, layers, or portions, the members, elements, regions, layers, or portions are not to be limited by these terms. Such terms are used only to distinguish one member, element, region, layer, or portion from another. Accordingly, without departing from the teachings of the examples, a first member, first element, first region, first layer, or first portion mentioned in the examples described herein may also be referred to as a second member, second element, second region, second layer, or second portion.

In the specification, when an element (e.g., a layer, region, or substrate) is described as being "on," "connected to," or "coupled to" another element, the element may be directly "on," "connected to," or "coupled to" the other element, or one or more other elements may be present between them. In contrast, when an element is described as being "directly on," "directly connected to," or "directly coupled to" another element, there may be no other elements between them.

The terms used herein are used only to describe various examples and are not intended to limit the disclosure. Unless the context clearly indicates otherwise, singular forms are intended to include plural forms. The terms "comprise," "include," and "have" specify the presence of the stated features, quantities, operations, components, elements, and/or combinations thereof, but do not preclude the presence or addition of one or more other features, quantities, operations, components, elements, and/or combinations thereof.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains after understanding the present disclosure. Unless explicitly defined herein, terms (e.g., terms defined in general dictionaries) are to be interpreted as having meanings consistent with their meanings in the context of the relevant art and the present disclosure.

In addition, in describing the embodiments, a detailed description of a related known structure or function will be omitted when it is determined that the description may obscure the description of the present invention.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. In the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the figure in which they appear, and redundant descriptions thereof will be omitted.

FIG. 1 is a schematic block diagram of an electronic device for identifying a target speaker, according to an embodiment.

According to an embodiment, the electronic device 100 may be a device that is locked for security. The electronic device 100 may unlock itself by comparing information about the voice of a pre-registered user with information about a newly obtained voice. The electronic device 100 may extract information about the voice of a target speaker based on a voice input, and may determine, based on that information, whether the target speaker is the same person as the pre-registered user. The target speaker may be the subject that is recognized in the voice input and is to be verified. The electronic device 100 may provide a technique for accurately recognizing and verifying complex voice through an end-to-end model. The electronic device 100 may provide a technique for recognizing and verifying the target speaker even when the input voice does not include a designated keyword.

Referring to FIG. 1, according to an embodiment, the electronic device 100 may include a processor 110 and a memory 130. The processor 110 may process data stored in the memory 130. The processor 110 may execute computer-readable code (e.g., software) stored in the memory 130 and instructions triggered by the processor 110. The processor 110 may be a hardware-implemented data processing device having circuitry with a physical structure for executing desired operations. For example, the desired operations may include code or instructions included in a program. For example, the hardware-implemented data processing device may include a microprocessor, a central processing unit, a processor core, a multi-core processor, a multiprocessor, an application-specific integrated circuit (ASIC), and a field programmable gate array (FPGA).

According to an embodiment, the memory 130 may store data for an operation or a result of the operation. The memory 130 may store instructions (or programs) executable by the processor 110. For example, the instructions may include instructions for executing an operation of the processor 110 and/or an operation of each component of the processor 110. The memory 130 may be implemented as a volatile memory device or a non-volatile memory device. The volatile memory device may be implemented as dynamic random access memory (DRAM), static random access memory (SRAM), thyristor RAM (T-RAM), zero capacitor RAM (Z-RAM), or twin transistor RAM (TTRAM). The non-volatile memory device may be implemented as electrically erasable programmable read-only memory (EEPROM), flash memory, magnetic RAM (MRAM), spin-transfer torque MRAM (STT-MRAM), conductive bridging RAM (CBRAM), ferroelectric RAM (FeRAM), phase change RAM (PRAM), resistive RAM (RRAM), nanotube RRAM, polymer RAM (PoRAM), nano floating gate memory (NFGM), holographic memory, a molecular electronic memory device, or insulator resistance change memory. The memory 130 may store data.

According to an embodiment, the memory 130 may store information related to the user of the electronic device 100. The information related to the user of the electronic device 100 may include the middle embedding voice feature of the user and the final voice feature of the user. The target speaker identification operation of the processor 110 is described in detail below.

According to an embodiment, the processor 110 may extract the voice feature of the target speaker based on a voice input. The processor 110 may obtain an original voice feature based on the voice input, and may extract the voice feature of the target speaker by inputting the original voice feature and the middle embedding voice feature of the user (e.g., a pre-stored middle embedding voice feature of the user) to the first network.

According to an embodiment, the voice input may be a voice that is input in order to use the electronic device 100. The voice input may be generated in various utterance scenarios. For example, the voice input may be generated by a single speaker in a single-speaker scenario. As another example, the voice input may be generated by multiple speakers in a multi-speaker scenario. The voice input may include the voice of a single speaker or the voices of multiple speakers.

According to an embodiment, the target speaker may be the speaker that is recognized in the voice input and is to be verified. The target speaker may be the subject for which it is verified whether they are the same person as the user of the electronic device 100. For example, in a single-speaker scenario, the target speaker may be that one speaker. As another example, in a multi-speaker scenario, the target speaker may be any one of the plurality of speakers. The target speaker may be both the subject of verification and a speaker whose features are similar to those of the registered user.

According to an embodiment, the intermediate embedded voice feature of the user may be a pre-stored feature of the user. The intermediate embedded voice feature of the user may be used to extract information about the target speaker from the voice input. The original voice features and the voice features of the target speaker may be frame-level voice features (e.g., organized into frames generated over time). When the voice is long (e.g., long on the time axis), the amount of data corresponding to the voice features may also be large. For example, the original voice features and the voice features of the target speaker may be Mel-scale frequency cepstral coefficients (MFCCs). As another example, the original voice features and the voice features of the target speaker may be filter bank (FilterBank) features. However, these are merely examples, and the present disclosure is not limited thereto. The operation in which the processor 110 obtains the voice features of the target speaker is described in detail with reference to FIG. 2.
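The statement that frame-level features grow with the length of the voice can be made concrete with a short calculation. The following is a minimal sketch only; the 16 kHz sample rate, 25 ms window, 10 ms hop, and 40 coefficients per frame are illustrative assumptions and are not values given in this disclosure.

```python
from typing import Tuple


def num_frames(num_samples: int, frame_length: int, hop_length: int) -> int:
    """Count the full analysis frames that fit in a signal (no padding)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // hop_length


def feature_shape(duration_s: float,
                  sample_rate: int = 16000,   # assumed sample rate
                  frame_ms: float = 25.0,     # assumed window length
                  hop_ms: float = 10.0,       # assumed hop between frames
                  num_coeffs: int = 40) -> Tuple[int, int]:
    """Return (frames, coefficients) for a frame-level feature such as MFCC."""
    frame_length = int(sample_rate * frame_ms / 1000)
    hop_length = int(sample_rate * hop_ms / 1000)
    num_samples = int(duration_s * sample_rate)
    return num_frames(num_samples, frame_length, hop_length), num_coeffs
```

Under these assumptions, `feature_shape(1.0)` gives 98 frames and `feature_shape(3.0)` gives 298 frames, so the frame-level representation roughly triples when the voice is three times longer.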

According to an embodiment, the processor 110 may determine the utterance scenario of the voice input based on the voice features of the target speaker. The utterance scenario of the voice input may include a single-speaker scenario and/or a multi-speaker scenario. The processor 110 may determine the utterance scenario by comparing the original voice features with the voice features of the target speaker. When the mean square error (MSE) between the original voice features and the voice features of the target speaker is less than a threshold, the processor 110 may determine that the utterance scenario is the single-speaker scenario. When the MSE is greater than or equal to the threshold, the processor 110 may determine that the utterance scenario is the multi-speaker scenario. Because the determination relies only on the difference between the voice features of the target speaker and the original voice features, the processor 110 may determine the utterance scenario easily. By determining whether the utterance scenario is the single-speaker scenario or the multi-speaker scenario, the processor 110 may select an input suitable for the second network 300. The processor 110 may thereby ensure voiceprint verification performance in the multi-speaker scenario as well as in the single-speaker scenario, and may provide the user with a high-performance voiceprint verification service.
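The threshold comparison described above can be sketched as follows. This is a minimal illustration, assuming that both feature sets are given as lists of equally sized frames; the function names and the threshold value passed by the caller are hypothetical and are not taken from this disclosure.

```python
from typing import Sequence

Frames = Sequence[Sequence[float]]


def mean_square_error(original: Frames, target: Frames) -> float:
    """Mean of squared element-wise differences over all frames."""
    if len(original) != len(target):
        raise ValueError("feature sets must have the same number of frames")
    total = 0.0
    count = 0
    for frame_a, frame_b in zip(original, target):
        if len(frame_a) != len(frame_b):
            raise ValueError("frames must have the same dimension")
        for a, b in zip(frame_a, frame_b):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("feature sets must not be empty")
    return total / count


def decide_scenario(original: Frames, target: Frames, threshold: float) -> str:
    """Return 'single' when MSE < threshold, otherwise 'multi'."""
    mse = mean_square_error(original, target)
    return "single" if mse < threshold else "multi"
```

The intuition is that, when only one person is speaking, extracting the target speaker changes the features very little, so the error stays below the threshold.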

According to an embodiment, the processor 110 may obtain a final voice feature of the target speaker based on the determined utterance scenario. In the single-speaker scenario, the processor 110 may obtain the final voice feature of the target speaker by inputting the original voice features to the second network. In the multi-speaker scenario, the processor 110 may obtain the final voice feature of the target speaker by inputting the voice features of the target speaker to the second network. The final voice feature may have a fixed size regardless of the length of the voice. The final voice feature may be a one-dimensional vector. For example, the size of the final voice feature may be set to 1x128, 1x256, or 1x512. The operation of obtaining the final voice feature of the target speaker is described in detail with reference to FIG. 3.

According to an embodiment, the processor 110 may determine whether the target speaker corresponds to the user based on the final voice feature of the target speaker and the final voice feature of the user. The processor 110 may calculate a similarity value between the final voice feature of the target speaker and the final voice feature of the user, and may determine whether the target speaker corresponds to the user based on the calculated similarity value. For example, when the similarity value (e.g., a cosine similarity value) is greater than a threshold, the processor 110 may determine that the target speaker corresponds to the user. As another example, when the similarity value is less than or equal to the threshold, the processor 110 may determine that the target speaker does not correspond to the user. The configurations and operations of the first network and the second network are described in detail below.
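The similarity check can be illustrated with cosine similarity, which the text names as one example of a similarity value. The sketch below is illustrative only; the function names are hypothetical and the threshold is supplied by the caller, since no threshold value is given in this disclosure.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def corresponds_to_user(target_final: Sequence[float],
                        user_final: Sequence[float],
                        threshold: float) -> bool:
    """True only when the similarity is strictly greater than the threshold."""
    return cosine_similarity(target_final, user_final) > threshold
```

A similarity exactly equal to the threshold is treated as a non-match, mirroring the "less than or equal to" case in the text.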

FIG. 2 illustrates an example of a neural-network-based speaker voice extraction model, according to an embodiment.

Referring to FIG. 2, according to an embodiment, a processor (e.g., the processor 110 of FIG. 1) may extract the voice features of the target speaker from the voice input based on the first network 200. The first network 200 may include a first convolution layer 210, a splicing layer 220, a second convolution layer 230, and a multiplier 240.

According to an embodiment, the first convolution layer 210 may receive the original voice features and output a speaker extraction embedding feature used to extract the voice features of a speaker included in the voice input. The splicing layer 220 may receive the speaker extraction embedding feature and the intermediate embedded voice feature of the user, and output a spliced feature. The intermediate embedded voice feature of the user may be a pre-stored feature of the user, and may be used to extract information about the target speaker from the voice input. The second convolution layer 230 may receive the spliced feature and output a mask. The second convolution layer 230 may be a fully convolutional layer and may include a plurality of 1D convolution layers. Optionally, a normalization layer may be added between the splicing layer 220 and the second convolution layer 230 to normalize the spliced feature. The multiplier 240 may receive the mask and the speaker extraction embedding feature, and output the voice features of the target speaker. The target speaker may be a speaker who is recognized in the voice input and is to be verified, and the voice features of the target speaker may be frame-level voice features (e.g., organized into frames generated over time).
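The data flow of splicing, mask generation, and multiplication can be sketched as a toy example. Several simplifications are assumed and are not stated in this disclosure: the user's intermediate embedded voice feature is treated as a single vector appended to every frame, the second convolution layer 230 is replaced by one linear map per frame followed by a sigmoid, and the input frames stand in for the output of the first convolution layer 210. All function names are hypothetical.

```python
import math
from typing import List, Sequence

Matrix = List[List[float]]


def splice(frames: Sequence[Sequence[float]],
           user_embedding: Sequence[float]) -> Matrix:
    """Append the user embedding to every frame (assumed splicing scheme)."""
    return [list(frame) + list(user_embedding) for frame in frames]


def mask_layer(spliced: Matrix, weights: Matrix, bias: Sequence[float]) -> Matrix:
    """Stand-in for layer 230: per-frame linear map plus sigmoid, values in (0, 1)."""
    mask = []
    for row in spliced:
        out = []
        for w_row, b in zip(weights, bias):
            z = sum(w * x for w, x in zip(w_row, row)) + b
            out.append(1.0 / (1.0 + math.exp(-z)))
        mask.append(out)
    return mask


def apply_mask(mask: Matrix, frames: Sequence[Sequence[float]]) -> Matrix:
    """Stand-in for multiplier 240: element-wise product of mask and embedding."""
    return [[m * x for m, x in zip(m_row, f_row)]
            for m_row, f_row in zip(mask, frames)]


def extract_target_features(extraction_embedding: Sequence[Sequence[float]],
                            user_embedding: Sequence[float],
                            weights: Matrix,
                            bias: Sequence[float]) -> Matrix:
    spliced = splice(extraction_embedding, user_embedding)
    mask = mask_layer(spliced, weights, bias)
    return apply_mask(mask, extraction_embedding)
```

The point of the sketch is the shape of the computation: the mask has the same shape as the speaker extraction embedding feature, so multiplying the two attenuates the components attributed to other speakers while keeping the frame structure.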

FIG. 3 illustrates an example of a neural-network-based speaker recognition model, according to an embodiment.

Referring to FIG. 3, according to an embodiment, a processor (e.g., the processor 110 of FIG. 1) may use the second network 300 to obtain the final voice feature of the target speaker from either the original voice features or the voice features of the target speaker. The second network 300 may include a speaker embedding layer 310 and an attentive statistics pooling layer 320.

According to an embodiment, the speaker embedding layer 310 may receive the original voice features or the voice features of the target speaker, and output an intermediate embedded voice feature of the target speaker. The speaker embedding layer 310 may be constructed using a squeeze-and-excitation (SE) block, a ResNet block, or a time delay neural network (TDNN). However, these are merely examples, and the present disclosure is not limited thereto. The speaker embedding layer 310 may be another commonly used neural network or a combination of such networks.

The attentive statistics pooling layer 320 may receive the intermediate embedded voice feature of the target speaker and output the final voice feature of the target speaker. The final voice feature may have a fixed size regardless of the length of the voice. For example, the final voice feature may be a one-dimensional vector.
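How a pooling layer turns a variable number of frames into a fixed-size vector can be sketched as follows. This follows the commonly described form of attentive statistics pooling (attention-weighted mean and standard deviation over time, concatenated); whether layer 320 uses exactly this form is an assumption. The per-frame attention scores, which a small learned network would normally produce, are passed in directly here as a hypothetical input.

```python
import math
from typing import List, Sequence


def attentive_statistics_pooling(frames: Sequence[Sequence[float]],
                                 scores: Sequence[float]) -> List[float]:
    """Pool T frames of dimension D into one vector of length 2 * D."""
    if not frames or len(frames) != len(scores):
        raise ValueError("need one attention score per frame")
    # Softmax over time turns the scores into weights that sum to 1.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(frames[0])
    mean = [sum(w * f[d] for w, f in zip(weights, frames)) for d in range(dim)]
    second = [sum(w * f[d] ** 2 for w, f in zip(weights, frames)) for d in range(dim)]
    # Clamp at zero to guard against tiny negative values from rounding.
    std = [math.sqrt(max(s - m * m, 0.0)) for s, m in zip(second, mean)]
    return mean + std
```

The output length depends only on the frame dimension, not on the number of frames, which is the property the text attributes to the final voice feature.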

According to an embodiment, the processor 110 may select an input suitable for the second network 300 by determining whether the utterance scenario of the voice input is the single-speaker scenario or the multi-speaker scenario. The processor 110 may thereby ensure voiceprint verification performance in the multi-speaker scenario as well as in the single-speaker scenario, and may provide the user with a high-performance voiceprint verification service.

According to an embodiment, the processor 110 may obtain the intermediate embedded voice feature of the user and the final voice feature of the user based on the second network 300. The intermediate embedded voice feature of the user and the final voice feature of the user may be obtained in a voiceprint registration stage. The intermediate embedded voice feature of the user may be obtained by inputting the voice features of the user to the speaker embedding layer 310, and may be used when extracting the voice features of the target speaker. The final voice feature of the user may be obtained by inputting the intermediate embedded voice feature of the user to the attentive statistics pooling layer 320, and may be used when verifying whether the voice input includes the voice of the user.

FIG. 4 is a diagram illustrating an operation of identifying a target speaker based on a voice input, according to an embodiment.

Referring to FIG. 4, according to an embodiment, a processor (e.g., the processor 110 of FIG. 1) may obtain the original voice features from the voice input using a feature extraction module 410. The voice input may be a voice that is input in order to use the electronic device 100. The voice input may be generated in various utterance scenarios, and may include the voice of a single speaker or the voices of multiple speakers.

According to an embodiment, the processor 110 may use the first network 200 to obtain the voice features of the target speaker from the original voice features and the intermediate embedded voice feature of the user. The target speaker may be a speaker who is recognized in the voice input and is to be verified. The target speaker may be the subject for whom it is verified whether he or she is the same person as the user of the electronic device 100. The target speaker may be the subject of verification and may be a speaker whose features are similar to those of the registered user. The original voice features and the voice features of the target speaker may be frame-level voice features (e.g., organized into frames generated over time). When the voice is long (e.g., long on the time axis), the amount of data corresponding to the voice features may also be large. For example, the original voice features and the voice features of the target speaker may be Mel-scale frequency cepstral coefficients (MFCCs) or filter bank (FilterBank) features. However, these are merely examples, and the present disclosure is not limited thereto.

According to an embodiment, the first network 200 may include the first convolution layer 210, the splicing layer 220, the second convolution layer 230, and the multiplier 240. The first convolution layer 210 may receive the original voice features and output a speaker extraction embedding feature used to extract the voice features of a speaker included in the voice input. The splicing layer 220 may receive the speaker extraction embedding feature and the intermediate embedded voice feature of the user, and output a spliced feature. The intermediate embedded voice feature of the user may be a pre-stored feature of the user, and may be used to extract information about the target speaker from the voice input. The second convolution layer 230 may receive the spliced feature and output a mask. The multiplier 240 may receive the mask and the speaker extraction embedding feature, and output the voice features of the target speaker.

According to an embodiment, the processor 110 may determine the utterance scenario of the voice input based on the original voice features and the voice features of the target speaker. The utterance scenario of the voice input may include a single-speaker scenario and/or a multi-speaker scenario. The processor 110 may determine the utterance scenario by comparing the original voice features with the voice features of the target speaker. When the mean square error (MSE) between the original voice features and the voice features of the target speaker is less than a threshold, the processor 110 may determine that the utterance scenario is the single-speaker scenario. When the MSE is greater than or equal to the threshold, the processor 110 may determine that the utterance scenario is the multi-speaker scenario. Because the determination relies only on the difference between the voice features of the target speaker and the original voice features, the processor 110 may determine the utterance scenario easily.

According to an embodiment, the processor 110 may obtain the final voice feature of the target speaker based on the determined utterance scenario. In the single-speaker scenario, the processor 110 may obtain the final voice feature of the target speaker by inputting the original voice features to the second network 300. In the multi-speaker scenario, the processor 110 may obtain the final voice feature of the target speaker by inputting the voice features of the target speaker to the second network 300. The final voice feature may have a fixed size regardless of the length of the voice. The final voice feature may be a one-dimensional vector. For example, the size of the final voice feature may be set to 1x128, 1x256, or 1x512.

According to an embodiment, the second network 300 may include the speaker embedding layer 310 and the attentive statistics pooling layer 320. The speaker embedding layer 310 may receive the original voice features or the voice features of the target speaker, and output the intermediate embedded voice feature of the target speaker. The attentive statistics pooling layer 320 may receive the intermediate embedded voice feature of the target speaker and output the final voice feature of the target speaker. The final voice feature may have a fixed size regardless of the length of the voice.

According to an embodiment, the processor 110 may determine whether the target speaker corresponds to the user based on the final voice feature of the target speaker and the final voice feature of the user. The processor 110 may calculate a similarity value between the final voice feature of the target speaker and the final voice feature of the user, and may determine whether the target speaker corresponds to the user based on the calculated similarity value. For example, when the similarity value (e.g., a cosine similarity value) is greater than a threshold, the processor 110 may determine that the target speaker corresponds to the user. As another example, when the similarity value is less than or equal to the threshold, the processor 110 may determine that the target speaker does not correspond to the user. The processor 110 may provide a technique for accurately recognizing and verifying complex speech through an end-to-end model. The end-to-end model may include the first network 200 and the second network 300 described above. The first network 200 and the second network 300 may have been trained jointly with a third network that converts a speaker's voice feature based on the speaker's intermediate embedded voice feature. The training of the networks is described below.

FIG. 5 illustrates an example of training neural-network-based models according to an embodiment, and FIG. 6 illustrates another example of training neural-network-based models according to an embodiment.

Referring to FIG. 5, a first network 510, a second network 520, and a third network 530 may be trained jointly. The first network 510 and the second network 520 may be trained by supervised learning, and the third network 530 may be trained by self-supervised learning. The first network 510 may be a model for extracting a speaker's voice. The second network 520 may be a model for performing speaker recognition. The third network 530 may be a model for converting a speaker's voice. In the training process, the first network 510 may use a speaker separation loss function, the second network 520 may use a speaker recognition loss function, and the third network 530 may use a voice conversion loss function. The first network 510, the second network 520, and the third network 530 may be trained so that the speaker separation loss of the first network 510, the speaker recognition loss of the second network 520, and the voice conversion loss of the third network 530 are minimized. Because the end-to-end deep learning network is updated in consideration of the speaker separation loss, the speaker recognition loss, and the voice conversion loss, the trained first network (e.g., the first network 200 of FIG. 2) and the trained second network (e.g., the second network 300 of FIG. 3) may have high recognition performance and generalization performance. It should be understood that the third network 530 is used in the training process to improve the performance of the first network 510 and the second network 520, but is not used in the recognition process.
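The text states that the three losses are minimized together but does not say how they are combined. One common choice, assumed here purely for illustration, is a weighted sum. The symbols below are hypothetical notation: $L_{\text{sep}}$, $L_{\text{spk}}$, and $L_{\text{vc}}$ denote the speaker separation, speaker recognition, and voice conversion losses, and the non-negative weights $\lambda$ are hyperparameters that do not appear in this disclosure.

```latex
L_{\text{total}}
  = \lambda_{\text{sep}} \, L_{\text{sep}}
  + \lambda_{\text{spk}} \, L_{\text{spk}}
  + \lambda_{\text{vc}} \, L_{\text{vc}}
```

With all weights equal to 1 this reduces to a plain sum of the three losses. At recognition time only the networks behind $L_{\text{sep}}$ and $L_{\text{spk}}$ are used, consistent with the third network being a training-only component.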

A processor (e.g., the processor 110 of FIG. 1) may jointly train the first network 510, the second network 520, and the third network 530 using an original data set and an additional data set. The original data set may include a plurality of voices and speaker tags respectively corresponding to the plurality of voices. Each of the plurality of voices may include the voice of a single speaker.

The processor 110 may generate the additional data set based on the original data set. The additional data set may include voices obtained by combining at least two of the plurality of voices included in the original data set. For example, the processor 110 may select a voice C of a first speaker and a voice D of a second speaker from the original data set, and combine the two voices to obtain a voice M. For example, the voice M may be a direct fusion of the voice C and the voice D, and the voice C may be used as the tag of the voice M. The operation of tagging the newly generated voice with the corresponding information may be performed automatically. That is, the additional data set may be generated automatically in the self-supervised learning process. Accordingly, the processor 110 may train the networks without requiring additional collected data or additional annotation.
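The automatic construction of a tagged mixture can be sketched as follows. This is a minimal illustration that assumes "direct fusion" means sample-wise addition of two waveforms, with the shorter one zero-padded; the disclosure does not specify the fusion operation, and the function names and dictionary keys are hypothetical.

```python
from typing import Dict, List, Sequence


def mix_voices(voice_c: Sequence[float], voice_d: Sequence[float]) -> List[float]:
    """Sample-wise sum of two waveforms; the shorter one is zero-padded."""
    length = max(len(voice_c), len(voice_d))
    padded_c = list(voice_c) + [0.0] * (length - len(voice_c))
    padded_d = list(voice_d) + [0.0] * (length - len(voice_d))
    return [c + d for c, d in zip(padded_c, padded_d)]


def make_additional_example(voice_c: Sequence[float],
                            voice_d: Sequence[float]) -> Dict[str, List[float]]:
    """Build one training example: mixture M, tagged with clean voice C."""
    return {
        "mixture": mix_voices(voice_c, voice_d),  # voice M
        "tag": list(voice_c),                     # voice C serves as the tag
    }
```

Because the tag is simply one of the inputs to the mixture, no human labeling step is involved, which is what allows the additional data set to be generated automatically.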

Optionally, the processor 110 may perform preprocessing (e.g., random clipping, noise addition, reverberation addition, volume enhancement, etc.) on the generated additional data set.
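Three of the listed preprocessing steps can be sketched on a raw waveform. The interpretation of "random clipping" as taking a random fixed-length segment, the use of Gaussian noise, and a constant gain for the volume change are all assumptions made for illustration; reverberation is omitted because it would require an impulse response that this disclosure does not describe. The function names are hypothetical.

```python
import random
from typing import List, Sequence


def random_clip(samples: Sequence[float], clip_len: int,
                rng: random.Random) -> List[float]:
    """Return a random contiguous segment of clip_len samples."""
    if clip_len >= len(samples):
        return list(samples)
    start = rng.randrange(len(samples) - clip_len + 1)
    return list(samples[start:start + clip_len])


def add_noise(samples: Sequence[float], noise_std: float,
              rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with the given standard deviation."""
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def change_volume(samples: Sequence[float], gain: float) -> List[float]:
    """Scale every sample by a constant gain factor."""
    return [s * gain for s in samples]
```

Passing an explicit `random.Random` instance keeps the augmentation reproducible when a fixed seed is used.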

The processor 110 may perform voice feature extraction through a feature extraction module 501. For example, the processor 110 may extract original voice features of the original data set and original voice features of the additional data set through the feature extraction module 501.

The processor 110 may perform a speaker recognition task through the second network 520. The input of the second network 520 may be original voice features extracted from a speaker's voice (e.g., a voice included in the original data set). The speaker recognition loss function may be based on a metric learning loss. The processor 110 may update the second network 520 so as to minimize the speaker recognition loss by adjusting the intra-class distance.

The processor 110 may perform a speaker voice extraction task through the first network 510. The inputs of the first network 510 may be original voice features extracted from a multi-speaker voice (e.g., a voice included in the additional data set) and a speaker's intermediate embedded voice feature output by the second network 520. For example, the original voice features may be extracted from the combined voice of a first speaker and a second speaker, and the speaker's intermediate embedded voice feature may be a vector based on the voice of the first speaker. In this case, the target speaker may be the first speaker. The processor 110 may update the second network 520 by backpropagation so as to minimize the speaker separation loss.

The processor 110 may perform a voice conversion task through the third network 530. The third network 530 may include an adaptive instance normalization (AdaIN) network for converting a voice. The AdaIN network may receive two inputs. One input may be the voice feature to be converted, and the other input may be the voice feature that serves as the reference for the conversion. The AdaIN network may be expressed by Equation 1.

[Equation 1]

$$\mathrm{AdaIN}(x, y) = \sigma(y)\left(\frac{x - \mu(x)}{\sigma(x)}\right) + \mu(y)$$

In Equation 1, $x$ denotes the voice feature to be converted and $y$ denotes the voice feature that serves as the reference for the conversion; $\mu(x)$ and $\sigma(x)$ are the mean and standard deviation, respectively, of the voice feature to be converted, and $\mu(y)$ and $\sigma(y)$ may be the mean and standard deviation, respectively, of the reference voice feature. For example, the third network 530 may output a converted voice feature of the first speaker based on a vector derived from the voice of the first speaker (e.g., the intermediate embedded voice feature of the first speaker) and the voice feature of the first speaker. The vector derived from the voice of the first speaker may be the reference voice feature, and the voice feature of the first speaker may be the voice feature to be converted. The processor 110 may update the third network 530 by backpropagation so as to minimize the voice conversion loss. The third network 530 may be updated jointly with the second network 520.
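Adaptive instance normalization, as it is commonly defined, normalizes the feature to be converted by its own mean and standard deviation and then rescales and shifts it with the standard deviation and mean of the reference. The sketch below applies this to a single channel held in a plain list; treating the feature as one channel and adding a small `eps` for numerical stability are assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_std(values: Sequence[float], eps: float = 1e-8) -> Tuple[float, float]:
    """Mean and (population) standard deviation, with eps to avoid dividing by zero."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, math.sqrt(var + eps)


def adain(content: Sequence[float], reference: Sequence[float]) -> List[float]:
    """Give `content` the mean and standard deviation of `reference`."""
    mu_x, sigma_x = mean_std(content)
    mu_y, sigma_y = mean_std(reference)
    return [sigma_y * (v - mu_x) / sigma_x + mu_y for v in content]
```

After the transform, the output keeps the relative pattern of the converted feature while its first- and second-order statistics match those of the reference.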

Each network described with reference to FIG. 5 may be substantially similar to the corresponding network of FIG. 6. That is, the description of each network given with reference to FIG. 5 also applies to each network of FIG. 6, and repeated description is therefore omitted.

Referring to FIG. 6, the network structure of FIG. 6 may be a parallel network structure, whereas the network structure of FIG. 5 may be a serial network structure. Although FIGS. 5 and 6 disclose exemplary configurations of the network structure (i.e., serial and parallel), the present disclosure is not limited thereto, and the networks may be connected and/or configured in other ways.

FIG. 7 is a flowchart of an operation of identifying a target speaker, according to an embodiment.

Operations 710 to 770 may be performed sequentially, but are not necessarily performed sequentially. For example, the order of operations 710 to 770 may be changed, and at least two of the operations may be performed in parallel. Operations 710 to 770 may be performed by the processor described above (e.g., the processor 110 of FIG. 1), so redundant descriptions are omitted.

In operation 710, the processor 110 may extract voice features of a target speaker based on a voice input.

In operation 730, the processor 110 may determine an utterance scenario of the voice input based on the voice features of the target speaker.

In operation 750, the processor 110 may obtain final voice features of the target speaker based on the determined utterance scenario.

In operation 770, the processor 110 may determine whether the target speaker corresponds to a user based on the final voice features of the target speaker and final voice features of the user.
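The control flow of operations 710 to 770 can be sketched as follows. The mean-square-error comparison against a threshold and the choice of which features are fed to the second network follow the rules recited in the claims. Everything else is an assumption made for illustration: the function names, the use of cosine similarity as the similarity value, both threshold values, and the toy callables that stand in for the first and second networks.

```python
import math


def mean_square_error(a, b):
    """Mean of squared element-wise differences between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def determine_scenario(original, target, threshold):
    """Operation 730: single-speaker if the MSE is below the threshold, else multi-speaker."""
    return "single" if mean_square_error(original, target) < threshold else "multi"


def cosine_similarity(u, v):
    """Assumed similarity measure; the disclosure only recites 'a similarity value'."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def identify(original, user_intermediate, user_final,
             first_network, second_network, mse_threshold, sim_threshold):
    # Operation 710: extract the target speaker's voice features.
    target = first_network(original, user_intermediate)
    # Operation 730: decide the utterance scenario.
    scenario = determine_scenario(original, target, mse_threshold)
    # Operation 750: feed the original features (single) or the extracted ones (multi).
    final = second_network(original if scenario == "single" else target)
    # Operation 770: compare against the enrolled user's final voice features.
    return scenario, cosine_similarity(final, user_final) >= sim_threshold


# Hypothetical stand-ins for the trained networks, used only to exercise the flow.
scenario, accepted = identify(
    original=[1.0, 2.0, 3.0],
    user_intermediate=[0.0],
    user_final=[1.0, 2.0, 3.0],
    first_network=lambda feats, user_mid: [0.5 * f for f in feats],
    second_network=lambda feats: list(feats),
    mse_threshold=0.1,
    sim_threshold=0.7,
)
```

Passing the networks in as callables keeps the sketch independent of any particular model framework; a real system would substitute the trained first and second networks and tune both thresholds on validation data.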

Although the present disclosure includes specific examples, it will be apparent to those of ordinary skill in the art that various changes in form and detail may be made in these examples without departing from the spirit and scope of the claims and their equivalents. The examples described herein are to be considered illustrative only and not limiting. Descriptions of features or aspects in each example are to be considered applicable to similar features or aspects in other examples. Suitable results may be achieved if the described techniques are performed in a different order, and/or if elements of the described system, architecture, device, or circuit are combined in a different manner and/or replaced or supplemented by other elements or their equivalents. Therefore, the scope of the present disclosure is defined not by the specific embodiments but by the claims and their equivalents, and all variations within the scope of the claims and their equivalents are to be construed as being included in the present disclosure.

Claims

A method of operating an electronic device, the method comprising:
extracting voice features of a target speaker based on a voice input;
determining an utterance scenario of the voice input based on the voice features of the target speaker;
obtaining final voice features of the target speaker based on the determined utterance scenario; and
determining whether the target speaker corresponds to a user based on the final voice features of the target speaker and final voice features of the user,
wherein the utterance scenario includes a single-speaker scenario and a multi-speaker scenario.

The method of claim 1, wherein the extracting comprises:
obtaining original voice features based on the voice input; and
extracting the voice features of the target speaker by inputting the original voice features and intermediate embedded voice features of the user to a first network.

The method of claim 2, wherein the first network comprises:
a first convolution layer that receives the original voice features and outputs speaker extraction embedding features for extracting voice features of a speaker included in the voice input;
a splicing layer that receives the speaker extraction embedding features and the intermediate embedded voice features of the user and outputs spliced features;
a second convolution layer that receives the spliced features and outputs a mask; and
a multiplier that receives the mask and the speaker extraction embedding features and outputs the voice features of the target speaker.

The method of claim 3, wherein the determining of the utterance scenario of the voice input comprises:
determining the utterance scenario of the voice input by comparing the original voice features with the voice features of the target speaker.

The method of claim 4, wherein the determining of the utterance scenario of the voice input by comparing the original voice features with the voice features of the target speaker comprises:
determining the utterance scenario to be the single-speaker scenario when a mean square error between the original voice features and the voice features of the target speaker is less than a threshold value; and
determining the utterance scenario to be the multi-speaker scenario when the mean square error between the original voice features and the voice features of the target speaker is greater than or equal to the threshold value.

The method of claim 5, wherein the obtaining of the final voice features of the target speaker comprises:
in response to the single-speaker scenario, obtaining the final voice features of the target speaker by inputting the original voice features to a second network; or
in response to the multi-speaker scenario, obtaining the final voice features of the target speaker by inputting the voice features of the target speaker to the second network.

The method of claim 6, wherein the second network comprises:
a speaker embedding layer that receives the original voice features or the voice features of the target speaker and outputs intermediate embedded voice features of the target speaker; and
an attention statistics pooling layer that receives the intermediate embedded voice features of the target speaker and outputs the final voice features of the target speaker.

The method of claim 7, wherein the determining of whether the target speaker corresponds to the user comprises:
calculating a similarity value between the final voice features of the target speaker and the final voice features of the user; and
determining whether the target speaker corresponds to the user based on a result of the calculating.

The method of claim 8, wherein
the intermediate embedded voice features of the user are obtained by inputting, to the speaker embedding layer, voice features of the user obtained based on a voice input of the user, and
the final voice features of the user are obtained by inputting the intermediate embedded voice features of the user to the attention statistics pooling layer.

The method of claim 9, wherein the first network and the second network are trained jointly with a third network that converts voice features of a speaker based on intermediate embedded voice features of the speaker.

An electronic device comprising:
a memory storing instructions; and
a processor electrically connected to the memory and configured to execute the instructions,
wherein, when the instructions are executed by the processor, the processor:
extracts voice features of a target speaker based on a voice input;
determines an utterance scenario of the voice input based on the voice features of the target speaker;
obtains final voice features of the target speaker based on the determined utterance scenario; and
determines whether the target speaker corresponds to a user based on the final voice features of the target speaker and final voice features of the user,
wherein the utterance scenario includes a single-speaker scenario and a multi-speaker scenario.

The electronic device of claim 11, wherein the processor:
obtains original voice features based on the voice input; and
extracts the voice features of the target speaker by inputting the original voice features and intermediate embedded voice features of the user to a first network.

The electronic device of claim 12, wherein the first network comprises:
a first convolution layer that receives the original voice features and outputs speaker extraction embedding features for extracting voice features of a speaker included in the voice input;
a splicing layer that receives the speaker extraction embedding features and the intermediate embedded voice features of the user and outputs spliced features;
a second convolution layer that receives the spliced features and outputs a mask; and
a multiplier that receives the mask and the speaker extraction embedding features and outputs the voice features of the target speaker.

The electronic device of claim 13, wherein the processor
determines the utterance scenario of the voice input by comparing the original voice features with the voice features of the target speaker.

The electronic device of claim 14, wherein the processor:
determines the utterance scenario to be the single-speaker scenario when a mean square error between the original voice features and the voice features of the target speaker is less than a threshold value; and
determines the utterance scenario to be the multi-speaker scenario when the mean square error between the original voice features and the voice features of the target speaker is greater than or equal to the threshold value.

The electronic device of claim 15, wherein the processor:
in response to the single-speaker scenario, obtains the final voice features of the target speaker by inputting the original voice features to a second network; or
in response to the multi-speaker scenario, obtains the final voice features of the target speaker by inputting the voice features of the target speaker to the second network.

The electronic device of claim 16, wherein the second network comprises:
a speaker embedding layer that receives the original voice features or the voice features of the target speaker and outputs intermediate embedded voice features of the target speaker; and
an attention statistics pooling layer that receives the intermediate embedded voice features of the target speaker and outputs the final voice features of the target speaker.

The electronic device of claim 17, wherein the processor:
calculates a similarity value between the final voice features of the target speaker and the final voice features of the user; and
determines whether the target speaker corresponds to the user based on a result of the calculation.

The electronic device of claim 18, wherein
the intermediate embedded voice features of the user are obtained by inputting, to the speaker embedding layer, voice features of the user obtained based on a voice input of the user, and
the final voice features of the user are obtained by inputting the intermediate embedded voice features of the user to the attention statistics pooling layer.

The electronic device of claim 19, wherein the first network and the second network are trained jointly with a third network that converts voice features of a speaker based on intermediate embedded voice features of the speaker.