KR20230166189A - Electronic apparatus for recommending voice preferred by user based on feature vector of speaker, and control method thereof

Info

Publication number: KR20230166189A
Application number: KR1020220065801A
Authority: KR
Inventors: 최자인; 박현아
Original assignee: 이어가다 주식회사
Priority date: 2022-05-30
Filing date: 2022-05-30
Publication date: 2023-12-07

Abstract

An electronic device is disclosed. The electronic device includes a memory containing information about feature vectors matched to the voices of a plurality of speakers, and a processor that, when a user input selecting the voice of a first speaker among the plurality of speakers is obtained, identifies a first feature vector of the first speaker's voice and selects a second speaker matched to a second feature vector similar to the identified first feature vector.

Description

Electronic device for recommending a user's preferred voice based on a feature vector for each speaker, and control method thereof {ELECTRONIC APPARATUS FOR RECOMMENDING VOICE PREFERRED BY USER BASED ON FEATURE VECTOR OF SPEAKER, AND CONTROL METHOD THEREOF}

The present disclosure relates to an electronic device for voice-based speaker recognition and, more specifically, to an electronic device that provides a user's preferred voice based on feature vectors stored for each speaker.

Voice Font is a TTS (Text-to-Speech) technology that uses AI to synthesize a speaker's voice and render text as speech.

A voice, like handwriting or a fingerprint, is a unique characteristic of an individual. When an audio signal carrying an individual's characteristics is generated and provided, the resulting sound can convey that individual's personality, emotion, and humanity.

Registered Patent Publication No. 10-1040585 (Web reader system using a TTS server, and method thereof)

The present disclosure provides an electronic device and a control method for recommending audio content composed of voices whose feature vectors are similar to those of a voice the user prefers.

The objects of the present disclosure are not limited to those mentioned above. Other objects and advantages of the present disclosure that are not mentioned can be understood from the following description and will be understood more clearly from the embodiments of the present disclosure. It will also be readily apparent that the objects and advantages of the present disclosure can be realized by the means recited in the claims and combinations thereof.

An electronic device according to an embodiment of the present disclosure includes a memory containing information about feature vectors matched to the voices of a plurality of speakers, and a processor that selects the voice of a first speaker among the plurality of speakers according to at least one user input, identifies a first feature vector of the selected first speaker's voice, and selects a second speaker matched to a second feature vector similar to the identified first feature vector.

The memory may include, for each of the plurality of speakers, audio content containing that speaker's voice. The processor may obtain the feature vector by analyzing the audio signal constituting the audio content.

In this case, the processor may recommend audio content including the voice of the selected second speaker.

Here, if, during a certain period after the audio content including the voice of the selected second speaker was recommended, that audio content was provided for less than a certain amount of time, the processor may select, from among the feature vectors of the plurality of speakers, at least one third feature vector whose similarity to the first feature vector is at or above a threshold and whose similarity to the second feature vector is below the threshold, and may recommend audio content including the voice of a third speaker matched to the selected third feature vector.

Meanwhile, the memory may include a plurality of voice generation models for converting at least one text into an audio signal matched to the voice of each of the plurality of speakers.

In this case, the processor may identify, among the voice generation models stored in the memory, the voice generation model matched to the selected second speaker, input at least one text into the identified voice generation model to obtain an audio signal including the second speaker's voice corresponding to the text, and recommend audio content including the obtained audio signal.

The memory may also include information about a plurality of groups into which the plurality of speakers are divided based on the similarity between their feature vectors. In this case, the processor may identify the group to which the first speaker belongs among the plurality of groups and select the second speaker from the identified group.

The memory may also include a speaker identification model for obtaining a speaker's feature vector from an audio signal. In this case, when an audio signal containing the voice of an arbitrary speaker is obtained, the processor may input the obtained audio signal into the speaker identification model to obtain the feature vector of the arbitrary speaker, and may identify the arbitrary speaker by comparing the obtained feature vector with the feature vector of each of the plurality of speakers.

In addition, the memory may include a plurality of voice recognition models for recognizing audio signals matched to the voice of each of the plurality of speakers and converting them into text. In this case, if the arbitrary speaker is identified as a third speaker included in the plurality of speakers, the processor may convert the audio signal containing the arbitrary speaker's voice into text using the voice recognition model matched to the third speaker. If the arbitrary speaker is identified as not included in the plurality of speakers, the processor may identify a fourth speaker matched to the feature vector, among the feature vectors of the plurality of speakers, that is similar to the feature vector of the arbitrary speaker, and may convert the audio signal containing the arbitrary speaker's voice into text using the voice recognition model matched to the fourth speaker.
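The model-routing rule described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `pick_recognition_model`, the Euclidean distance measure, and the `max_distance` cutoff used to decide whether a speaker counts as enrolled are all hypothetical and are not specified in the disclosure.

```python
import math

def pick_recognition_model(query, enrolled, models, max_distance=0.5):
    """Return (model, is_known) for an arbitrary speaker's feature vector.

    query:    feature vector of the arbitrary speaker
    enrolled: dict mapping speaker id -> stored feature vector
    models:   dict mapping speaker id -> that speaker's recognition model
    max_distance: assumed cutoff below which the speaker counts as enrolled
    """
    # Find the enrolled speaker whose feature vector is closest to the query.
    nearest = min(enrolled, key=lambda sid: math.dist(query, enrolled[sid]))
    # A close enough match is treated as the enrolled ("third") speaker;
    # otherwise the nearest speaker plays the role of the "fourth" speaker.
    is_known = math.dist(query, enrolled[nearest]) <= max_distance
    return models[nearest], is_known

enrolled = {"kim": [1.0, 2.0], "lee": [4.0, 6.0]}
models = {"kim": "asr-kim", "lee": "asr-lee"}
known = pick_recognition_model([1.1, 2.0], enrolled, models)
unknown = pick_recognition_model([3.0, 5.0], enrolled, models)
```

In both branches the recognition model of the closest enrolled speaker is used; the flag only records whether the speaker was treated as enrolled or as an unregistered speaker served by a similar speaker's model.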

According to an embodiment of the present disclosure, a method of controlling an electronic device that contains information about feature vectors matched to the voices of a plurality of speakers includes selecting the voice of a first speaker among the plurality of speakers according to a user input, and selecting a second speaker matched to a second feature vector similar to a first feature vector of the selected first speaker's voice.

By comparing the feature vectors of individual speakers, the electronic device and control method according to the present disclosure can recommend audio content that includes the voice of another speaker similar to the voice of a speaker the user prefers.

FIG. 1 is a block diagram illustrating the configuration of an electronic device according to an embodiment of the present disclosure;
FIG. 2 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present disclosure;
FIG. 3 is a block diagram illustrating an operation in which an electronic device according to an embodiment of the present disclosure divides information about a plurality of speakers into a plurality of groups according to the similarity of their feature vectors and manages the groups;
FIG. 4 is a block diagram illustrating an operation in which an electronic device according to an embodiment of the present disclosure independently operates a voice recognition model and/or a voice generation model for each speaker; and
FIG. 5 is a block diagram illustrating the configuration of an electronic device according to various embodiments of the present disclosure.

Before the present disclosure is described in detail, the conventions used in this specification and the drawings are explained.

First, the terms used in this specification and the claims are general terms selected in consideration of their functions in the various embodiments of the present disclosure. However, these terms may vary depending on the intention of those skilled in the art, legal or technical interpretation, the emergence of new technologies, and the like. Some terms have been arbitrarily selected by the applicant. Such terms may be interpreted as defined in this specification and, where no specific definition is given, may be interpreted based on the overall content of this specification and common technical knowledge in the art.

In addition, the same reference numerals or symbols in the drawings attached to this specification denote parts or components that perform substantially the same function. For convenience of description and understanding, the same reference numerals or symbols are used across different embodiments. That is, even if components having the same reference numeral appear in multiple drawings, the multiple drawings do not necessarily represent a single embodiment.

In addition, in this specification and the claims, terms including ordinal numbers such as "first" and "second" may be used to distinguish between components. These ordinal numbers are used to distinguish identical or similar components from one another, and the meaning of a term should not be construed as limited by their use. For example, the order of use or arrangement of a component combined with such an ordinal number should not be limited by that number. The ordinal numbers may be used interchangeably where necessary.

In this specification, singular expressions include plural expressions unless the context clearly indicates otherwise. In this application, terms such as "comprise" or "consist of" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and should be understood as not precluding the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

In embodiments of the present disclosure, terms such as "module", "unit", and "part" refer to components that perform at least one function or operation, and such components may be implemented in hardware, in software, or in a combination of hardware and software. In addition, a plurality of "modules", "units", "parts", and the like may be integrated into at least one module or chip and implemented by at least one processor, except where each needs to be implemented in separate, specific hardware.

In addition, in embodiments of the present disclosure, when a part is said to be connected to another part, this includes not only a direct connection but also an indirect connection through another medium. Further, when a part is said to include a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise.

FIG. 1 is a block diagram illustrating the configuration of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 1, the electronic device 100 includes a memory 110 and a processor 120.

The electronic device 100 may be implemented as a server. The server may be configured as a system including one or more computers, and may be a server that provides a voice font service through at least one application and/or web page.

In this case, the electronic device 100 may perform the various operations described below by communicating with the user terminals of various users (e.g., smartphones, tablet PCs, earphones/headphones).

Alternatively, the electronic device 100 may be an individual user's terminal device, such as a desktop PC, smartphone, tablet PC, or earphones/headphones.

The memory 110 stores an operating system (OS) for controlling the overall operation of the components of the electronic device 100, as well as at least one instruction or data item related to the components of the electronic device 100.

The memory 110 may include non-volatile memory such as ROM or flash memory, and may include volatile memory such as DRAM. The memory 110 may also include a hard disk, a solid state drive (SSD), or the like.

The memory 110 may include information about a plurality of speakers. Specifically, the memory 110 may include information such as the name, age, and gender of each of the plurality of speakers.

The memory 110 may also include, for each speaker, audio content composed of an audio signal containing that speaker's voice.

Referring to FIG. 1, the memory 110 may include information about feature vectors matched to the voices of the plurality of speakers. A feature vector is extracted from an audio signal containing a speaker's voice and is a numerical representation of the unique characteristics of that speaker's audio signal. Feature vectors may be obtained for each parameter using at least one artificial intelligence model trained to analyze audio signals.

For example, a feature vector may be obtained through time-series analysis or frequency-transform analysis of an audio signal, and this process may be performed by the processor 120 or by at least one external device. As a specific example, the feature vector may be a feature vector of various acoustic parameters extracted using the MFCC (Mel-Frequency Cepstral Coefficient) technique, but is not limited thereto.

The processor 120 controls the overall operation of the electronic device 100. Specifically, the processor 120 is connected to the memory 110 and may perform operations according to various embodiments of the present disclosure by executing at least one instruction stored in the memory 110.

The processor 120 may include a general-purpose processor such as a CPU, an AP, or a DSP (Digital Signal Processor), a graphics-dedicated processor such as a GPU or a VPU (Vision Processing Unit), or an artificial-intelligence-dedicated processor such as an NPU. An artificial-intelligence-dedicated processor may be designed with a hardware structure specialized for training or running a specific artificial intelligence model.

Referring to FIG. 1, the processor 120 may control a speaker management module 121, a content management module 122, a content recommendation module 123, and the like. These modules may be implemented in software and/or hardware and are components divided by function.

The speaker management module 121 manages, in addition to each speaker's information (e.g., name, age, gender), the feature vector related to each speaker's voice.

For example, when an audio signal (audio data) containing the voice of a first speaker is obtained, the speaker management module 121 may store the feature vector extracted from the obtained audio signal in association with the first speaker.

The speaker management module 121 may also manage group information that assigns speakers with similar feature vectors to the same group, as described later with reference to FIG. 3.

When an audio signal containing the voice of an arbitrary speaker is received, the speaker management module 121 may extract a feature vector from the received audio signal and compare it with each of the pre-stored feature vectors of the plurality of speakers. If a matching feature vector exists, the speaker management module 121 can recognize which speaker uttered the voice in the received audio signal.
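As a rough illustration of this comparison step, the following Python sketch matches an incoming feature vector against enrolled ones. The disclosure does not say how a "match" is decided, so the nearest-neighbor search, the Euclidean distance, the `max_distance` tolerance, and the name `identify_speaker` are all assumptions made for this example.

```python
import math

def identify_speaker(query, enrolled, max_distance=0.5):
    """Return the id of the enrolled speaker matching `query`, or None.

    query:    feature vector extracted from the received audio signal
    enrolled: dict mapping speaker id -> pre-stored feature vector
    max_distance: assumed tolerance; vectors farther apart do not match
    """
    best_id, best_dist = None, float("inf")
    for speaker_id, vector in enrolled.items():
        d = math.dist(query, vector)  # Euclidean distance between vectors
        if d < best_dist:
            best_id, best_dist = speaker_id, d
    # Only report a speaker when the closest vector is within tolerance.
    return best_id if best_dist <= max_distance else None

enrolled = {"kim": [1.0, 2.0], "lee": [4.0, 6.0]}
match = identify_speaker([1.1, 2.0], enrolled)
no_match = identify_speaker([10.0, 10.0], enrolled)
```

A tolerance is used here because two recordings of the same person rarely yield identical vectors; an exact-equality test would almost never succeed in practice.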

The content management module 122 manages audio content containing the voices of the plurality of speakers. The content management module 122 may include information about the text represented by the voice in each piece of audio content.

The content management module 122 may identify the user's preference for each piece of audio content. Specifically, the content management module 122 may identify, for each piece of audio content, information such as how long it has been provided to the user, how many times it has been provided, and how many times the user has selected it. For example, the longer a given piece of audio content has been provided (e.g., output through a speaker or an earphone/headphone jack), the higher the user's preference for that content may be calculated to be.

In addition, based on information about the speaker matched to the voice in each piece of audio content, the content management module 122 may identify the user's preference for each speaker's voice. For example, the more often audio content containing a particular speaker's voice is selected, the higher the user's preference for that speaker's voice may be calculated to be.
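One simple way to turn playback history into a per-speaker preference is sketched below in Python. It assumes that preference is just the total time a speaker's content has been provided; the disclosure also mentions provision counts and selection counts, which are left out here. The function names and the `(content_id, seconds)` log format are hypothetical.

```python
from collections import defaultdict

def speaker_preferences(play_log, content_speaker):
    """Sum provided time per speaker.

    play_log:        iterable of (content_id, seconds_provided) records
    content_speaker: dict mapping content id -> speaker id
    """
    totals = defaultdict(float)
    for content_id, seconds in play_log:
        totals[content_speaker[content_id]] += seconds
    return dict(totals)

def preferred_speaker(play_log, content_speaker):
    """Return the speaker with the largest total provided time, or None."""
    prefs = speaker_preferences(play_log, content_speaker)
    return max(prefs, key=prefs.get) if prefs else None

log = [("c1", 120), ("c2", 30), ("c3", 200)]
mapping = {"c1": "A", "c2": "B", "c3": "A"}
prefs = speaker_preferences(log, mapping)
top = preferred_speaker(log, mapping)
```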

The content recommendation module 123 recommends at least one piece of audio content to the user.

For example, the content recommendation module 123 may identify audio content for which the user's preference is relatively high. The content recommendation module 123 may then identify the speaker matched to the voice in the identified audio content and recommend other audio content containing that speaker's voice to the user.

Alternatively, the content recommendation module 123 may identify the voice of a speaker for whom the user's preference is high and recommend audio content containing the voice of another speaker whose feature vector is similar to that of the identified speaker's voice.

In this regard, FIG. 2 is a flowchart illustrating a control method of an electronic device according to an embodiment of the present disclosure.

Referring to FIG. 2, the electronic device 100 may select the voice of a first speaker among the plurality of speakers according to at least one user input (S210).

For example, the electronic device 100 may select the voice of the first speaker when the first speaker is selected by a user input, or when playback of audio content containing the first speaker's voice is requested by a user input.

As another example, when at least one piece of audio content containing the first speaker's voice has been played for longer or more frequently than other audio content, the electronic device 100 may select the voice of the first speaker.

The electronic device 100 may also calculate the user's preference for each speaker from the playback time and playback frequency of each piece of audio content, and may select the voice of a first speaker for whom the user's preference is high.

The electronic device 100 may then identify a first feature vector of the selected first speaker's voice and select a second speaker matched to a second feature vector similar to the identified first feature vector (S220).

A similarity measure may be used to determine whether feature vectors are similar.

The similarity between feature vectors may be defined according to the numerical difference between them for each parameter. For example, the smaller the per-parameter numerical difference between two feature vectors, the higher the calculated similarity.
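The disclosure does not fix a formula for this similarity. One possible choice, shown here purely as an assumption, is to combine the per-parameter differences into a Euclidean distance and map it to a score in (0, 1] so that smaller differences give higher similarity:

```python
import math

def similarity(a, b):
    """Score in (0, 1]; 1.0 means the two feature vectors are identical.

    The 1 / (1 + distance) mapping is an illustrative assumption, not a
    formula given in the disclosure.
    """
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    # Combine the per-parameter differences into one Euclidean distance.
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)

same = similarity([1.0, 2.0], [1.0, 2.0])
near = similarity([0.0, 0.0], [0.3, 0.4])   # distance 0.5
far = similarity([0.0, 0.0], [3.0, 4.0])    # distance 5.0
```

Other measures, such as cosine similarity, would satisfy the same "smaller difference, higher similarity" property for normalized vectors and could be substituted.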

In one embodiment, the electronic device 100 may identify a second feature vector whose similarity to the first feature vector is at or above a threshold, and select a second speaker matched to the second feature vector.
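A minimal Python sketch of this threshold-based selection follows. The similarity formula (1 / (1 + Euclidean distance)), the function names, and the threshold value are assumptions chosen for illustration; the disclosure only requires that the similarity be at or above some threshold.

```python
import math

def similarity(a, b):
    # Assumed similarity measure: 1.0 for identical vectors, lower when farther.
    return 1.0 / (1.0 + math.dist(a, b))

def select_similar_speakers(first_id, vectors, threshold):
    """Return ids of speakers whose vectors are similar to the first speaker's.

    vectors: dict mapping speaker id -> feature vector
    Results are ordered from most to least similar.
    """
    first = vectors[first_id]
    scored = [
        (similarity(first, vec), speaker_id)
        for speaker_id, vec in vectors.items()
        if speaker_id != first_id  # never recommend the first speaker itself
    ]
    return [sid for score, sid in sorted(scored, reverse=True) if score >= threshold]

vectors = {"A": [0.0, 0.0], "B": [0.1, 0.0], "C": [5.0, 5.0], "D": [0.5, 0.0]}
second_speakers = select_similar_speakers("A", vectors, threshold=0.6)
```

With these sample vectors, speakers B and D clear the threshold and C does not, so any element of the returned list could serve as the second speaker.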

In this case, the electronic device 100 may recommend audio content containing the second speaker's voice. Alternatively, the electronic device 100 may generate audio content by performing a TTS (Text-to-Speech) function on at least one text to convert it into the second speaker's voice, and provide the generated audio content.

The electronic device 100 may, however, recommend different audio content based on monitoring how the recommended audio content has been provided.

Specifically, for a certain period after audio content containing the second speaker's voice is recommended, the electronic device 100 may identify how long the audio content containing the selected second speaker's voice has been provided.

If the audio content has been provided for less than a certain amount of time, the electronic device 100 may select, from among the feature vectors of the plurality of speakers, at least one third feature vector whose similarity to the first feature vector is at or above the threshold and whose similarity to the second feature vector is below the threshold.

The electronic device 100 may then recommend other audio content containing the voice of a third speaker matched to the selected third feature vector. Likewise, the provision time may continue to be monitored after the other audio content is recommended, and yet another piece of audio content may be recommended as a result.

Conversely, if the audio content has been provided for at least the certain amount of time, the electronic device 100 may select at least one feature vector whose similarity to the second feature vector is at or above the threshold, and recommend additional audio content containing the voice of the speaker matched to the selected feature vector.
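The two branches of this follow-up rule can be sketched in Python as below. The similarity formula, the function name `follow_up_candidates`, and the choice to exclude the first and second speakers from the candidates are assumptions made for the example; the time and threshold values are arbitrary.

```python
import math

def similarity(a, b):
    # Assumed similarity measure: 1.0 for identical vectors, lower when farther.
    return 1.0 / (1.0 + math.dist(a, b))

def follow_up_candidates(first_id, second_id, vectors, threshold,
                         provided_seconds, min_seconds):
    """Pick speakers for the next recommendation based on engagement."""
    first, second = vectors[first_id], vectors[second_id]
    others = {sid: v for sid, v in vectors.items()
              if sid not in (first_id, second_id)}
    if provided_seconds < min_seconds:
        # Low engagement: stay near the first speaker, move away from the second.
        return sorted(sid for sid, v in others.items()
                      if similarity(first, v) >= threshold
                      and similarity(second, v) < threshold)
    # Enough engagement: look for more speakers near the second speaker.
    return sorted(sid for sid, v in others.items()
                  if similarity(second, v) >= threshold)

vectors = {"A": [0.0], "B": [0.5], "C": [-0.5], "D": [1.0]}
low = follow_up_candidates("A", "B", vectors, 0.6, provided_seconds=20, min_seconds=60)
high = follow_up_candidates("A", "B", vectors, 0.6, provided_seconds=300, min_seconds=60)
```

In the sample data, C sits on the opposite side of A from B, so it is chosen when the user barely listened to B's content, while D, which lies beyond B, is chosen when the user listened long enough.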

Meanwhile, in one embodiment, the electronic device 100 may divide the plurality of speakers into a plurality of groups. In doing so, the electronic device 100 may assign speakers with similar feature vectors to the same group.

In this regard, FIG. 3 is a block diagram illustrating an operation in which an electronic device according to an embodiment of the present disclosure divides information about a plurality of speakers into a plurality of groups according to the similarity of their feature vectors and manages the groups.

Referring to FIG. 3, the memory 110 may include information 301, 302 about each of a plurality of groups into which the plurality of speakers are divided based on the similarity between their feature vectors. Each group may include one or more speakers, and the memory 110 may also include, for each group, information about the average feature vector values of the group.

In one embodiment, assume that the first speaker has been selected according to a user input or the user's preference, as in step S210 described above. In this case, in step S220, the electronic device 100 may select a second speaker included in the group to which the first speaker belongs.
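The group-based selection described above can be sketched as follows. The disclosure does not say how groups are formed, so the greedy assignment against each group's average vector, the cosine metric, the `threshold` value, and the rule of picking the most similar group member are all assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_mean(vectors, members):
    """Average feature vector of a group, as may be stored per group in memory."""
    dims = len(vectors[members[0]])
    return [sum(vectors[m][i] for m in members) / len(members) for i in range(dims)]


def assign_groups(vectors, threshold=0.9):
    """Greedy grouping: join the first group whose mean is similar enough,
    otherwise start a new group (hypothetical grouping rule)."""
    groups = []
    for speaker in vectors:
        for members in groups:
            if cosine_similarity(vectors[speaker], group_mean(vectors, members)) >= threshold:
                members.append(speaker)
                break
        else:
            groups.append([speaker])
    return groups


def select_from_group(groups, vectors, first):
    """Pick a second speaker from the first speaker's group, or None if alone."""
    for members in groups:
        if first in members:
            others = [m for m in members if m != first]
            if not others:
                return None
            return max(others, key=lambda m: cosine_similarity(vectors[m], vectors[first]))
    return None
```

Precomputing the groups lets step S220 avoid comparing the first feature vector against every stored speaker at selection time.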

Meanwhile, FIG. 4 is a block diagram illustrating an operation in which an electronic device according to an embodiment of the present disclosure operates a speech recognition model and/or a speech generation model independently for each speaker. Referring to FIG. 4, the memory 110 may include artificial intelligence models for various functions related to audio signals, such as a speaker identification model 410, a speech recognition model 420, and a speech generation model 430.

The speaker identification model 410 is a model for obtaining a speaker's feature vector from an audio signal. The speaker identification model 410 may be trained to extract a feature vector related to one or more parameters through, for example, time-series analysis or frequency-transform analysis.

The speech recognition model 420 is a component for obtaining text by recognizing a speaker's voice included in an audio signal. A speech recognition model 420 may be provided for each speaker. For example, a first recognition model for recognizing the voice of the first speaker and a second recognition model for recognizing the voice of the second speaker may each be stored in the memory 110.

In one embodiment, when an audio signal is input, the speaker of the voice included in the audio signal may be identified by the speaker identification model 410. If the speaker of the voice in the audio signal is identified as the first speaker, the electronic device 100 may obtain text by inputting the audio signal into the first recognition model for recognizing the voice of the identified first speaker.

The speech recognition model 420 may include an acoustic model for extracting unit sounds such as phonemes, and a language model for generating letters or words by combining the unit sounds.

The speech recognition model 420 may have a separate acoustic model for each speaker. The speech recognition model 420 may also have a separate language model for each speaker, but a single language model may instead be shared by a plurality of speakers. This reflects the consideration that the way phonemes are combined may not itself differ from speaker to speaker.

For example, the first recognition model may operate using a first acoustic model, trained on audio signals containing the voice of the first speaker, together with the (common) language model. The second recognition model may operate using a second acoustic model, trained on audio signals containing the voice of the second speaker, together with the (common) language model.
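The arrangement above, with one acoustic model per speaker and one language model shared by all of them, can be sketched as a simple data structure. The type aliases, the `RecognitionModel` class, and `build_recognizers` are hypothetical names introduced only for illustration, and the models are treated as opaque callables rather than trained networks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Assumed interfaces: an acoustic model maps audio samples to unit sounds,
# and a language model maps unit sounds to text.
AcousticModel = Callable[[Sequence[float]], List[str]]
LanguageModel = Callable[[List[str]], str]


@dataclass(frozen=True)
class RecognitionModel:
    """Per-speaker recognition model: own acoustic model, shared language model."""
    acoustic: AcousticModel
    language: LanguageModel

    def transcribe(self, audio: Sequence[float]) -> str:
        return self.language(self.acoustic(audio))


def build_recognizers(acoustic_by_speaker: Dict[str, AcousticModel],
                      shared_language: LanguageModel) -> Dict[str, RecognitionModel]:
    """Pair each speaker's acoustic model with the single shared language model."""
    return {speaker: RecognitionModel(acoustic, shared_language)
            for speaker, acoustic in acoustic_by_speaker.items()}
```

Sharing the language model object means that only the acoustic part needs to be trained and stored per speaker.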

Meanwhile, in one embodiment, when an audio signal containing the voice of an arbitrary speaker is obtained, the electronic device 100 may input the obtained audio signal into the speaker identification model 410 to obtain a feature vector of the arbitrary speaker. In this case, the electronic device 100 may identify the arbitrary speaker by comparing the obtained feature vector with the feature vector of each of the plurality of speakers.

Here, if the arbitrary speaker is identified as a third speaker included in the plurality of speakers stored in the memory 110, the electronic device 100 may convert the audio signal containing the voice of the arbitrary speaker into text based on the speech recognition model matching the third speaker (e.g., a third recognition model) among the plurality of speech recognition models (e.g., the first recognition model, the second recognition model, and so on).

On the other hand, if the arbitrary speaker is identified as not being included in the plurality of speakers stored in the memory 110, the electronic device 100 may identify a fourth speaker matching the feature vector that, among the feature vectors of the plurality of speakers, is similar to the feature vector of the arbitrary speaker. In this case, the electronic device 100 may convert the audio signal containing the voice of the arbitrary speaker into text based on the speech recognition model matching the fourth speaker (e.g., a fourth recognition model) among the plurality of speech recognition models.

In this way, even when an audio signal containing the voice of a new speaker for whom no speech recognition model has yet been secured is received, the electronic device 100 can maintain speech recognition accuracy by using the speech recognition model of another speaker who has a similar feature vector.
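The routing described above (use the matching speaker's recognition model when the speaker is enrolled, otherwise fall back to the most similar enrolled speaker's model) can be sketched as follows. The cosine metric, the `match_threshold` value, and the rule that a speaker counts as enrolled when the best similarity reaches that threshold are assumptions; the disclosure does not state how enrollment is decided.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_recognizer(query_vec, enrolled, match_threshold=0.95):
    """Choose whose recognition model to use for an incoming voice.

    enrolled: dict mapping speaker id -> stored feature vector (hypothetical layout).
    Returns (speaker_id, is_enrolled). When is_enrolled is False, speaker_id is
    the most similar stored speaker, whose model serves as the fallback.
    """
    best = max(enrolled, key=lambda s: cosine_similarity(query_vec, enrolled[s]))
    is_enrolled = cosine_similarity(query_vec, enrolled[best]) >= match_threshold
    return best, is_enrolled
```

In both cases the nearest stored speaker's model is used; the flag only records whether the incoming voice was treated as one of the stored speakers.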

Meanwhile, referring to FIG. 4, the memory 110 may include a speech generation model 430 for converting text into an audio signal. The speech generation model corresponds to a model for performing a TTS function.

Specifically, the memory 110 may include a plurality of speech generation models, one for each speaker, for converting text into an audio signal. For example, a first generation model for converting text into the voice of the first speaker, a second generation model for converting text into the voice of the second speaker, and so on may be provided independently.

Based on the speech generation model described above, the electronic device 100 may obtain and combine the audio data matching each unit sound (e.g., phoneme) constituting the text. To this end, separate audio data may be provided for each speaker, and the speaker-specific speech generation models described above (e.g., the first generation model and the second generation model) may use the speaker-specific audio data matching each unit sound.

In this regard, in step S220 described above, the electronic device 100 may identify the speech generation model matching the selected second speaker (e.g., the second generation model) and input at least one text into the identified speech generation model to obtain an audio signal containing the voice of the second speaker corresponding to the text. That is, the electronic device 100 may itself generate audio content based on the voice of a second speaker that the user is likely to prefer. In this case, the electronic device 100 may recommend audio content including that audio signal to the user.
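The per-speaker generation described above, in which audio data matching each unit sound is looked up and combined, can be sketched in a deliberately simplified form. The `UnitBank` layout, the function names, and plain concatenation of sample lists are assumptions for illustration; converting text into unit sounds is taken as already done, and a practical TTS model would do far more than concatenate clips.

```python
from typing import Dict, List, Sequence

# Assumed layout: for one speaker, each unit sound maps to a short list of samples.
UnitBank = Dict[str, List[float]]


def synthesize(units: Sequence[str], bank: UnitBank) -> List[float]:
    """Concatenate the stored audio data for each unit sound, in order."""
    samples: List[float] = []
    for unit in units:
        samples.extend(bank[unit])  # raises KeyError if the speaker lacks this unit
    return samples


def generate_for_speaker(units: Sequence[str], speaker: str,
                         banks: Dict[str, UnitBank]) -> List[float]:
    """Pick the selected speaker's audio data and synthesize the unit sequence."""
    return synthesize(units, banks[speaker])
```

For example, `generate_for_speaker(["k", "a"], "speaker_2", banks)` would return the second speaker's samples for "k" followed by those for "a", given a hypothetical `banks` dictionary holding that speaker's data.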

Meanwhile, FIG. 5 is a block diagram illustrating the configuration of an electronic device according to various embodiments of the present disclosure.

Referring to FIG. 5, the electronic device 100 may include, in addition to the memory 110 and the processor 120, a communication unit 130, a microphone 140, a user input unit 150, an output unit 160, and the like.

The communication unit 130 may include circuits, modules, chips, and the like for communicating with at least one external device through various wired and wireless communication methods.

The communication unit 130 may be connected to external devices through various networks.

Depending on its area or scale, the network may be a personal area network (PAN), a local area network (LAN), a wide area network (WAN), or the like, and depending on its openness, the network may be an intranet, an extranet, the Internet, or the like.

The communication unit 130 may be connected to external devices through various wireless communication methods such as LTE (Long-Term Evolution), LTE-A (LTE-Advanced), 5G (5th Generation) mobile communication, CDMA (Code Division Multiple Access), WCDMA (Wideband CDMA), UMTS (Universal Mobile Telecommunications System), WiBro (Wireless Broadband), GSM (Global System for Mobile Communications), TDMA (Time Division Multiple Access), Wi-Fi, Wi-Fi Direct, Bluetooth, NFC (Near Field Communication), and Zigbee.

In addition, the communication unit 130 may be connected to external devices through wired communication methods such as Ethernet, an optical network, USB (Universal Serial Bus), and Thunderbolt.

When the electronic device 100 is a server, the electronic device 100 may receive, from a user terminal (e.g., a smartphone) through the communication unit 130, an audio signal matching the user's voice input through the user terminal. The electronic device 100 may also transmit audio content including the voice of a speaker to the user terminal through the communication unit 130. Further, the electronic device 100 may transmit, to the user terminal through the communication unit 130, an audio signal (audio content) in which at least one text has been converted into the voice of a speaker.

The microphone 140 is a component for converting sound into an electrical signal, and may obtain an audio signal by receiving the user's voice. The microphone 140 may be implemented in various forms, such as a dynamic microphone, a condenser microphone, or a ribbon microphone, but is not limited to these.

For example, when the electronic device 100 is a user terminal (e.g., a smartphone), the electronic device 100 may obtain an audio signal by receiving the voice of the user or of various speakers through the microphone 140.

The user input unit 150 is a component for receiving user commands or various information. The user input unit 150 may include at least one button, a touch pad, a camera, a microphone, and the like. The electronic device 100 may also receive user commands or information through at least one user input device (e.g., a keyboard or a mouse).

For example, the electronic device 100 may activate the microphone 140 according to a user command received through the user input unit 150 and receive the user's voice. In this case, the electronic device 100 may identify who the user (that is, the speaker) is based on the feature vector of the audio signal containing the user's voice, and may also obtain text matching the user's voice through speech recognition.

The output unit 160 is a component for audibly outputting various information. The output unit 160 may include a speaker, an earphone/headphone jack, and the like.

For example, the electronic device 100 may output, through the output unit 160, audio content requested by a user input.

Meanwhile, two or more of the various embodiments described above may be combined and implemented together, as long as they do not conflict with or contradict one another.

Meanwhile, the various embodiments described above may be implemented in a recording medium readable by a computer or a similar device, using software, hardware, or a combination thereof.

In a hardware implementation, the embodiments described in the present disclosure may be implemented using at least one of ASICs (application-specific integrated circuits), DSPs (digital signal processors), DSPDs (digital signal processing devices), PLDs (programmable logic devices), FPGAs (field-programmable gate arrays), processors, controllers, microcontrollers, microprocessors, and other electrical units for performing functions.

In some cases, the embodiments described in this specification may be implemented by the processor itself. In a software implementation, embodiments such as the procedures and functions described in this specification may be implemented as separate software modules. Each of these software modules may perform one or more of the functions and operations described in this specification.

Meanwhile, computer instructions or a computer program for performing the processing operations of the electronic device 100 according to the various embodiments of the present disclosure described above may be stored in a non-transitory computer-readable medium. When executed by a processor of a specific device, the computer instructions or computer program stored in such a non-transitory computer-readable medium cause that specific device to perform the processing operations of the electronic device 100 according to the various embodiments described above.

A non-transitory computer-readable medium refers to a medium that stores data semi-permanently and can be read by a device, rather than a medium that stores data only for a short time, such as a register, a cache, or a memory. Specific examples of non-transitory computer-readable media include a CD, a DVD, a hard disk, a Blu-ray disc, a USB drive, a memory card, and a ROM.

Although preferred embodiments of the present disclosure have been shown and described above, the present disclosure is not limited to the specific embodiments described. Various modifications may be made by those of ordinary skill in the technical field to which the disclosure pertains without departing from the gist of the disclosure as claimed in the claims, and such modifications should not be understood separately from the technical idea or prospect of the present disclosure.

100: electronic device 110: memory
120: processor

Claims

1. An electronic device comprising:
a memory containing information about feature vectors matching the voices of each of a plurality of speakers; and
a processor that selects a voice of a first speaker among the plurality of speakers according to at least one user input, identifies a first feature vector of the voice of the selected first speaker, and selects a second speaker matching a second feature vector similar to the identified first feature vector.

2. The electronic device of claim 1, wherein
the memory includes, for each of the plurality of speakers, audio content including the voice of that speaker, and
the processor obtains the feature vector by analyzing an audio signal constituting the audio content.

3. The electronic device of claim 2, wherein
the processor recommends audio content including the voice of the selected second speaker.

4. The electronic device of claim 3, wherein
the processor, if the audio content including the voice of the selected second speaker is provided for less than a predetermined time after that audio content is recommended, selects, from among the feature vectors of the plurality of speakers, at least one third feature vector whose similarity to the first feature vector is greater than or equal to a threshold and whose similarity to the second feature vector is less than the threshold, and
recommends audio content including the voice of a third speaker matching the selected third feature vector.

5. The electronic device of claim 1, wherein
the memory includes a plurality of speech generation models for converting at least one text into an audio signal matching the voice of each of the plurality of speakers.

6. The electronic device of claim 5, wherein
the processor identifies a speech generation model matching the selected second speaker among at least one speech generation model stored in the memory,
inputs at least one text into the identified speech generation model to obtain an audio signal including the voice of the second speaker corresponding to the text, and
recommends audio content including the obtained audio signal.

7. The electronic device of claim 1, wherein
the memory includes information about a plurality of groups into which the plurality of speakers are divided based on similarity between the feature vectors of the plurality of speakers, and
the processor identifies, among the plurality of groups, the group to which the first speaker belongs, and
selects the second speaker included in the identified group.

8. The electronic device of claim 1, wherein
the memory includes a speaker identification model for obtaining a speaker's feature vector from an audio signal, and
the processor, when an audio signal containing the voice of an arbitrary speaker is obtained, inputs the obtained audio signal into the speaker identification model to obtain a feature vector of the arbitrary speaker, and
identifies the arbitrary speaker by comparing the obtained feature vector with the feature vector of each of the plurality of speakers.

9. The electronic device of claim 8, wherein
the memory includes a plurality of speech recognition models for recognizing audio signals matching the voices of each of the plurality of speakers and converting them into text, and
the processor, if the arbitrary speaker is identified as a third speaker included in the plurality of speakers, converts the audio signal containing the voice of the arbitrary speaker into text based on the speech recognition model matching the third speaker among the plurality of speech recognition models, and
if the arbitrary speaker is identified as not being included in the plurality of speakers, identifies a fourth speaker matching a feature vector that, among the feature vectors of the plurality of speakers, is similar to the feature vector of the arbitrary speaker, and converts the audio signal containing the voice of the arbitrary speaker into text based on the speech recognition model matching the fourth speaker among the plurality of speech recognition models.

10. A method of controlling an electronic device that includes information about feature vectors matching the voices of each of a plurality of speakers, the method comprising:
selecting a voice of a first speaker among the plurality of speakers according to a user input; and
selecting a second speaker matching a second feature vector similar to a first feature vector of the voice of the selected first speaker.

11. A computer-readable medium storing a computer program that, when executed by a processor of an electronic device, causes the electronic device to perform the control method of claim 10.