KR20210049601A

KR20210049601A - Method and apparatus for providing voice service

Info

Publication number: KR20210049601A
Application number: KR1020190134100A
Authority: KR
Inventors: 부영종; 최송아
Original assignee: 삼성전자주식회사
Priority date: 2019-10-25
Filing date: 2019-10-25
Publication date: 2021-05-06
Also published as: WO2021080190A1

Abstract

The present disclosure relates to a method for providing a voice service and an apparatus therefor. A method of operating an electronic device according to one embodiment of the present disclosure includes: acquiring information about an object whose voice is to be stored; extracting the voice of the object by detecting an utterance state of the object from an image including the object, based on the acquired information about the object; storing a result of modeling the extracted voice, voice information related to the voice of the object, and the voice of the object; and providing a voice service based on the stored voice information. The voice service can thus be provided by extracting a speaker's voice from the image.

Description

Method and apparatus for providing voice service

Embodiments of the present disclosure relate to a method of providing a voice service, and more particularly, to a method, apparatus, and server for providing a voice service to a user by extracting a speaker's voice from an image.

In addition to traditional input methods using a keyboard, mouse, or touch input, recent electronic devices can support various input methods in which voice is input through a microphone. Electronic devices may use speech synthesis technology, for example text-to-speech (TTS), which divides voice input through a microphone into fixed speech units, labels the units and feeds them to a synthesizer, and then recombines only the required speech units on instruction to generate speech artificially. Technology that uses voice to perform biometric recognition is also being studied.

However, to provide a voice service, electronic devices have had to either acquire recordings of numerous sentences from the user through a microphone or use an unnatural, artificial voice. Because providing a natural-sounding voice service therefore requires a cumbersome recording process, research is needed on a voice service that avoids this inconvenience by extracting and storing a voice from an image that the electronic device can obtain from a server or the like.

A problem to be solved by the present disclosure is to provide a method, a device, and a system for providing a voice service by extracting a speaker's voice from an image.

A further problem is to provide a computer program product including a computer-readable recording medium on which a program for executing the method on a computer is recorded. The technical problems to be solved are not limited to those described above, and other technical problems may exist.

A method of operating an electronic device according to an embodiment may include: obtaining information on an object whose voice is to be stored; extracting the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information on the object; storing, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object; and providing a voice service based on the stored voice information.

The method of operating an electronic device according to an embodiment may further include obtaining, based on the obtained information on the object, the image including the object from among images including at least one of an image being played on the device, an image being played on an external device communicating with the device, and an image stored in the external device.

The method of operating an electronic device according to an embodiment may further include obtaining an image including a plurality of people, and the extracting of the voice of the object may include recognizing the object in the obtained image based on the obtained information on the object, and extracting the voice of the object by detecting the utterance state of the recognized object from the obtained image.

In the method of operating an electronic device according to an embodiment, the providing of the voice service may include providing a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information.

In the method of operating an electronic device according to an embodiment, the voice information related to the voice of the object may include at least one of a dialect, intonation, catchphrase, manner of speaking, pronunciation, parody, and line of dialogue associated with the voice of the object.

The method of operating an electronic device according to an embodiment may further include receiving the image including the object from a server, based on the obtained information on the object, and the extracting of the voice of the object may include playing the image received from the server at a multiple of normal speed (for example, double speed), so that the utterance state of the object is detected and the voice of the object is extracted at that speed.

In the method of operating an electronic device according to an embodiment, the extracting of the voice of the object may include: analyzing the image to obtain the screens on which the face of the object appears and the times at which the face appears; and extracting the voice of the object by detecting the utterance state of the object from the portions of the image corresponding to the obtained times, based on an analysis result including at least one of movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, and a script of the image.
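The detection step described above can be illustrated with a minimal sketch. It assumes that some upstream image analysis has already produced one record per analyzed screen; the `FrameAnalysis` type, its field names, and the `utterance_times` function are hypothetical and do not appear in the disclosure. A time is reported only when the face is visible and at least one of the listed cues is present.

```python
from dataclasses import dataclass


@dataclass
class FrameAnalysis:
    """Hypothetical per-screen analysis result (all fields are assumptions)."""
    time_s: float                 # time of the analyzed screen within the image
    face_visible: bool            # whether the object's face appears
    lips_moving: bool = False     # lip movement cue
    mouth_open: bool = False      # mouth shape cue
    teeth_visible: bool = False   # teeth recognition cue
    script_line: bool = False     # the script has a line for the object here


def utterance_times(frames):
    """Return the times at which the object is judged to be speaking."""
    times = []
    for frame in frames:
        if not frame.face_visible:
            continue  # only times at which the face appears are considered
        cues = (frame.lips_moving, frame.mouth_open,
                frame.teeth_visible, frame.script_line)
        if any(cues):  # "at least one of" the cues is enough
            times.append(frame.time_s)
    return times
```

How the cues are weighted is not stated in the disclosure; treating any single cue as sufficient is simply the most permissive reading of "at least one of".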

In the method of operating an electronic device according to an embodiment, the storing of the voice information and the voice of the object may include receiving a modeling result for the voice of the object from a server, and storing the voice information and the voice of the object so as to reflect the modeling result received from the server.

In the method of operating an electronic device according to an embodiment, the modeling result may be a modeling result that was extracted by another device and stored in the server after the other device determined that the voice of the object was likely to be requested by the device or by yet another device.

The method of operating an electronic device according to an embodiment may further include determining, based on the obtained information on the object, whether the voice of the object is likely to be requested by another device, and, when it is determined that the voice is likely to be requested by another device, transmitting the stored voice information and the stored voice of the object to a server.

The method of operating an electronic device according to an embodiment may further include receiving, from a user of the device, a response indicating whether the extracted voice is the same as the voice of the object, and determining, based on the received response, whether the extracted voice is the same as the voice of the object. In that case, the voice information and the voice of the object may be stored when the extracted voice is determined to be the same as the voice of the object.
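As a rough illustration of this verification step, the sketch below stores an extracted voice only after the user has confirmed that it matches the intended object. The function name, the dictionary-based store, and the shape of the stored record are all assumptions made for the example, not details from the disclosure.

```python
def store_if_confirmed(voice_store, object_name, voice, voice_info, user_confirms):
    """Store the extracted voice only when the user confirms it matches the object.

    voice_store is a plain dict standing in for whatever storage the device uses
    (an assumption); user_confirms is the user's yes/no response.
    """
    if not user_confirms:
        return False  # nothing is stored when the user rejects the match
    voice_store[object_name] = {"voice": voice, "voice_info": voice_info}
    return True
```

Returning a boolean lets the caller decide what to do after a rejection, for example extracting the voice again from a different part of the image.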

An electronic device according to another embodiment includes a communication unit, a memory storing one or more instructions, and at least one processor that executes the one or more instructions stored in the memory. The at least one processor may: obtain information on an object whose voice is to be stored; extract the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information on the object; store, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object; and provide a voice service based on the stored voice information.

A computer-readable recording medium according to yet another embodiment stores one or more programs, and the one or more programs may include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to: obtain information on an object whose voice is to be stored; extract the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information on the object; store, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object; and provide a voice service based on the stored voice information.

According to embodiments of the present disclosure, the inconvenience and time involved in repeating the same utterance many times in order to store a voice can be reduced.

In addition, according to embodiments of the present disclosure, the electronic device may provide a voice service using not only a voice stored in a database and the user's own voice, but also other voices.

Other features and advantages of the present disclosure will become apparent from the following detailed description and the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a method of providing a voice service by extracting a voice from an image played on an electronic device, according to an embodiment.
FIG. 2 is a flowchart illustrating a method of providing a voice service by extracting the voice of an object from an image, according to an embodiment.
FIG. 3A is a flowchart illustrating a method of providing a voice service by obtaining an image from an external device, according to an embodiment.
FIG. 3B is an exemplary diagram of a method of providing a voice service by obtaining an image from an external device, according to an embodiment.
FIG. 3C is an exemplary diagram of a method of providing a voice service in which voice recognition is performed by a display device, according to an embodiment.
FIG. 4 is an exemplary diagram of a method of providing a voice service by searching a server for an image of an object, according to an embodiment.
FIG. 5 is an exemplary diagram of a method of receiving, from a server, voice data extracted from an external device, according to an embodiment.
FIG. 6 is a flowchart illustrating a method of generating a text-to-speech (TTS) voice by modeling screens and voice from an image, according to an embodiment.
FIG. 7 is an exemplary diagram of a method of detecting an utterance state of an object in consideration of additional information of an image, according to an embodiment.
FIG. 8 is an exemplary diagram of a method of extracting a voice and having it verified in order to provide a voice service, according to an embodiment.
FIG. 9 is an exemplary diagram of a method of storing the voice of an object specified by a user from an image including a plurality of people, according to an embodiment.
FIG. 10 is a block diagram illustrating a configuration of an electronic device according to an embodiment.
FIG. 11 is a block diagram illustrating a detailed configuration of an electronic device according to an embodiment.

Hereinafter, embodiments will be described in detail with reference to the accompanying drawings so that those of ordinary skill in the art can easily implement them. However, the embodiments may be implemented in various forms and are not limited to the embodiments described herein. In the drawings, parts irrelevant to the description are omitted in order to describe the embodiments clearly, and similar reference numerals are assigned to similar parts throughout the specification.

The terms used in the present disclosure were selected, as far as possible, from general terms that are currently in wide use, in consideration of their functions in the present disclosure; however, they may vary according to the intention of those skilled in the art, precedent, the emergence of new technologies, and the like. In certain cases the applicant has chosen a term arbitrarily, in which case its meaning will be described in detail in the corresponding part of the description. Accordingly, the terms used in the present disclosure should be defined based on their meaning and on the overall content of the present disclosure, not simply on their names.

Singular expressions include plural expressions unless the context clearly indicates otherwise. Terms such as "comprise" or "have" are intended to specify the presence of a feature, number, step, operation, component, part, or combination thereof, and should be understood as not excluding in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof. In particular, numbers are given as examples to aid understanding, and the embodiments should not be understood as being limited by the numbers stated.

An expression such as "at least one of" modifies the entire list of elements and does not modify the individual elements of the list. For example, "at least one of A, B, and C" refers to only A, only B, only C, both A and B, both B and C, both A and C, all of A, B, and C, or a combination thereof.

Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present disclosure, a first component may be referred to as a second component, and similarly, a second component may be referred to as a first component. The term "and/or" includes a combination of a plurality of related items or any one of a plurality of related items.

In addition, the term "unit" as used in the present disclosure means software or a hardware component such as an FPGA or ASIC, and a "unit" performs certain roles. However, a "unit" is not limited to software or hardware. A "unit" may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Thus, as an example, a "unit" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, properties, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functions provided within the components and "units" may be combined into a smaller number of components and "units" or further separated into additional components and "units".

Throughout the present disclosure, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that it may further include other components, rather than excluding them, unless specifically stated otherwise.

The word "exemplary" is used in the present disclosure to mean "used as an example or illustration." Any embodiment described as "exemplary" in the present disclosure should not necessarily be construed as preferred or as having an advantage over other embodiments.

In addition, in the present disclosure, a voice service may mean any service provided using voice, including a speech synthesis service, a text-to-speech (TTS) service, a biometrics-based voice service, a voice service for the visually impaired, a voice service of a virtual assistant, a Voice Guide service, a voiceprint service, and the like.

An embodiment of the present disclosure may be represented by functional block configurations and various processing steps. Some or all of these functional blocks may be implemented with various numbers of hardware and/or software components that perform specific functions. For example, the functional blocks of the present disclosure may be implemented by one or more microprocessors, or by circuit configurations for a given function. The functional blocks of the present disclosure may also be implemented in various programming or scripting languages, and may be implemented as algorithms executed on one or more processors. In addition, the present disclosure may employ conventional techniques for electronic environment setting, signal processing, and/or data processing.

Hereinafter, the present disclosure will be described in detail with reference to the accompanying drawings.

FIG. 1 is a schematic diagram illustrating a method of providing a voice service by extracting a voice from an image played on an electronic device, according to an embodiment.

Referring to FIG. 1, while a user is watching an image 110 being played on the electronic device 100, the electronic device 100 may obtain, from the user, information on an object 120 whose voice is to be stored. The object 120 is the subject who speaks in the image 110. In an embodiment, based on the information on the object 120 whose voice is to be stored, the electronic device 100 may extract the voice of the object 120 by detecting the utterance state of the object 120 from the image 110 including the object 120. By modeling the extracted voice, the electronic device 100 may also obtain and store voice information related to the voice of the object 120 together with the voice of the object 120. The electronic device 100 may then provide a voice service to the user based on the stored voice information.

In an embodiment, the electronic device 100 may receive the user's voice through a microphone, and the microphone may convert the received voice into a voice signal. The electronic device 100 may perform automatic speech recognition (ASR) on the voice signal, which is an analog signal, to convert the speech into computer-readable text. That is, the electronic device 100 may perform a speech-to-text (STT) function. Based on the computer-readable text, the electronic device 100 may obtain information on the object whose voice is to be stored. Based on the obtained information, it may extract the voice of the object by detecting the utterance state of the object from an image including the object and, as a result of modeling the extracted voice, store voice information related to the voice of the object and the voice of the object. The electronic device 100 may then perform a text-to-speech (TTS) function to provide a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information. In an embodiment, the electronic device 100 may include a processor that performs the STT function and a processor that performs the TTS function. In another embodiment, the electronic device 100 may include a single processor that performs both the STT function and the TTS function.

In another embodiment, some or all of the speech recognition or speech synthesis process may be performed by a server. The server may include an STT server and a TTS server, or may be a single server capable of performing both the STT function and the TTS function.

For example, while playing the image 110 including the object 120, the electronic device 100 may receive from the user the input message "Save Baek Jong-won's voice". The input message may include 'Baek Jong-won', the name of the object 120, as information on the object 120 whose voice is to be stored. That is, the electronic device 100 may obtain the name of the object 120 from the input message received from the user as the information on the object 120 whose voice is to be stored. Based on the name of the object 120, the electronic device 100 may detect from the image 110 whether the object 120 is in the utterance state. In an embodiment, the electronic device 100 may detect whether the object 120 is in the utterance state in consideration of analysis of the face of the object 120, movement of the lips of the object 120, the shape of the mouth of the object 120, recognition of the teeth of the object 120, the presence or absence of subtitles in the image 110, a script corresponding to the image 110, and the like; however, the factors considered in detecting the utterance state are not limited thereto.
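To make the name-extraction step concrete, the following sketch pulls the object name out of an already transcribed command. It assumes the command follows the English pattern "Save <name>'s voice"; the pattern, the constant, and the function name are illustrative assumptions, and an actual device would more likely rely on a natural language understanding component than on a fixed pattern.

```python
import re

# Hypothetical command pattern; only "Save <name>'s voice" is recognized.
SAVE_VOICE_PATTERN = re.compile(
    r"^\s*save\s+(?P<name>.+?)'s\s+voice\W*$", re.IGNORECASE
)


def extract_object_name(message):
    """Return the object name from a save-voice command, or None if it does not match."""
    match = SAVE_VOICE_PATTERN.match(message)
    return match.group("name") if match else None
```

Returning `None` for unrelated input lets the caller fall back to other command handlers instead of treating every utterance as a request to store a voice.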

In an embodiment, the electronic device 100 may extract the voice of the object 120 in an utterance interval, that is, an interval in which the object 120 is detected to be in the utterance state. For example, after starting to analyze the image 110, the electronic device 100 may capture the screen every second for a predetermined time and detect whether the object 120 is in the utterance state by analyzing whether the object 120 on the screen has its mouth open, the shape of the lips, whether subtitles are present, whether teeth are visible on the screen, the script, and the like. The electronic device 100 may then extract the voice of the object 120 in the interval in which the object is detected to be in the utterance state; for example, it may record the voice for one second from the time at which the utterance state is detected. This is only an example, and the methods of detecting the utterance state and recording the voice are not limited thereto: the electronic device 100 may capture the screen at intervals shorter or longer than one second, and may record the voice for a time shorter or longer than one second. In addition, rather than checking for the utterance state only during a predetermined time, the electronic device 100 may keep detecting the utterance state and recording the corresponding voice until TTS (text-to-speech) can be generated from the extracted voice.

In one embodiment, the electronic device 100 may model the voice according to the speech state, map it to the object, and store the voice of the object 120, voice information related to the voice of the object 120, and the like. That is, the electronic device 100 may model and merge the extracted screens, the times at which the speech state was detected, the extracted voice, and the like. The voice information related to the voice of the object 120 may include, but is not limited to, the dialect, intonation, catchphrases, manner of speaking, pronunciation, parodies, and lines associated with the voice of the object 120. For example, while extracting from the image 110 the voice of the object 120 saying "알았쥬?" in a speech section of the object 120, the electronic device 100 may obtain voice information indicating that the object 120 uses the ending '~유' instead of '~요', the general polite ending of standard Korean, and may store that voice information together with the voice of the object 120.

In one embodiment, the electronic device 100 may provide a voice service 130 based on the stored voice information. For example, the user may ask the electronic device 100 for an alarm that uses the voice of the object 120. The electronic device 100 may then provide the alarm to the user in the voice of the object 120 at a specific time. In one embodiment, the electronic device 100 may take into account the voice information related to the voice of the object 120 when providing the voice service 130. For example, instead of "일어나세요" ("Wake up"), the electronic device 100 may provide the user with a voice service that outputs, in the voice of the object, the text "일어나시쥬?", which is an output text determined based on the voice information.

Meanwhile, the electronic device 100 may be, but is not limited to, a smartphone, a tablet PC, a PC, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book object recognition device, an object recognition device for digital broadcasting, a kiosk, an MP3 player, a digital camera, a robot cleaner, a home appliance, or another mobile or non-mobile computing device. The electronic device 100 may also be a wearable device having a communication function and a data processing function, such as a watch, glasses, a hair band, or a ring.

FIG. 2 is a flowchart illustrating a method of providing a voice service by extracting the voice of an object from an image, according to one embodiment.

Referring to FIG. 2, in operation 210, the electronic device may obtain information about an object whose voice is to be stored. In one embodiment, the electronic device may receive from the user, as the information about the object whose voice is to be stored, information about the object's full name, photograph, designation, trade name, portrait, stage name, pen name, abbreviated name, and the like. Based on the information received from the user, the electronic device may obtain an image that includes the object. For example, the electronic device may obtain, as the image that includes the object, an image being played on the electronic device, an image being played on an external device capable of communicating with the electronic device, an image stored in the external device, and the like, but the image is not limited thereto. In one embodiment, the external device capable of communicating with the electronic device may include, but is not limited to, a server, a smartphone, a tablet PC, a PC, a TV, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book object recognition device, an object recognition device for digital broadcasting, a kiosk, an MP3 player, a digital camera, a robot cleaner, a home appliance, other mobile or non-mobile computing devices, and a watch, glasses, a hair band, or a ring having a communication function and a data processing function.

In operation 230, based on the obtained information about the object, the electronic device may detect from the image that includes the object whether the object is in the speech state. In one embodiment, the electronic device may analyze the image that includes the object to obtain the screens on which the object's face appears and the times at which the object's face appears. The electronic device may also extract the voice of the object corresponding to the times at which the object's face appears. For example, the electronic device may detect from the image whether the object is in the speech state by considering movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, the presence or absence of subtitles in the image, a script of the image, and the like. In one embodiment, the image that includes the object may also include people other than the object. Based on the information about the object, the electronic device may recognize the object targeted for voice extraction in an image that includes a plurality of people, for example by using face recognition technology. The electronic device may then detect whether the recognized object, among the plurality of people, is in the speech state.

In operation 250, the electronic device may extract the voice of the object in a speech-state section in which the object is detected as being in the speech state. In one embodiment, after receiving from the user a request to extract the voice of the object, the electronic device may detect for a certain period whether the object is in the speech state and extract the voice of the object in the speech-state section in which the object is detected as being in the speech state. For example, the electronic device may start analyzing the image after receiving from the user a request to extract the voice of the object. The electronic device may capture the screen once per second for a certain period and analyze the captured screens to detect whether the object is in the speech state. For example, the electronic device may record the voice of the object for one second starting from a section that is determined, from the once-per-second screen captures, to be in the speech state. Alternatively, the electronic device may record the voice of the object from a section determined to be in the speech state until a section determined not to be in the speech state. In another embodiment, the electronic device may continue extracting the voice of the object until text-to-speech (TTS) can be generated from the extracted voice.
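As a non-limiting illustration, the alternative in which recording runs from a capture judged to be in the speech state until a capture judged not to be in the speech state can be sketched as follows. The function name speech_intervals, the boolean per-capture flags, and the period parameter are assumptions introduced only for this sketch; the disclosure does not specify how the per-capture decision is represented.

```python
from typing import List, Tuple


def speech_intervals(flags: List[bool], period: float = 1.0) -> List[Tuple[float, float]]:
    """Merge consecutive speech-state captures into (start, end) intervals in seconds.

    flags[i] is True when the capture taken at time i * period was judged
    to show the object in the speech state (hypothetical representation).
    """
    intervals: List[Tuple[float, float]] = []
    start = None  # start time of the interval currently being built
    for i, speaking in enumerate(flags):
        if speaking and start is None:
            start = i * period                      # speech state begins
        elif not speaking and start is not None:
            intervals.append((start, i * period))   # speech state ends
            start = None
    if start is not None:
        # The last capture was still in the speech state; close the interval
        # one period after it, mirroring the one-second recording example.
        intervals.append((start, len(flags) * period))
    return intervals


# Captures at 0 s .. 4 s; the object speaks at 1 s, 2 s and 4 s.
print(speech_intervals([False, True, True, False, True]))
```

The returned intervals are the sections from which the audio track would be cut; a single isolated speech-state capture yields an interval of one period, which corresponds to the one-second recording described above.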

In one embodiment, in the speech-state section in which the object is detected as being in the speech state, the electronic device may distinguish the voice of the object from background music, noise, and the like, and extract only the voice of the object. For example, the electronic device may remove noise outside the frequency band in which the components of the human voice are mainly located, or may remove background music by inverting its phase and overlaying it on the audio; however, the method of removing background music or noise is not limited thereto, and other conventional techniques may be used.
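As a non-limiting illustration, the phase-inversion approach can be sketched as follows. The sketch assumes that the background music is available as a separate track that is sample-aligned with the mixed audio and has the same level; this is a simplification introduced for illustration, and the function name cancel_background is hypothetical.

```python
from typing import List


def cancel_background(mixture: List[float], background: List[float]) -> List[float]:
    """Remove a known background track by overlaying its phase-inverted copy.

    Both inputs are sequences of audio samples. Inverting the phase of the
    background (negating every sample) and adding it to the mixture leaves
    only the part of the mixture that is not background.
    """
    if len(mixture) != len(background):
        raise ValueError("tracks must be sample-aligned and of equal length")
    inverted = [-sample for sample in background]       # phase inversion
    return [m + inv for m, inv in zip(mixture, inverted)]


voice = [0.5, -0.25, 0.0]
music = [0.25, 0.25, -0.5]
mixed = [v + b for v, b in zip(voice, music)]
print(cancel_background(mixed, music))
```

In practice the background track is rarely known exactly, so any residual misalignment or level difference leaves part of the music in the output; that limitation is one reason the paragraph above leaves other conventional techniques open.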

In operation 270, the electronic device may store the result of modeling the extracted voice, voice information related to the voice of the object, and the voice of the object. In one embodiment, the electronic device may model the screens for the speech-state section detected in operation 230 together with the voice extracted in operation 250. That is, the electronic device may extract the voice of the object by matching the section in which the object is detected as being in the speech state with the extracted voice of the object. In addition to the modeled voice, the electronic device may store voice information related to the voice of the object. For example, the voice information may be, but is not limited to, the regional accent, dialect, intonation, catchphrases, manner of speaking, pronunciation, parodies, and lines associated with the voice of the object.

In operation 290, the electronic device may provide a voice service based on the stored voice information. For example, when the electronic device receives from the user a notification request, a request to read a book using text-to-speech (TTS), a request for a virtual assistant that uses the object's voice, or the like, the electronic device may provide the user with the voice service corresponding to the request. In one embodiment, the electronic device may provide a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information. For example, from the text "안녕하세요?" ("Hello?"), the electronic device may determine, based on the voice information, the output text "안녕하시쥬?" and provide a voice service that outputs the determined output text in the voice of the object.
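As a non-limiting illustration, determining the output text from stored voice information can be sketched as a set of sentence-ending substitution rules. The rule table ENDING_RULES, the function name stylize, and the representation of voice information as ordered (standard ending, styled ending) pairs are assumptions introduced only for this sketch; the two rules are inferred from the "안녕하세요?" and '~요' examples in this description.

```python
from typing import List, Tuple

# Hypothetical voice information: ordered (standard ending, styled ending) pairs.
# The more specific rule comes first so that "세요" is not caught by the "요" rule.
ENDING_RULES: List[Tuple[str, str]] = [("세요", "시쥬"), ("요", "유")]


def stylize(text: str, rules: List[Tuple[str, str]] = ENDING_RULES) -> str:
    """Rewrite the sentence ending of `text` according to the stored rules."""
    body = text.rstrip("?!.")          # sentence without trailing punctuation
    punctuation = text[len(body):]     # punctuation to re-attach afterwards
    for standard, styled in rules:
        if body.endswith(standard):
            return body[: len(body) - len(standard)] + styled + punctuation
    return text                        # no rule applies; keep the input text


print(stylize("안녕하세요?"))
print(stylize("알았어요"))
```

The stylized text would then be passed to the TTS model built from the object's voice. A rule table of this kind covers only sentence endings; catchphrases or intonation, which are also listed as voice information, would need other representations.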

In one embodiment, based on the information about the object, the electronic device may obtain an image that includes the object from among an image being played on the electronic device, an image being played on an external device communicating with the electronic device, an image stored in the external device, and the like. In one embodiment, the external device capable of communicating with the electronic device may include, but is not limited to, a server, a smartphone, a tablet PC, a PC, a TV, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book object recognition device, an object recognition device for digital broadcasting, a kiosk, an MP3 player, a digital camera, a robot cleaner, a home appliance, other mobile or non-mobile computing devices, and a watch, glasses, a hair band, or a ring having a communication function and a data processing function.

Also, in one embodiment, the electronic device may obtain an image that includes a plurality of people, recognize the object in the obtained image based on the obtained information about the object, and extract the voice of the object by detecting the speech state of the recognized object from the obtained image.

Also, in one embodiment, the electronic device may provide a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information.

Also, in one embodiment, based on the obtained information about the object, the electronic device may receive an image that includes the object from a server and play the received image at a multiplied playback speed, thereby detecting the speech state of the object and extracting the voice of the object at that multiplied speed.

Also, in one embodiment, the electronic device may analyze the image to obtain the screens on which the object's face appears and the times at which the object's face appears, and may extract the voice of the object by detecting the speech state of the object from the image corresponding to the obtained times, considering at least one of movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, and a script of the image.

Also, in one embodiment, the electronic device may receive from a server a modeling result for the voice of the object and store the voice information and the voice of the object so as to reflect the modeling result received from the server.

Also, in one embodiment, the modeling result may be a modeling result that was extracted by another device and stored in the server as a result of that other device determining that the voice of the object is likely to be requested by yet another device or by the device.

Also, in one embodiment, the electronic device may determine, based on the obtained information about the object, whether the voice of the object is likely to be requested by another device and, upon determining that it is, transmit the stored voice information and the stored voice of the object to the server.

Also, in one embodiment, the electronic device may receive from the user a response as to whether the extracted voice is the same as the voice of the object, determine based on the received response whether the extracted voice is the same as the voice of the object, and, upon determining that they are the same, store the voice information and the voice of the object.

FIG. 3A is a flowchart illustrating a method of providing a voice service by obtaining an image from an external device, according to one embodiment. FIG. 3B is an exemplary diagram of a method of providing a voice service by obtaining an image from an external device, according to one embodiment. The method of providing a voice service by obtaining an image from an external device according to one embodiment is described below with reference to FIGS. 3A and 3B.

Referring to FIGS. 3A and 3B, in operation 301, the electronic device 100 may obtain from the user information 310 about an object whose voice is to be stored. In one embodiment, the electronic device 100 may obtain, together with the information 310 about the object, information about the type of voice service the user wishes to receive. The electronic device 100 may obtain the information 310 about the object and the voice service type at the same time, may obtain only the information 310 about the object, or may obtain the voice service type first and the information 310 about the object afterwards. For example, the virtual assistant of the electronic device 100 may receive from the user a voice recognized as "BTS 진 목소리로 책 읽어줘" ("Read a book in BTS Jin's voice"). From the received voice, the electronic device 100 may obtain "BTS Jin" as the information 310 about the object and "reading a book" as the voice service type. That is, the information 310 about the object may be the name of the object 340, but is not limited thereto, and may be the full name, photograph, designation, trade name, portrait, stage name, pen name, abbreviated name, or the like of the object 340.

In operation 303, the electronic device 100 may receive an image 330 from an external device 320. In one embodiment, the electronic device 100 may receive the image 330 from an external device 320 that communicates with the electronic device 100 by wire or wirelessly. Using an output port or wireless communication, the electronic device 100 may exchange the information 310 about the object, an image 350, voice data, and the like with the external device 320. That is, the electronic device 100 may receive the image 330 that includes the object from the external device 320 and extract the voice of the object 340 from the received image 330.

In another embodiment, the external device 320 may receive the information 310 about the object from the electronic device 100, extract the voice of the object 340 based on the information 310 about the object, and transmit to the electronic device 100 the voice data in which the voice of the object 340 has been extracted by the external device 320. Using an output port or wireless communication, the external device 320 may exchange the information 310 about the object, the image 350, voice data, and the like with the electronic device 100 to which it is externally connected. That is, the electronic device 100 may transmit the information 310 about the object to the external device 320, and the external device 320 may extract the voice of the object 340 from the image 330 being played on the external device 320 and transmit the voice data to the electronic device 100.

In one embodiment, the output port may be HDMI, DP, or Thunderbolt, which transmit video and audio signals simultaneously, or may include ports for outputting the video signal and the audio signal separately. The wireless communication may include, but is not limited to, short-range communication such as Bluetooth communication, Bluetooth Low Energy (BLE) communication, near field communication (NFC), WLAN (Wi-Fi) communication, Zigbee communication, infrared (IrDA, Infrared Data Association) communication, Wi-Fi Direct (WFD) communication, ultra-wideband (UWB) communication, and Ant+ communication; mobile communication that transmits and receives wireless signals to and from at least one of a base station, an external terminal, and a server on a mobile communication network; and broadcast communication that receives broadcast signals and/or broadcast-related information from outside through a broadcast channel. In one embodiment, the external device 320 may include, but is not limited to, a server, a smartphone, a tablet PC, a PC, a TV, a smart TV, a mobile phone, a personal digital assistant (PDA), a speaker, a laptop, a media player, a micro server, an e-book object recognition device, an object recognition device for digital broadcasting, a kiosk, an MP3 player, a digital camera, a robot cleaner, a home appliance, other mobile or non-mobile computing devices, and a watch, glasses, a hair band, or a ring having a communication function and a data processing function. For example, the electronic device 100 may be a smartphone and the external device 320 may be a TV. The electronic device 100 may use Wi-Fi, Bluetooth, infrared communication, or the like to exchange data with the TV serving as the external device 320 or to control that TV. An application for controlling the TV may be installed on the smartphone, or the TV may be controlled by other means.

In operation 305, based on the information 310 about the object, the electronic device 100 may detect from the image 350 received from the external device 320 whether the object 340 is in the speech state. In one embodiment, the electronic device 100 may analyze the received image 350 to obtain the screens on which the face of the object 340 appears and the times at which the face of the object 340 appears. The electronic device 100 may also extract the voice of the object corresponding to the times at which the face of the object 340 appears. For example, the electronic device 100 may detect from the image 350 whether the object 340 is in the speech state by considering movement of the lips of the object 340, the shape of the mouth of the object 340, recognition of the teeth of the object 340, the presence or absence of subtitles in the image 350, a script of the image 350, and the like. In one embodiment, the image 350 that includes the object may also include people other than the object 340. Based on the information 310 about the object, the electronic device 100 may recognize the object 340 targeted for voice extraction in an image that includes a plurality of people, for example by using face recognition technology. The electronic device 100 may then detect whether the recognized object 340, among the plurality of people, is in the speech state. The electronic device 100 may capture the image 350 that includes the object at regular intervals to obtain N captured images. By determining whether each of the N captured images includes the object 340, the electronic device 100 may determine, among the N captured images, the M captured images that include the object. From the M captured images that include the object 340, the electronic device 100 may then determine K speech sections in which the object 340 is detected as being in the speech state.
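As a non-limiting illustration, the narrowing from N captured images to the M that include the object and then to the K in which the object is in the speech state can be sketched as two successive filters. The CapturedFrame type, its fields, and the function speech_capture_times are hypothetical; the lists of recognized and speaking people stand in for the outputs of a face recognizer and a speech-state detector that the disclosure does not specify.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class CapturedFrame:
    time_s: float                                        # capture time in seconds
    people: List[str] = field(default_factory=list)      # recognized people (assumed recognizer output)
    speaking: List[str] = field(default_factory=list)    # people judged to be speaking (assumed detector output)


def speech_capture_times(frames: List[CapturedFrame], target: str) -> Tuple[int, int, List[float]]:
    """Return N, M, and the capture times of the K speech-state frames for `target`."""
    with_target = [f for f in frames if target in f.people]       # M of the N frames
    in_speech = [f for f in with_target if target in f.speaking]  # K of the M frames
    return len(frames), len(with_target), [f.time_s for f in in_speech]


frames = [
    CapturedFrame(0.0, ["A"], []),
    CapturedFrame(1.0, ["A", "B"], ["B"]),   # the target is visible but someone else speaks
    CapturedFrame(2.0, ["B"], ["B"]),        # the target is not on screen
    CapturedFrame(3.0, ["A", "B"], ["A"]),
]
print(speech_capture_times(frames, "A"))
```

Filtering on presence before filtering on speech state means the speech-state check only has to run on the M frames in which the object actually appears, and frames where another person is speaking are excluded from the K capture times used to cut the audio.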

In operation 307, the electronic device 100 may extract the voice of the object 340 from the received image 330 and store voice information related to the voice of the object and the voice of the object. In one embodiment, the electronic device 100 may obtain the voice of the object 340 in the sections corresponding to the times at which the K images were captured. The electronic device 100 may also store the voice of the object 340 and voice information related to the voice of the object 340. In one embodiment, the electronic device 100 may itself extract the voice of the object 340 and store both the voice of the object 340 and the voice information related to the voice of the object 340; in another embodiment, the electronic device 100 may obtain the voice of the object 340 and the voice information related to the voice of the object 340 from the external device 320. In yet another embodiment, the electronic device 100 may receive only the voice of the object 340 from the external device 320, extract the voice information related to the voice from the received voice of the object 340, and store the voice and the voice information of the object 340.

In operation 309, the electronic device 100 may provide a voice service based on the stored voice information. For example, based on the user's request for a book reading service, the electronic device 100 may provide the user with a service that reads a book aloud in the voice of the object 340.

FIG. 3C is an exemplary diagram of a method of providing a voice service in which voice recognition is performed by a display device, according to one embodiment.

Referring to FIG. 3C, the display device 360 may receive a voice command from the user. In one embodiment, the display device 360 may receive the voice command directly from the user. In another embodiment, a remote controller 390 may receive the voice command from the user, and the display device 360 may receive the voice command from the remote controller 390.

In one embodiment, the display device 360 may receive the user's voice command through an internal microphone of the display device 360. The display device 360 may perform far-field voice recognition on the user's voice through its internal microphone. For example, while watching an image 370 that includes an object 380 on the display device 360, the user may say "BTS 진 목소리 저장해줘" ("Save BTS Jin's voice") to the display device 360. The display device 360 may perform far-field voice recognition on the user's voice to obtain information about the object whose voice is to be stored. Based on the information about the object, the display device 360 may detect the speech state of the object from the image that includes the object, extract the voice of the object, and model it. Accordingly, the display device 360 may store the voice of the object 380 and voice information related to the voice of the object.

In another embodiment, an internal microphone of the remote controller 390 may receive the user's voice command. The display device 360 is a device capable of wireless or wired communication with the remote controller 390, and may receive the user's voice command from the remote controller 390. Accordingly, the display device 360 may store the voice and voice information of the object 380 based on the user's voice command. For example, while watching the image 370 including the object 380 on the display device 360, the user may say "Save BTS Jin's voice" to the remote controller 390. The remote controller 390 may perform voice recognition on the user's voice and transmit the user's voice command to the display device 360. Based on the received voice command, the display device 360 may obtain information about the object whose voice is to be stored. In addition, based on the information about the object, the display device 360 may detect the speech state of the object from the image including the object, and may extract and model the voice of the object. Accordingly, the display device 360 may store the voice of the object 380 and voice information related to the voice of the object.

FIG. 4 is an exemplary diagram of a method of providing a voice service by retrieving an image of an object from a server, according to an embodiment.

Referring to FIG. 4, in operation 401, the electronic device 100 may obtain, from a user, information 410 about an object whose voice is to be stored. For example, the user may say "Save Baek Jong-won's voice" to the virtual assistant of the electronic device 100, and the electronic device 100 may obtain the information 410 about the object "Baek Jong-won" from the input voice. In an embodiment, operation 401 may correspond to operation 301 described above with reference to FIG. 3.

In operation 403, the electronic device 100 may receive the voice data 430 from the server 420. In an embodiment, the electronic device 100 may receive an image from the server 420 and extract the voice of the object; the server 420 may model the voice of the object, after which the electronic device 100 receives the modeling result from the server and stores the voice and voice information of the object; or the server 420 may extract the voice and voice information of the object, after which the electronic device 100 receives them from the server 420. However, the type of data that the electronic device 100 receives from the server 420 is not limited thereto. For example, when the electronic device 100 receives a command from the user to extract the voice of an object, the electronic device 100 may search a video sharing site, a portal site, or the like using the information 410 about the object to obtain an image of the object, and may obtain the voice data 430 of the object from that image. As another example, the server may obtain an image of the object, extract the voice of the object, and transmit it to the electronic device, so that the electronic device obtains the voice data 430 of the object from the server. According to an embodiment, the electronic device 100 may automatically obtain and store the voice and voice information of the object in the background, without the user being clearly aware of, for example, which image the voice of the object is extracted from.

In an embodiment, based on the information 410 about the object, the electronic device 100 may receive an image including the object from the server 420, detect whether the object is in a speech state by playing the received image at a multiplied (fast) playback speed, and extract the voice of the object at that speed according to the detected speech state. According to an embodiment, by playing the image at a multiplied speed, the electronic device 100 or the server 420 may quickly obtain the voice of the object and the voice information related to the voice of the object.

In operation 405, the electronic device 100 may store the voice of the object and the voice information related to the voice of the object, based on the received voice data 430. In an embodiment, the voice data 430 may include both the voice of the object and the voice information related to the voice of the object. In another embodiment, the voice data 430 may include only the voice of the object, and the electronic device 100 may obtain the voice information related to the voice of the object 320 based on the received voice data 430. In yet another embodiment, the voice data 430 may include only an image including the object, and the electronic device 100 may extract and store the voice of the object and the voice information related to the voice of the object from the received image.
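The three variants of the voice data 430 described above can be pictured as one record with optional fields. The following is only a minimal sketch: the `VoiceData` type, the `resolve_voice_data` function, and the two callables `extract_voice` and `derive_info` are hypothetical names introduced for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class VoiceData:
    """Hypothetical container for voice data 430; any field may be absent."""
    voice: Optional[bytes] = None       # extracted voice of the object
    voice_info: Optional[dict] = None   # voice information (dialect, intonation, ...)
    image: Optional[bytes] = None       # image including the object


def resolve_voice_data(
    data: VoiceData,
    extract_voice: Callable[[bytes], bytes],  # assumed: image -> voice
    derive_info: Callable[[bytes], dict],     # assumed: voice -> voice information
) -> Tuple[bytes, dict]:
    """Return (voice, voice_info), filling in whatever the server did not send."""
    voice = data.voice
    if voice is None:
        if data.image is None:
            raise ValueError("voice data carries neither a voice nor an image")
        # Third variant: only an image was received, so extract the voice locally.
        voice = extract_voice(data.image)
    # Second variant: a voice without voice information, so derive it locally.
    info = data.voice_info if data.voice_info is not None else derive_info(voice)
    return voice, info
```

In this sketch the device only performs the work the server left undone, which mirrors the division of labour among the three embodiments.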

In operation 407, the electronic device 100 may provide a voice service based on the stored voice information. For example, based on the user's request for a book reading service, the electronic device 100 may provide the user with a schedule reminder in the voice of the object.

In another embodiment, when the electronic device 100 is to extract the voice of an object from an image stored in an internal memory of the electronic device 100 or in an external memory such as a USB drive, SSD, memory card, eMMC, or UFS device, the electronic device 100 may extract the voice of the object in advance by scanning the image without playing it. According to an embodiment, the electronic device 100 may quickly extract the voice of the object by using data stored in the internal or external memory.

FIG. 5 is an exemplary diagram of a method of receiving, from a server, voice data extracted by an external device, according to an embodiment.

Referring to FIG. 5, in an embodiment, the external device 530 may extract and store the voice of the object 550. When the external device 530 determines that the voice of the object is likely to be requested by another device, it may transmit the voice of the object to the server 520. The server 520 may receive, from the external device 530, the voice of the object 550 detected by the external device 530, and store it. In an embodiment, the electronic device 100 may receive, from the server 520, the voice of the object extracted by the external device 530. For example, the electronic device 100 may receive a modeling result for the voice of the object from the server 520, and store the voice and voice information of the object based on the received modeling result. As another example, the electronic device 100 may receive the voice and voice information of the object from the server 520 and store them.

In an embodiment, the external device 530 may extract the voice of the object 550 and transmit the voice data 570 to the server 520. The external device 530 may obtain information about the object from a user of the external device 530. Based on the obtained information about the object, the external device 530 may extract, from the image 540, the voice of the object 550 and voice information related to the voice of the object 550. The external device 530 may transmit the extracted voice and voice information of the object to the server 520.

In an embodiment, the electronic device 100 may obtain, from a user, information 510 about the object 550 whose voice is to be stored. For example, the user may say "Save Baek Jong-won's voice" to the virtual assistant of the electronic device 100, and the electronic device 100 may obtain the information 510 about the object "Baek Jong-won" from the input voice.

In an embodiment, the electronic device 100 may receive the voice information and voice 580 of the object 550 from the server 520. The server 520 may transmit the voice information and voice 580 to the electronic device 100 based on the voice data 570 extracted by the external device 530.

In an embodiment, the electronic device 100 may receive the voice information and voice 580 of the object from the server 520 and store them. According to an embodiment, when the object is a celebrity, the voice of the object need not be extracted separately by each of several devices; instead, the voice extracted by one device is stored in the server and other devices use the stored voice. This saves the time a device needs to obtain the voice, improves the accuracy of the extracted voice, and may improve the quality of the voice service.
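The sharing scheme of FIG. 5 resembles a look-up-before-extract cache. The sketch below is an assumption-laden illustration: `SharedVoiceStore`, `obtain_voice`, and `extract_locally` are hypothetical names, the store is an in-memory dictionary standing in for the server 520, and the upload is unconditional, whereas the disclosure has the external device upload only when it judges that another device is likely to request the voice.

```python
from typing import Callable, Dict, Optional, Tuple

Voice = Tuple[bytes, dict]  # (voice, voice information)


class SharedVoiceStore:
    """In-memory stand-in for the server-side store (assumed interface)."""

    def __init__(self) -> None:
        self._store: Dict[str, Voice] = {}

    def upload(self, name: str, voice: bytes, info: dict) -> None:
        self._store[name] = (voice, info)

    def fetch(self, name: str) -> Optional[Voice]:
        return self._store.get(name)


def obtain_voice(
    store: SharedVoiceStore,
    name: str,
    extract_locally: Callable[[str], Voice],  # assumed: slow on-device extraction
) -> Voice:
    """Reuse a voice another device already extracted; otherwise extract and share."""
    cached = store.fetch(name)
    if cached is not None:
        return cached
    voice, info = extract_locally(name)
    store.upload(name, voice, info)
    return voice, info
```

With this arrangement only the first device to request a given object pays the extraction cost; later devices receive the stored result.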

In an embodiment, the electronic device 100 may provide a voice service to the user by using the voice information and voice 580 of the object obtained from the server 520. In an embodiment, the electronic device 100 may provide a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information. For example, the user may ask the virtual assistant, "Read me the Alio Olio pasta recipe in Baek Jong-won's voice." Accordingly, based on the stored voice information and voice 580, the electronic device 100 may output information about the Alio Olio pasta recipe in Baek Jong-won's voice. For example, taking the text "알리오 올리오 파스타 만들기 시작하겠습니다" ("I will start making Alio Olio pasta") as the target text, the virtual assistant may determine, based on the stored voice information, that the corresponding output text is "알리오 올리오 파스타 만들어볼게유" (the same statement in Baek Jong-won's characteristic way of speaking). Accordingly, the virtual assistant may provide the determined output text, "알리오 올리오 파스타 만들어볼게유", to the user in Baek Jong-won's voice.
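One simple way to picture how an output text could be determined from a target text is a phrase substitution table kept as part of the voice information. This is a sketch under that assumption only; the disclosure does not specify the mechanism, and `determine_output_text` and the table layout are hypothetical.

```python
from typing import Dict


def determine_output_text(target_text: str, phrase_table: Dict[str, str]) -> str:
    """Rewrite standard phrasing into the speaker's characteristic phrasing.

    phrase_table maps a standard phrase to the phrase the speaker would use.
    Longer phrases are applied first so that a short entry cannot pre-empt
    a longer one that contains it.
    """
    output = target_text
    for standard in sorted(phrase_table, key=len, reverse=True):
        output = output.replace(standard, phrase_table[standard])
    return output


# Hypothetical voice information for the example speaker in the text.
baek_phrases = {"만들기 시작하겠습니다": "만들어볼게유"}
print(determine_output_text("알리오 올리오 파스타 만들기 시작하겠습니다", baek_phrases))
```

The resulting text would then be passed to the TTS stage to be spoken in the stored voice.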

In an embodiment, the external device 530 may transmit to the server 520 the image 540 including the object itself, rather than the voice data 570. In this case, the server 520 may obtain the voice of the object and the voice information related to the voice of the object from the received image 540, store them, and transmit the stored voice information and voice of the object to the external device 530 and the electronic device 100.

FIG. 6 is a flowchart illustrating a method of generating a text-to-speech (TTS) voice by modeling screens and voice from an image, according to an embodiment.

Referring to FIG. 6, in operation 610, the electronic device may obtain information about an object whose voice is to be stored. In an embodiment, the electronic device may receive from the user, as the information about the object, information such as the object's full name, photograph, title, trade name, portrait, stage name, pen name, or abbreviation, but the information about the object is not limited thereto.

In operation 620, the electronic device may obtain an image including the object. In an embodiment, the electronic device may obtain the image including the object based on the information received from the user. For example, the electronic device may obtain, as the image including the object, an image being played on the device, an image being played on an external device capable of communicating with the device, an image received from a server, or the like, but is not limited thereto.

In operations 630 and 640, the electronic device may capture the screen once per specific time period and record the voice at specific time intervals. For example, when the device detects the voice for 1 minute, the device may capture the image every second to obtain 60 screens, and record the voice in 1-second intervals to obtain 60 voice models. However, this is only an example: the device may detect the voice for longer or shorter than 1 minute, or may continue detecting the voice until the extracted voice of the object can be used for TTS. Likewise, the device may capture the screen at intervals longer or shorter than 1 second, and may record the voice in intervals longer or shorter than 1 second. That is, from the image including the object, the electronic device may obtain N screens, N voice models, and the times within the image at which the N screens and voice models were obtained.
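The capture schedule of operations 630 and 640 can be sketched as follows. The `Sample` type and `sampling_plan` function are hypothetical names for illustration; the sketch only computes the timestamps and does not capture any screen or audio.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    index: int
    frame_time_s: int    # time in the image at which the screen is captured
    audio_start_s: int   # start of the recorded voice chunk
    audio_end_s: int     # end of the recorded voice chunk (exclusive)


def sampling_plan(duration_s: int, interval_s: int) -> List[Sample]:
    """Plan N screen captures and N voice chunks over the detection window."""
    if duration_s <= 0 or interval_s <= 0:
        raise ValueError("duration and interval must be positive")
    n = duration_s // interval_s
    return [
        Sample(i, i * interval_s, i * interval_s, (i + 1) * interval_s)
        for i in range(n)
    ]


plan = sampling_plan(duration_s=60, interval_s=1)
print(len(plan))  # 60 screens and 60 voice chunks for a 1-minute window
```

Keeping the frame time and the audio interval in the same record is what later allows a frame judged to be "speaking" to be matched to its voice chunk.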

In operation 650, the electronic device may analyze faces in the N images obtained in operation 630 in order to detect whether the object is in a speech state. In an embodiment, when the image obtained in operation 620 includes persons other than the object, the electronic device may analyze the face of the object to determine, from the N images, the M images in which the object is present. That is, based on the information about the object obtained in operation 610, the electronic device may obtain N images from an image including a plurality of persons, recognize the object among the plurality of persons in the N images, and determine the M images that include the object.

In operation 660, the electronic device may store the N voice models obtained in operation 640. In an embodiment, the stored N voice models may be used by the device in operation 680 to model the voice corresponding to the speech state of the object.

In operation 670, the electronic device may detect whether the object is in a speech state. In an embodiment, the electronic device may detect whether the object is in a speech state from the M images determined in operation 650. The electronic device may determine from an image whether the object is in a speech state by considering the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, whether a caption is present in the image, a script of the image, and the like. The electronic device may determine, from the M images, the K images in which the object is determined to be in a speech state, and store the times at which the K images were captured.
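The narrowing from N captured images to M images containing the object and then to K images in which the object is speaking can be sketched as two filters. Everything here is illustrative: the `Frame` type is hypothetical, and its `people` and `speaking` fields stand in for the outputs of a face recognizer and a speech-state detector that the sketch does not implement.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Frame:
    time_s: int
    people: FrozenSet[str]     # persons recognized in the image (assumed input)
    speaking: FrozenSet[str]   # persons judged to be speaking (assumed input)


def speaking_times(frames: List[Frame], target: str) -> List[int]:
    """Return the capture times of the K images in which the target speaks."""
    with_target = [f for f in frames if target in f.people]       # N -> M
    speaking = [f for f in with_target if target in f.speaking]   # M -> K
    return [f.time_s for f in speaking]


frames = [
    Frame(0, frozenset({"host"}), frozenset({"host"})),
    Frame(1, frozenset({"host", "target"}), frozenset({"host"})),
    Frame(2, frozenset({"host", "target"}), frozenset({"target"})),
    Frame(3, frozenset({"target"}), frozenset({"target"})),
]
print(speaking_times(frames, "target"))  # [2, 3]
```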

In operation 680, the electronic device may perform voice modeling for the speech state by using, from among the N voice models stored in operation 660, the K voice models corresponding to the K images obtained in operation 670. For example, when the K images determined in operation 670 to be in a speech state were captured 2 to 5 seconds and 10 to 19 seconds after playback of the image began, the electronic device may perform voice modeling using those of the N voice models that recorded the 2-to-6-second section and the 10-to-20-second section. However, this is only an example, and the voice modeling method is not limited thereto.
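The mapping in this example, from capture times of speaking images to recorded voice sections, can be sketched as interval merging. The sketch assumes, as in the example, that each image captured at time t corresponds to the voice chunk recorded over [t, t + interval); the function name `speaking_segments` is hypothetical.

```python
from typing import Iterable, List, Tuple


def speaking_segments(frame_times: Iterable[int], interval_s: int = 1) -> List[Tuple[int, int]]:
    """Merge per-image voice chunks into contiguous (start, end) sections."""
    segments: List[Tuple[int, int]] = []
    for t in sorted(set(frame_times)):
        start, end = t, t + interval_s
        if segments and start <= segments[-1][1]:
            # Chunk touches or overlaps the previous section: extend it.
            segments[-1] = (segments[-1][0], max(segments[-1][1], end))
        else:
            segments.append((start, end))
    return segments


# Images captured at 2..5 s and 10..19 s, as in the example above.
times = list(range(2, 6)) + list(range(10, 20))
print(speaking_segments(times))  # [(2, 6), (10, 20)]
```

The section ends one interval after the last speaking image because that image's voice chunk runs until the next capture time.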

In an embodiment, in the process of modeling the voice corresponding to the speech state, the electronic device may additionally consider a script, captions, and the like corresponding to the image including the object. For example, the electronic device may receive a script corresponding to the image from a server and consider the script when determining when the object speaks, may read the captions of the image using OCR and consider them when determining when the object speaks, or may convert the audio in the image into text and consider that text when determining when the object speaks. However, the factors additionally considered in the process of modeling the voice corresponding to the speech state of the object are not limited thereto.

In operation 690, the electronic device may generate a text-to-speech (TTS) voice based on the voice modeling performed in operation 680. The electronic device may provide various voice services to the user by using the generated TTS voice.

In an embodiment, some or all of operations 610 to 690 may be performed by a server rather than by the electronic device, and the electronic device may receive and use the results or data of those operations from the server. The server may include a speech-to-text (STT) server and a text-to-speech (TTS) server, or may include a single server capable of performing both the STT function and the TTS function. For example, the STT function is a speech recognition function that converts the spoken language of a person into text data, and the TTS function is a speech synthesis function that generates, from text, the sound waves of speech corresponding to the text.

FIG. 7 is an exemplary diagram of a method of detecting the speech state of an object in consideration of additional information of an image, according to an embodiment.

Referring to FIG. 7, the electronic device may detect a speech state by analyzing the face of the object included in a plurality of images. In an embodiment, the first image 710 includes a first caption 715, the second image 720 and the third image 730 include no caption, and the fourth image 740 includes a second caption 745. In addition, the mouth of the object is closed in the first image 710 and the third image 730, and open in the second image 720 and the fourth image 740.

For example, the electronic device may recognize that the object in the first image 710 is not in a speech state. The electronic device may determine this by analyzing the shape of the object's mouth, the movement of the lips, the degree of tooth exposure, and the like. As another example, the electronic device may determine that the object in the second image 720 is in a speech state. By analyzing the shape of the object's mouth, the movement of the lips, the degree of tooth exposure, and the like, the electronic device may determine that the object in the second image 720 is in a speech state because its mouth is open. As yet another example, the electronic device may recognize that the object in the third image 730 is not in a speech state. The electronic device may determine this in consideration of the shape of the object's mouth, the movement of the lips, the degree of tooth exposure, the absence of a caption in the image, and the like. As yet another example, the electronic device may recognize that the object in the fourth image 740 is in a speech state. The electronic device may determine this in consideration of the shape of the object's mouth, the movement of the lips, the degree of tooth exposure, the presence of the second caption 745 in the image, and the like.
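The four decisions of FIG. 7 are consistent with a rule in which the mouth shape is the primary cue and the caption is supporting evidence. The sketch below encodes that reading as a weighted score; the weights (0.7 and 0.3), the threshold (0.5), and the names `FrameCues` and `is_speaking` are assumptions chosen only so that the sketch reproduces the four outcomes described above, and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameCues:
    mouth_open: bool       # result of mouth-shape / lip / teeth analysis (assumed input)
    caption_present: bool  # whether a caption was found in the image (assumed input)


MOUTH_WEIGHT = 0.7    # assumed: mouth shape is the primary cue
CAPTION_WEIGHT = 0.3  # assumed: a caption alone is not sufficient
THRESHOLD = 0.5       # assumed decision threshold


def is_speaking(cues: FrameCues) -> bool:
    """Combine visual and caption cues into a speech-state decision."""
    score = MOUTH_WEIGHT * cues.mouth_open + CAPTION_WEIGHT * cues.caption_present
    return score >= THRESHOLD


images = {
    710: FrameCues(mouth_open=False, caption_present=True),
    720: FrameCues(mouth_open=True, caption_present=False),
    730: FrameCues(mouth_open=False, caption_present=False),
    740: FrameCues(mouth_open=True, caption_present=True),
}
for number, cues in images.items():
    print(number, is_speaking(cues))
```

A practical detector would more plausibly use lip movement across consecutive images rather than a single still image, since a caption may also appear while a different person is speaking.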

In an embodiment, in the process of modeling the voice corresponding to the speech state, the electronic device may additionally consider a script corresponding to the image including the object, a screenplay or captions included in advance EPG information or the like, and so on. For example, the electronic device may receive a script corresponding to the image from a server and consider the script when determining when the object speaks, may read the captions of the image using OCR and consider them when determining when the object speaks, or may convert the audio in the image into text and consider that text when determining when the object speaks. However, the factors additionally considered in the process of modeling the voice corresponding to the speech state of the object are not limited thereto.

FIG. 8 is an exemplary diagram of a method of extracting a voice and having it verified in order to provide a voice service, according to an embodiment.

Referring to FIG. 8, the electronic device may be a speaker 820 that communicates with the user's smartphone 810 through wired or wireless communication (for example, Bluetooth). The user may give the command "Save Baek Jong-won's voice" to the speaker 820 operating as a virtual assistant. The speaker 820 operating as a virtual assistant may receive the user's command and cause the smartphone 810 to extract the voice of the object. In addition, the speaker 820 may receive the extracted voice information and voice from the smartphone 810, and perform a process of confirming with the user whether the extracted voice is indeed the voice of the object. The speaker 820 may then provide a voice service to the user by using the extracted voice of the object. However, the smartphone 810 and the speaker 820 are only examples, and the process may be carried out using electronic devices other than the smartphone 810 and/or the speaker 820. For example, the electronic device may be a smartphone, tablet PC, PC, smart TV, mobile phone, personal digital assistant (PDA), laptop, media player, micro server, global positioning system (GPS) device, e-book reader, digital broadcasting terminal, navigation device, kiosk, MP3 player, digital camera, home appliance, or other mobile or non-mobile computing device, but is not limited thereto. The electronic device may also be a wearable device having a communication function and a data processing function, such as a watch, glasses, a hair band, or a ring.

The electronic device may obtain information about an object whose voice is to be stored. In an embodiment, the electronic device may receive from the user, as the information about the object, information such as the object's full name, photograph, title, trade name, portrait, stage name, pen name, or abbreviation. The electronic device may obtain an image including the object based on the information received from the user. For example, the electronic device may obtain, as the image including the object, an image being played on the device, an image being played on an external device capable of communicating with the device, an image received from a server, or the like, but is not limited thereto. For example, when the message "Save Baek Jong-won's voice" is received from the user, the electronic device may obtain, from the received message, information about the object corresponding to "Baek Jong-won".

The electronic device may start an operation of storing the voice of the object. In an embodiment, based on the information about the object, the electronic device may obtain an image being played on the electronic device, an image being played on an external device communicating with the electronic device, an image received from a server, or the like, but the types of image that may be obtained are not limited thereto. The electronic device may detect from the obtained image whether the object is in a speech state, and extract the voice of the object in the sections detected as being in a speech state. As a result of modeling the extracted voice, the electronic device may obtain the voice of the object and voice information related to the voice of the object. In an embodiment, the voice information related to the voice of the object may include the object's dialect, intonation, catchphrases, manner of speaking, pronunciation, parodies, famous lines, and the like, but is not limited thereto.

The electronic device may ask the user to confirm whether the acquired voice of the object is the voice the user requested. In an embodiment, the electronic device may play back the voice of the object and then send a message requesting confirmation, or may send the confirmation request in the voice of the object itself. For example, after playing back Baek Jong-won's voice, the electronic device may output, in a machine voice, "Is this Baek Jong-won's voice?" As another example, based on dialect information, held as Baek Jong-won's voice information, indicating that he says "맞쥬?" (a dialect form) instead of "맞습니까?" ("Is that right?"), the electronic device may output "백종원 목소리 맞쥬?" ("This is Baek Jong-won's voice, right?") in Baek Jong-won's voice.

FIG. 9 is an exemplary diagram of a method of storing the voice of an object specified by the user from an image including a plurality of people, according to an embodiment.

Referring to FIG. 9, the electronic device may obtain information about an object from the user. For example, a virtual assistant of the electronic device may receive from the user a voice input recognized as "Save BTS Jungkook's voice". The electronic device may obtain information about the object "BTS Jungkook" from the received voice input. That is, the information about the object may be the name of the object 906, but is not limited thereto, and may be the full name, photo, title, trade name, portrait, stage name, pen name, abbreviation, or the like of the object 906.

In operation 920, the electronic device may acquire the image 900 and analyze it. In an embodiment, based on the information about the object, the electronic device may acquire an image being played on the electronic device, an image being played on an external device capable of communicating with the electronic device, an image received from a server, or the like. The acquired image 900 may include a plurality of people 902, 904, 906, and 908. The electronic device may analyze the acquired image 900 to identify the object 906 among the plurality of people 902, 904, 906, and 908. In an embodiment, the electronic device may capture, from the acquired image 900, a plurality of still frames that include the object 906. In another embodiment, the electronic device may capture the acquired image 900 at regular time intervals to obtain N still frames, and then determine, among the N captured frames, the M frames that include the object 906.
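The capture-then-filter step of operation 920 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names, the fixed capture interval, and the `contains_object` predicate (standing in for whatever recognizer identifies the object 906) are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

Frame = TypeVar("Frame")


def sample_timestamps(duration_s: float, interval_s: float) -> List[float]:
    """Capture times 0, interval, 2*interval, ... strictly below duration_s."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    times: List[float] = []
    k = 0
    while k * interval_s < duration_s:
        times.append(k * interval_s)
        k += 1
    return times


def frames_with_object(
    frames: List[Tuple[float, Frame]],
    contains_object: Callable[[Frame], bool],
) -> List[Tuple[float, Frame]]:
    """Keep the M of N captured frames in which the target object appears.

    `contains_object` is a hypothetical recognizer supplied by the caller.
    """
    return [(t, frame) for t, frame in frames if contains_object(frame)]


# Toy data: each "frame" is just the set of person labels visible in it.
times = sample_timestamps(duration_s=2.0, interval_s=0.5)
visible = [{902, 904}, {902, 906}, {906, 908}, {904}]
captured = list(zip(times, visible))            # N = 4 captured frames
selected = frames_with_object(captured, lambda frame: 906 in frame)  # M = 2
```

Here the N = 4 captured frames reduce to the M = 2 frames in which the label 906 is visible; a real system would replace the set-membership test with face or person recognition.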

In operation 930, the electronic device may determine, among the plurality of still frames that include the object 906, the sections in which the object is in a speech state. In an embodiment, the electronic device may detect from the image whether the object is speaking by considering the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, whether the image has subtitles, a script of the image, and the like.

In an embodiment, the electronic device may extract the voice of the object in the speech-state sections, that is, the sections in which the object is detected to be speaking. Also, as a result of modeling the extracted voice, the electronic device may obtain voice information related to the voice of the object and the voice of the object.
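One way to picture how per-frame speech-state decisions become the sections from which audio is extracted is sketched below in Python. The per-frame flags are assumed to come from some detector based on the cues listed above (lip movement, mouth shape, and so on); that detector, the function name, and the fixed frame period are assumptions, not details given in this disclosure.

```python
from typing import List, Optional, Sequence, Tuple


def speech_intervals(
    flags: Sequence[Tuple[float, bool]],
    frame_period_s: float,
) -> List[Tuple[float, float]]:
    """Merge consecutive 'speaking' frames into (start, end) time sections.

    flags: (timestamp, is_speaking) pairs in time order, one per captured frame.
    Each speaking frame is assumed to cover [timestamp, timestamp + frame_period_s).
    """
    intervals: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last = 0.0
    for t, speaking in flags:
        if speaking:
            if start is None:
                start = t          # a new speech-state section begins
            last = t
        elif start is not None:
            intervals.append((start, last + frame_period_s))
            start = None           # the section ended on the previous frame
    if start is not None:          # close a section that runs to the end
        intervals.append((start, last + frame_period_s))
    return intervals


flags = [(0.0, False), (0.5, True), (1.0, True), (1.5, False), (2.0, True)]
sections = speech_intervals(flags, frame_period_s=0.5)
```

The resulting sections are the time ranges whose audio would then be cut out and passed to voice modeling.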

In operation 940, the electronic device may compare the result of modeling the extracted voice with voices in a database. According to an embodiment, when the object is a celebrity, comparing the extracted voice with the voice stored in the database for that object may improve the accuracy of the voice and the quality of the voice service.

For example, the electronic device may distinguish a first person 902, a second person 904, a third person 906, and a fourth person 908 in a still frame that includes the plurality of people 902, 904, 906, and 908. Because four people are present, the probability that the voice extracted from the section corresponding to the time at which the frame was captured is the voice of the third person 906, who corresponds to the object, may be 25%. The electronic device may also detect whether the third person 906 is in a speech state by considering the movement of the object's lips, the shape of the object's mouth, recognition of the object's teeth, whether the image has subtitles, a script of the image, whether a microphone is present, and the like. When the electronic device determines that the mouths of the first person 902 and the third person 906 are open while the mouths of the second person 904 and the fourth person 908 are closed, the probability that the voice extracted from the section in which the frame was captured is the voice of the third person 906 may be 60%. The electronic device may further compare the extracted voice with voices in the database. As a result of that comparison, the probability that the third person 906 corresponds to the object may be 90%. When the probability that a specific person's voice corresponds to the voice of the object exceeds a threshold, the electronic device may store the extracted voice as the voice of the object. In an embodiment, the electronic device may store voice information related to the voice of the object together with the voice of the object.
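The staged refinement in this example (25%, then 60%, then 90%) followed by a threshold test can be sketched in Python as follows. The disclosure does not say how the 60% and 90% figures are computed, nor what the threshold is, so the sketch simply takes the stage estimates as given inputs; the 0.8 threshold and all names are assumptions for illustration only.

```python
from typing import List, Optional, Tuple


def uniform_prior(num_people: int) -> float:
    """With no other evidence, each visible person is equally likely to be the speaker."""
    if num_people <= 0:
        raise ValueError("num_people must be positive")
    return 1.0 / num_people


def first_stage_over_threshold(
    stages: List[Tuple[str, float]],
    threshold: float,
) -> Optional[str]:
    """Return the name of the first stage whose estimate exceeds the threshold.

    Returns None if no stage is confident enough, in which case the
    extracted voice would not be stored as the object's voice.
    """
    for name, probability in stages:
        if probability > threshold:
            return name
    return None


# Stage estimates taken from the example in the text (the values are given, not computed).
stages = [
    ("prior over visible people", uniform_prior(4)),  # 0.25
    ("mouth-shape analysis", 0.60),
    ("database comparison", 0.90),
]
THRESHOLD = 0.8  # assumed value; the text only says "a threshold"
decided_at = first_stage_over_threshold(stages, THRESHOLD)
store_voice = decided_at is not None
```

With these inputs, only the database comparison pushes the estimate past the assumed threshold, so the voice is stored after that stage.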

FIG. 10 is a block diagram illustrating a configuration of an electronic device according to an embodiment.

Referring to FIG. 10, the electronic device 1000 may include a communication unit 1020, a processor 1040, and a memory 1060. However, not all of the components shown in FIG. 10 are essential components of the electronic device 1000. The electronic device 1000 may be implemented with more components than those shown in FIG. 10, or with fewer. In addition, the communication unit 1020, the processor 1040, and the memory 1060 may be implemented as a single chip.

In an embodiment, the communication unit 1020 may communicate with another electronic device connected to the electronic device 1000 by wire or wirelessly. For example, the communication unit 1020 may transmit an image being played on the electronic device 1000 to an external device or a server, receive an image from an external device or a server, and exchange data with an external device or a server. In an embodiment, the communication unit 1020 may transmit an image or still frame to an externally connected display device, either through an output port that outputs video/audio signals for display or through wireless communication. Such output ports may be HDMI, DP, Thunderbolt, or the like, which carry video and audio signals together, or may include ports that output the video signal and the audio signal separately.

In an embodiment, the processor 1040 controls the overall operation of the electronic device 1000 and may include at least one processor such as a CPU or a GPU. The processor 1040 may control the other components of the electronic device 1000 so that they perform the operations of the electronic device 1000. For example, the processor 1040 may execute a program stored in the memory 1060, read a stored file, or store a new file. In an embodiment, the processor 1040 may perform the operations of the electronic device 1000 by executing a program stored in the memory 1060. For example, the processor 1040 may obtain information about an object whose voice is to be stored; based on the obtained information, extract the voice of the object by detecting the speech state of the object from an image including the object; store, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object; and provide a voice service based on the stored voice information.

In an embodiment, programs such as applications and various types of data such as files may be installed and stored in the memory 1060. The processor 1040 may access and use data stored in the memory 1060, or may store new data in the memory 1060. In an embodiment, the memory 1060 may include a database. In an embodiment, the memory 1060 may store voice information related to the voice of the object and the voice of the object.

Although not shown in FIG. 10, the electronic device 1000 may further include a sensor unit, a display, an antenna, a sensing unit, an input/output unit, a video processing unit, an audio processing unit, an audio output unit, a user input unit, and the like.

FIG. 11 is a block diagram illustrating a detailed configuration of an electronic device according to an embodiment.

Referring to FIG. 11, the electronic device may include a processor 1100, a user input unit 1110, a display 1120, a video processing unit 1130, an audio processing unit 1140, an audio output unit 1150, a sensing unit 1160, a communication unit 1170, an input/output unit 1180, and a memory 1190. However, not all of the components shown in FIG. 11 are essential components of the electronic device. The electronic device may be implemented with more components than those shown in FIG. 11, or with fewer. In addition, the processor 1100, the user input unit 1110, the display 1120, the video processing unit 1130, the audio processing unit 1140, the audio output unit 1150, the sensing unit 1160, the communication unit 1170, the input/output unit 1180, and the memory 1190 may be implemented as a single chip.

In an embodiment, the processor 1100, the communication unit 1170, and the memory 1190 may correspond to the processor 1040, the communication unit 1020, and the memory 1060 described above with reference to FIG. 10. Accordingly, descriptions that overlap with those given for FIG. 10 are omitted here.

In an embodiment, the user input unit 1110 refers to a means by which the user inputs data for controlling the electronic device. For example, the user input unit 1110 may include a key pad, a dome switch, a touch pad, a jog wheel, a jog switch, and the like, but is not limited thereto.

In an embodiment, the display 1120 may display an image on a screen under the control of the processor 1100. The image or still frame displayed on the screen may be received from the communication unit 1170, the input/output unit 1180, or the memory 1190.

In an embodiment, the video processing unit 1130 processes the image data to be displayed by the display 1120, and may perform various image processing operations on the image data, such as decoding, rendering, scaling, noise filtering, frame rate conversion, and resolution conversion.

In an embodiment, the audio processing unit 1140 processes audio data. The audio processing unit 1140 may perform various kinds of processing on the audio data, such as decoding, amplification, and noise filtering.

In an embodiment, under the control of the processor 1100, the audio output unit 1150 may output audio included in a broadcast signal received through a tuner unit, audio input through the communication unit 1170 or the input/output unit 1180, and audio stored in the memory 1190. The audio output unit 1150 may include at least one of a speaker 1152, a headphone output terminal 1154, or an S/PDIF (Sony/Philips Digital Interface) output terminal 1156.

In an embodiment, the sensing unit 1160 senses the user's voice, the user's image, or the user's interaction, and may include a microphone 1162, a camera 1164, and a light receiving unit 1166. In an embodiment, the sensing unit 1160 may sense the user's voice in order to generate a text-to-speech (TTS) voice of the user, or may receive a voice command from the user in order to obtain information about an object whose voice is to be stored and generate a TTS voice of that object.

In an embodiment, the microphone 1162 receives the user's uttered voice. The microphone 1162 may convert the received voice into an electrical signal and output it to the processor 1100. In an embodiment, by receiving the user's uttered voice and converting it into an electrical signal, the microphone 1162 may enable the processor to generate a text-to-speech (TTS) voice of the user.

In an embodiment, the camera 1164 may receive an image (for example, consecutive frames) corresponding to the user's motion, including gestures, within the camera's recognition range. The light receiving unit 1166 receives an optical signal (including a control signal) from a remote control device. The light receiving unit 1166 may receive, from the remote control device, an optical signal corresponding to a user input (for example, a touch, a press, a touch gesture, a voice, or a motion). A control signal may be extracted from the received optical signal under the control of the processor 1100.

In an embodiment, the communication unit 1170 may include one or more modules that enable wireless communication between the electronic device and a wireless communication system, or between the electronic device and a network in which another electronic device is located. For example, the communication unit 1170 may include a short-range communication unit 1172, a mobile communication unit 1174, a broadcast receiving unit 1176, a wireless Internet module (not shown), and the like. The communication unit 1170 may also be referred to as a transceiver.

The communication unit 1170 according to an embodiment may use the wireless Internet module, the short-range communication unit 1172, or the like to connect to other external devices or to transmit and receive video/audio data.

In an embodiment, the short-range communication unit 1172 may refer to a module for short-range communication. Short-range communication technologies that may be used include Bluetooth, RFID (Radio Frequency Identification), infrared communication (IrDA, Infrared Data Association), UWB (Ultra Wideband), ZigBee, and the like.

In an embodiment, the mobile communication unit 1174 may transmit and receive wireless signals to and from at least one of a base station, an external terminal, and a server on a mobile communication network. The wireless signals may include a voice call signal, a video call signal, or various types of data associated with sending and receiving text/multimedia messages.

In an embodiment, the broadcast receiving unit 1176 may receive a broadcast signal and/or broadcast-related information from an external broadcast management server through a broadcast channel. The broadcast signal may include a TV broadcast signal, a radio broadcast signal, and a data broadcast signal, as well as a broadcast signal in which a data broadcast signal is combined with a TV broadcast signal or a radio broadcast signal.

In an embodiment, the wireless Internet module refers to a module for wireless Internet access, and may be built into the device or attached externally. Wireless Internet technologies that may be used include WLAN (Wireless LAN, Wi-Fi), WiBro (Wireless Broadband), WiMAX (Worldwide Interoperability for Microwave Access), HSDPA (High Speed Downlink Packet Access), and the like. Through the wireless Internet module, the electronic device can establish a Wi-Fi P2P (peer-to-peer) connection with another electronic device. Such a Wi-Fi P2P connection can be used to provide a streaming service between devices, to transmit and receive data, or to connect to a printer and provide a printing service.

In an embodiment, under the control of the processor 1100, the input/output unit 1180 may transmit and receive video (for example, moving pictures), audio (for example, voice and music), and additional information (for example, an EPG) to and from outside the electronic device. The input/output unit 1180 may include one of an HDMI (High-Definition Multimedia Interface) port 1182, a component jack 1184, a PC port 1186, and a USB port 1188. The input/output unit 1180 may also include a combination of the HDMI port 1182, the component jack 1184, the PC port 1186, and the USB port 1188.

In an embodiment, programs such as applications and various types of data such as files may be installed and stored in the memory 1190. The processor 1100 may access and use data stored in the memory 1190, or may store new data in the memory 1190. In an embodiment, the memory 1190 may include a database. In an embodiment, the memory 1190 may store, as a result of the processor of the electronic device modeling the extracted voice, voice information related to the voice of the object and the voice of the object.

An embodiment may also be implemented in the form of a recording medium containing computer-readable, computer-executable instructions, such as a program module executed by a computer. A computer-readable medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The disclosed embodiments may be implemented as a S/W program that includes instructions stored in computer-readable storage media.

The computer is a device capable of calling stored instructions from a storage medium and operating according to the disclosed embodiments in accordance with the called instructions, and may include an electronic device according to the disclosed embodiments.

The computer-readable storage medium may be provided in the form of a non-transitory storage medium. Here, "non-transitory" means only that the storage medium does not include a signal and is tangible; it does not distinguish between data being stored on the storage medium semi-permanently or temporarily.

In addition, the control method according to the disclosed embodiments may be provided as part of a computer program product. A computer program product can be traded between a seller and a buyer as a commodity.

The computer program product may include a S/W program and a computer-readable storage medium in which the S/W program is stored. For example, the computer program product may include a product in the form of a S/W program (for example, a downloadable app) that is distributed electronically by the device manufacturer or through an electronic market (for example, Google Play Store or App Store). For electronic distribution, at least part of the S/W program may be stored in a storage medium or generated temporarily. In this case, the storage medium may be a storage medium of the manufacturer's server, of the electronic market's server, or of a relay server that temporarily stores the S/W program.

In a system composed of a server and a device, the computer program product may include a storage medium of the server or a storage medium of the device. Alternatively, when there is a third device (for example, a smartphone) communicatively connected to the server or the device, the computer program product may include a storage medium of the third device. Alternatively, the computer program product may include the S/W program itself, transmitted from the server to the device or the third device, or from the third device to the device.

In this case, one of the server, the device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments. Alternatively, two or more of the server, the device, and the third device may execute the computer program product to perform the method according to the disclosed embodiments in a distributed manner.

For example, a server (for example, a cloud server or an artificial intelligence server) may execute a computer program product stored on the server and thereby control a device communicatively connected to the server to perform the method according to the disclosed embodiments.

As another example, the third device may execute the computer program product and thereby control a device communicatively connected to the third device to perform the method according to the disclosed embodiments. When the third device executes the computer program product, the third device may download the computer program product from the server and execute the downloaded computer program product. Alternatively, the third device may perform the method according to the disclosed embodiments by executing a computer program product provided in a preloaded state.

In addition, in the present disclosure, a "unit" may be a hardware component such as a processor or a circuit, and/or a software component executed by a hardware component such as a processor.

The foregoing description of the present disclosure is for illustrative purposes, and those of ordinary skill in the art to which the present disclosure pertains will understand that it can easily be modified into other specific forms without changing the technical spirit or essential features of the present disclosure. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present disclosure is indicated by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present disclosure.

Claims

A method of operating an electronic device, the method comprising:
obtaining information about an object whose voice is to be stored;
extracting the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information about the object;
storing, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object; and
providing a voice service based on the stored voice information.

The method of claim 1,
further comprising obtaining the image including the object, based on the obtained information about the object, from among images including at least one of an image being played on the device, an image being played on an external device communicating with the device, and an image stored in the external device.

The method of claim 1,
further comprising obtaining an image including a plurality of people,
wherein the extracting of the voice of the object comprises:
recognizing the object in the obtained image based on the obtained information about the object; and
extracting the voice of the object by detecting the utterance state of the recognized object from the obtained image.

The method of claim 1, wherein the providing of the voice service comprises
providing a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information.

The method of claim 1, wherein the voice information related to the voice of the object
includes at least one of a dialect, intonation, a buzzword, a tone, pronunciation, a parody, and dialogue of the voice of the object.

The method of claim 1,
further comprising receiving the image including the object from a server based on the obtained information about the object,
wherein the extracting of the voice of the object comprises
reproducing the image received from the server at double speed, thereby detecting the utterance state of the object and extracting the voice of the object at double speed.

The method of claim 1, wherein the extracting of the voice of the object comprises:
analyzing the image to obtain a screen on which the face of the object appears and a time at which the face of the object appears; and
extracting the voice of the object by detecting the utterance state of the object from the image corresponding to the obtained time, in consideration of at least one of movement of the lips of the object, the shape of the mouth of the object, recognition of the teeth of the object, and a script of the image.

The method of claim 1, wherein the storing of the voice information and the voice of the object comprises:
receiving a modeling result for the voice of the object from a server; and
storing the voice information and the voice of the object by reflecting the modeling result received from the server.

The method of claim 8, wherein the modeling result is
a modeling result that was extracted by another device and stored in the server as a result of determining that the voice of the object is likely to be requested by the other device or by the device.

The method of claim 1, further comprising:
receiving, from a user, a response as to whether the extracted voice is the same as the voice of the object; and
determining whether the extracted voice is the same as the voice of the object based on the received response,
wherein the storing of the voice information and the voice of the object comprises
storing the voice information and the voice of the object as a result of determining that the extracted voice is the same as the voice of the object.

An electronic device comprising:
a communication unit;
a memory storing one or more instructions; and
at least one processor configured to execute the one or more instructions stored in the memory,
wherein the at least one processor is configured to:
obtain information about an object whose voice is to be stored,
extract the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information about the object,
store, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object, and
provide a voice service based on the stored voice information.

The electronic device of claim 11, wherein the at least one processor is configured to
obtain the image including the object, based on the obtained information about the object, from among images including at least one of an image being played on the device, an image being played on an external device communicating with the device, and an image stored in the external device.

The electronic device of claim 11, wherein the at least one processor is configured to:
obtain an image including a plurality of people,
recognize the object in the obtained image based on the obtained information about the object, and
extract the voice of the object by detecting the utterance state of the recognized object from the obtained image.

The electronic device of claim 11, wherein the at least one processor is configured to
provide a voice service that outputs, in the voice of the object, an output text determined based on the stored voice information.

The electronic device of claim 11, wherein the voice information related to the voice of the object
includes at least one of a dialect, intonation, a buzzword, a tone, pronunciation, a parody, and dialogue of the voice of the object.

The electronic device of claim 11, wherein the at least one processor is configured to:
receive the image including the object from a server based on the obtained information about the object, and
reproduce the image received from the server at double speed, thereby detecting the utterance state of the object and extracting the voice of the object at double speed.

The electronic device of claim 11, wherein the at least one processor is configured to:
analyze the image to obtain a screen on which the face of the object appears and a time at which the face of the object appears, and
extract the voice of the object by detecting the utterance state of the object from the image corresponding to the obtained time, in consideration of at least one of movement of the lips of the object, the shape of the mouth of the object, recognition of the teeth of the object, and a script of the image.

The electronic device of claim 11, wherein the at least one processor is configured to:
receive a modeling result for the voice of the object from a server, and
store the voice information and the voice of the object by reflecting the modeling result received from the server.

The electronic device of claim 18, wherein the modeling result is
a modeling result that was extracted by another device and stored in the server as a result of determining that the voice of the object is likely to be requested by the other device or by the device.

A computer-readable recording medium storing one or more programs,
wherein the one or more programs include instructions that, when executed by one or more processors of an electronic device, cause the electronic device to:
obtain information about an object whose voice is to be stored,
extract the voice of the object by detecting an utterance state of the object from an image including the object, based on the obtained information about the object,
store, as a result of modeling the extracted voice, voice information related to the voice of the object and the voice of the object, and
provide a voice service based on the stored voice information.
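
For illustration only, the utterance-state detection recited in the claims (finding the times at which the face of the object appears and its lips move, then extracting the corresponding audio, optionally while the image is reproduced at double speed) can be sketched in Python as follows. This is a minimal sketch under stated assumptions and not the implementation of the disclosure: the Frame record, the thresholds, the function names, and the rule that media time equals playback time multiplied by playback speed are all hypothetical, and the per-frame face and lip measurements are assumed to come from a separate image analyzer that is not shown.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """Per-frame analysis result (hypothetical; produced by an analyzer not shown)."""
    time_s: float        # timestamp of the frame, in playback time (seconds)
    face_present: bool   # whether the face of the target object appears
    lip_opening: float   # normalized mouth opening, 0.0 (closed) to 1.0 (open)


def utterance_intervals(
    frames: List[Frame],
    motion_threshold: float = 0.05,  # assumed minimum lip change that counts as speaking
    max_gap_s: float = 0.3,          # assumed pause length that still joins two intervals
) -> List[Tuple[float, float]]:
    """Return (start, end) playback times where the face is visible and the lips move."""
    intervals: List[Tuple[float, float]] = []
    prev = None
    for frame in frames:
        # Lip movement is only evaluated between two consecutive frames that both show the face.
        if prev is not None and prev.face_present and frame.face_present:
            if abs(frame.lip_opening - prev.lip_opening) >= motion_threshold:
                if intervals and prev.time_s - intervals[-1][1] <= max_gap_s:
                    # Close enough to the previous interval: extend it.
                    intervals[-1] = (intervals[-1][0], frame.time_s)
                else:
                    intervals.append((prev.time_s, frame.time_s))
        prev = frame
    return intervals


def to_sample_ranges(
    intervals: List[Tuple[float, float]],
    sample_rate: int,
    playback_speed: float = 1.0,
) -> List[Tuple[int, int]]:
    """Map playback-time intervals to sample index ranges in the original audio track.

    Assumption: a moment observed at playback time t while reproducing at
    `playback_speed` corresponds to media time t * playback_speed.
    """
    return [
        (
            int(round(start * playback_speed * sample_rate)),
            int(round(end * playback_speed * sample_rate)),
        )
        for start, end in intervals
    ]


if __name__ == "__main__":
    demo_frames = [
        Frame(0.0, True, 0.0),
        Frame(0.5, True, 0.4),
        Frame(1.0, True, 0.0),
        Frame(1.5, True, 0.0),
        Frame(2.0, False, 0.5),
        Frame(2.5, True, 0.5),
        Frame(3.0, True, 0.1),
    ]
    detected = utterance_intervals(demo_frames)
    print(detected)
    print(to_sample_ranges(detected, sample_rate=16000, playback_speed=2.0))
```

The sample ranges returned by to_sample_ranges would then select the portions of the audio track to pass to voice modeling and storage. A practical detector would likely combine lip movement with the other cues named in the claims, such as mouth shape, teeth recognition, or the script of the image.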