KR20200105700A

KR20200105700A - Voice effects based on facial expressions

Info

Publication number: KR20200105700A
Application number: KR1020207022657A
Authority: KR
Inventors: Sean A. Ramprashad; Carlos M. Avendano; Aram M. Lindahl
Original assignee: Apple Inc.
Priority date: 2018-02-28
Filing date: 2019-02-26
Publication date: 2020-09-08
Also published as: CN111787986A; WO2019168834A1; WO2020013891A1; CN112512649A; DE112019001058T5; KR102367143B1

Abstract

Embodiments of the present disclosure can provide techniques for adjusting audio and/or video information of a video clip based at least in part on facial feature characteristics and/or voice feature characteristics extracted from hardware components. For example, in response to detecting a request to generate an avatar video clip of a virtual avatar, a video signal associated with a face in the field of view of a camera, and an audio signal, may be captured. Voice feature characteristics and facial feature characteristics may be extracted from the audio signal and the video signal, respectively. In some examples, in response to detecting a request to preview the avatar video clip, an adjusted audio signal may be generated based at least in part on the facial feature characteristics and the voice feature characteristics, and a preview of the video clip of the virtual avatar using the adjusted audio signal may be displayed.

Description

Voice effects based on facial expressions

Cross-Reference to Related Applications

This application claims the benefit of priority from U.S. Provisional Patent Application No. 15/908,603, filed February 28, 2018 and entitled "Voice Effects Based on Facial Expressions," and from U.S. Continuation-in-Part Patent Application No. 16/033,111, filed July 11, 2018 and entitled "Techniques for Providing Audio and Video Effects," the disclosures of which are incorporated herein by reference in their entirety for all purposes.

Multimedia content, such as emojis, can be sent as part of messaging communications. Emojis can represent a variety of predefined people, objects, actions, and/or other things. Some messaging applications allow users to select from a predefined library of emojis, which can be sent as part of a message that may include other content (e.g., other multimedia and/or textual content). Animojis are one type of this other multimedia content, in which a user can select an avatar (e.g., a puppet) to represent themselves. An animoji can move and talk as if it were a video of the user. Animojis enable users to create personalized versions of emojis in a fun and creative way.

Embodiments of the present disclosure can provide systems, methods, and computer-readable media for implementing avatar video clip revision and playback techniques. In some examples, a computing device can present a user interface (UI) for tracking a user's face and presenting a virtual avatar representation (e.g., a puppet or video character version of the user's face). Upon identifying a request to record, the computing device can capture audio and video information, extract and detect context as well as facial feature characteristics and voice feature characteristics, revise the audio and/or video information based at least in part on the extracted/identified features, and present a video clip of the avatar using the revised audio and/or video information.

In some embodiments, a computer-implemented method for implementing various audio and video effects techniques may be provided. The method may include displaying a virtual avatar generation interface. The method may also include displaying first preview content of a virtual avatar in the virtual avatar generation interface, where the first preview content of the virtual avatar may correspond to real-time preview video frames of a user headshot in the field of view of a camera and to associated changes in the appearance of the headshot. The method may also include detecting an input in the virtual avatar generation interface while displaying the first preview content of the virtual avatar. In some examples, in response to detecting the input in the virtual avatar generation interface, the method may also include capturing, via the camera, a video signal associated with the user headshot during a recording session; capturing, via a microphone, a user audio signal during the recording session; extracting audio feature characteristics from the captured user audio signal; and extracting facial feature characteristics associated with the face from the captured video signal. Additionally, in response to detecting expiration of the recording session, the method may also include generating an adjusted audio signal from the captured audio signal based at least in part on the facial feature characteristics and the audio feature characteristics; generating second preview content of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal; and presenting the second preview content in the virtual avatar generation interface.

In some embodiments, the method may also include storing facial feature metadata associated with the facial feature characteristics extracted from the video signal, and generating adjusted facial feature metadata from the facial feature metadata based at least in part on the facial feature characteristics and the audio feature characteristics. Additionally, the second preview of the virtual avatar may be displayed further according to the adjusted facial metadata. In some examples, the first preview of the virtual avatar may be displayed according to preview facial feature characteristics identified from changes in the appearance of the face during a preview session.

In some embodiments, an electronic device for implementing various audio and video effects techniques may be provided. The system may include a camera, a microphone, a library of pre-recorded/pre-determined audio, and one or more processors in communication with the camera and the microphone. In some examples, the processors may be configured to execute computer-executable instructions to perform operations. The operations may include detecting an input in a virtual avatar generation interface while displaying a first preview of a virtual avatar. The operations may also include, in response to detecting the input in the virtual avatar generation interface, initiating a capture session. The capture session may include capturing, via the camera, a video signal associated with a face in the field of view of the camera; capturing, via the microphone, an audio signal associated with the captured video signal; extracting audio feature characteristics from the captured audio signal; and extracting facial feature characteristics associated with the face from the captured video signal. In some examples, the operations may also include, at least in response to detecting expiration of the capture session, generating an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics, and presenting second preview content in the virtual avatar generation interface.

In some cases, the audio signal may be further adjusted based at least in part on a type of the virtual avatar. Additionally, the type of the virtual avatar may be received based at least in part on an avatar type selection affordance presented in the virtual avatar generation interface. In some cases, the type of the virtual avatar may include an animal type, and the adjusted audio signal may be generated based at least in part on a predetermined sound associated with the animal type. The use and timing of predetermined sounds may be based on audio features from the captured audio and/or facial features from the captured video. The predetermined sound may itself also be modified based on audio features from the captured audio and facial features from the captured video. In some examples, the one or more processors may be further configured to determine whether a portion of the audio signal corresponds to the face in the field of view. Additionally, in accordance with a determination that the portion of the audio signal corresponds to the face, the portion of the audio signal may be stored for use in generating the adjusted audio signal, and/or, in accordance with a determination that the portion of the audio signal does not correspond to the face, at least the portion of the audio signal may be discarded and not considered for modification and/or playback. Additionally, the audio feature characteristics may include features of a voice associated with the face in the field of view. In some examples, the one or more processors may be further configured to store facial feature metadata associated with the facial feature characteristics extracted from the video signal. In some examples, the one or more processors may be further configured to store audio feature metadata associated with the audio feature characteristics extracted from the audio signal. Further, the one or more processors may be further configured to generate adjusted facial metadata based at least in part on the facial feature characteristics and the audio feature characteristics, and the second preview of the virtual avatar may be generated according to the adjusted facial metadata and the adjusted audio signal.
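The determination of whether a portion of the audio signal corresponds to the face in the field of view can be illustrated with a minimal sketch. The disclosure does not specify how this determination is made; the sketch below assumes, purely for illustration, that the video analysis yields non-overlapping time intervals in which the tracked face's mouth is moving, and that an audio segment is kept when at least half of it overlaps those intervals. The names `AudioSegment`, `MouthInterval`, `split_by_face`, and the `min_ratio` threshold are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AudioSegment:
    start: float  # seconds from the start of the recording
    end: float


@dataclass
class MouthInterval:
    # Hypothetical output of the video analysis: a time span in which
    # the tracked face's mouth was moving. Assumed non-overlapping.
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def split_by_face(segments, mouth_intervals, min_ratio=0.5):
    """Split audio segments into (kept, discarded).

    A segment is kept when at least `min_ratio` of its duration overlaps
    mouth movement of the face in view; otherwise it is discarded and is
    not considered for modification or playback.
    """
    kept, discarded = [], []
    for seg in segments:
        duration = seg.end - seg.start
        covered = sum(
            overlap(seg.start, seg.end, m.start, m.end) for m in mouth_intervals
        )
        if duration > 0 and covered / duration >= min_ratio:
            kept.append(seg)
        else:
            discarded.append(seg)
    return kept, discarded
```

A real system could use any other correspondence test (for example, speaker identification); the overlap heuristic is only one assumed possibility.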

In some embodiments, a computer-readable medium may be provided. The computer-readable medium may include computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform operations. The operations may include performing the following actions in response to detecting a request to generate an avatar video clip of a virtual avatar: capturing, via a camera of an electronic device, a video signal associated with a face in the field of view of the camera; capturing, via a microphone of the electronic device, an audio signal; extracting voice feature characteristics from the captured audio signal; and extracting facial feature characteristics associated with the face from the captured video signal. The operations may also include performing the following actions in response to detecting a request to preview the avatar video clip: generating an adjusted audio signal based at least in part on the facial feature characteristics and the voice feature characteristics, and displaying a preview of the video clip of the virtual avatar using the adjusted audio signal.

In some embodiments, the audio signal may be adjusted based at least in part on a facial expression identified in the facial feature characteristics associated with the face. In some cases, the audio signal may be adjusted based at least in part on a level, pitch, duration, format, or change in a voice characteristic associated with the face. Further, in some embodiments, the one or more processors may be further configured to perform operations including transmitting the video clip of the virtual avatar to another electronic device.

The following detailed description, together with the accompanying drawings, will provide a better understanding of the nature and advantages of the present disclosure.

FIG. 1 is a simplified block diagram illustrating an example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 2 is another simplified block diagram illustrating an example flow for providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 3 is another simplified block diagram illustrating hardware and software components for providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 4 is a flow diagram illustrating providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 5 is another flow diagram illustrating providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 6 is a simplified block diagram illustrating a user interface for providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 7 is another flow diagram illustrating providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 8 is another flow diagram illustrating providing audio and/or video effects techniques as described herein, according to at least one example.
FIG. 9 is a simplified block diagram illustrating a computer architecture for providing audio and/or video effects techniques as described herein, according to at least one example.

Certain embodiments of the present disclosure relate to devices, computer-readable media, and methods for implementing various techniques for providing voice effects (e.g., revised audio) based at least in part on facial expressions. Additionally, in some cases, the various techniques may also provide video effects based at least in part on audio characteristics of a recording. Still further, the various techniques may also provide voice effects and video effects (e.g., together) based at least in part on one or both of the facial expressions and the audio characteristics of a recording. In some examples, the voice effects and/or video effects may be presented in a user interface (UI) configured to display a cartoon representation of a user (e.g., an avatar or digital puppet). Such an avatar that represents a user may be considered an animoji, because it may look like an emoji character familiar to most smartphone users, yet it can be animated to mimic the user's actual movements.

For example, a user of a computing device may be presented with a UI for generating an animoji video (e.g., a video clip). The video clip may be limited to a predetermined amount of time (e.g., 10 seconds, 30 seconds, or the like), or the video clip may be unlimited. In the UI, a preview area may present the user with a real-time representation of their face using an avatar character. Various avatar characters may be provided, and a user may even be able to generate or import their own avatars. The preview area may be configured to provide an initial preview of the avatar and a preview of the recorded video clip. Additionally, the recorded video clip may be previewed in its original form (e.g., without any video or audio effects) or it may be previewed with audio and/or video effects. In some cases, the user may select an avatar after the initial video clip has been recorded. The video clip preview may then change from one avatar to another, with the same or different video effects applied as appropriate. For example, if the user switches avatar characters while the raw preview (e.g., the original form, without effects) is being viewed, the UI may be updated to display a rendering of the same video clip but with the newly selected avatar. In other words, the facial features and audio (e.g., the user's voice) captured during the recording can be presented through any of the avatars (e.g., without any effects). In the preview, it will appear as if the avatar character is moving the same way the user moved during the recording and saying what the user said during the recording.

As an example, a user may select a first avatar (e.g., a unicorn head) via the UI, or a default avatar may be provided initially. The UI will present the avatar in the preview area (in this example, the head of a cartoon unicorn if selected by the user, or any other available puppet by default), and the device will begin capturing audio and/or video information (e.g., using one or more microphones and/or one or more cameras). In some cases, only video information is needed for the initial preview screen. The video information can be analyzed, and facial features can be extracted. These extracted facial features can then be mapped to the unicorn face in real time, such that the initial preview of the unicorn head appears to mirror the user's head. In some cases, the term "real time" is used to indicate that the results of the extraction, mapping, rendering, and presentation are performed in response to each motion of the user and can be presented substantially immediately. To the user, it will appear as if they are looking in a mirror, except that the image of their face is replaced with an avatar.

While the user's face is within the line of sight (e.g., the view) of a camera of the device, the UI will continue to present the initial preview. Upon selection of a record affordance (e.g., a virtual button) on the UI, the device may begin to capture video that has an audio component. In some examples, this includes a camera capturing frames and a microphone capturing audio information. A special camera capable of capturing three-dimensional (3D) information may also be utilized. Additionally, in some examples, any camera capable of capturing video may be utilized. The video may be stored in its original form and/or metadata associated with the video may be stored. As such, capturing the video and/or audio information may be different from storing the information. For example, capturing the information may include sensing the information and at least caching it so that it is available for processing. The processed data may also be cached until it is determined whether to store the data or simply make use of it. For example, during the initial preview, while the user's face is being presented as a puppet in real time, the video data (e.g., metadata associated with the data) may be cached while it is mapped to the puppet and presented. However, this data may never be stored permanently, in which case the initial preview is not reusable or recoverable.
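The distinction drawn above between capturing (sensing and caching) and storing can be sketched as follows. This is an illustrative sketch only, not the disclosed implementation: it assumes a small bounded cache that holds the most recent facial metadata for the live preview, and a separate list that retains metadata only while a recording is in progress. The class name `CaptureBuffer` and its `cache_size` parameter are hypothetical.

```python
from collections import deque


class CaptureBuffer:
    """Separates transient caching (live preview) from retention (recording)."""

    def __init__(self, cache_size=3):
        # Bounded cache: old preview metadata is dropped automatically,
        # so the initial preview is not recoverable later.
        self._cache = deque(maxlen=cache_size)
        self._stored = []
        self.recording = False

    @property
    def cached(self):
        """Most recent metadata available for mapping onto the puppet."""
        return list(self._cache)

    def start_recording(self):
        """Called when the record affordance is selected."""
        self._stored = []
        self.recording = True

    def stop_recording(self):
        """End the recording and return everything retained during it."""
        self.recording = False
        return list(self._stored)

    def on_frame(self, face_metadata):
        """Handle the facial metadata extracted from one captured frame."""
        self._cache.append(face_metadata)
        if self.recording:
            self._stored.append(face_metadata)
```

With this split, frames seen before the record affordance is selected drive the preview but leave nothing behind once they fall out of the cache.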

Alternatively, in some examples, once the user selects the record affordance of the UI, the video data and audio data may be stored more permanently. In this way, the audio and video (A/V) data may be analyzed, processed, and so on in order to provide the audio and video effects described herein. In some examples, the video data may be processed to extract facial features (e.g., facial feature characteristics), and those facial features may be stored as metadata for the animoji video clip. The set of metadata may be stored with an identifier (ID) that indicates the time, date, and user associated with the video clip. Additionally, the audio data may be stored with the same or a different ID. Once stored, or in some examples prior to storage, the system (e.g., processors of the device) may extract audio feature characteristics from the audio data and facial feature characteristics from the video file. This information can be utilized to identify the context, keywords, intent, and/or emotions of the user, and video and audio effects can be introduced into the audio and video data prior to rendering the puppet. In some examples, the audio signal may be adjusted to include different words, sounds, tones, pitches, timing, and so on, based at least in part on the extracted features. Additionally, in some examples, the video data (e.g., the metadata) may also be adjusted. In some examples, the audio features are extracted in real time during the preview itself. These audio features may be avatar-specific, generated only when the associated avatar is being previewed. Alternatively, the audio features may be avatar-agnostic, generated for all avatars. The audio signal may also be adjusted based in part on these real-time audio feature extractions, together with previously stored extracted video features that were generated during or after the recording process but before the preview.
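As a rough illustration of storing extracted features as metadata under an identifier that indicates the time, date, and user, consider the sketch below. The record layout, the in-memory `store` dictionary, and the key format are all assumptions made for illustration; the disclosure does not define a storage schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ClipId:
    # Hypothetical identifier combining the user and the recording timestamp.
    user: str
    recorded_at: datetime

    def as_key(self) -> str:
        return f"{self.user}:{self.recorded_at.isoformat()}"


@dataclass
class ClipMetadata:
    clip_id: ClipId
    # Per-frame facial feature characteristics extracted from the video.
    facial_features: list = field(default_factory=list)
    # Audio feature characteristics; here stored under the same ID,
    # although the text notes a different ID could be used.
    audio_features: list = field(default_factory=list)


store = {}  # stand-in for persistent storage


def save(meta: ClipMetadata) -> str:
    """Store the metadata under its ID and return the storage key."""
    key = meta.clip_id.as_key()
    store[key] = meta
    return key


clip = ClipMetadata(
    clip_id=ClipId("user-1", datetime(2019, 2, 26, 12, 0, tzinfo=timezone.utc)),
    facial_features=[{"t": 0.0, "brow_furrow": 0.2}],
    audio_features=[{"t": 0.0, "pitch_hz": 180.0}],
)
key = save(clip)
```

Keeping the features as metadata, separate from the raw A/V data, is what allows the same recording to be re-rendered later with a different avatar or different effects.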

Once the video and audio data have been adjusted based at least in part on the extracted characteristics, a second preview of the puppet can be rendered. This rendering may be performed for each possible puppet, such that the adjusted data is already rendered as the user scrolls through and selects different puppets. Or the rendering can be performed after the selection of each puppet. In either case, once the user selects a puppet, the second preview can be presented. The second preview will replay the video clip that was recorded by the user, but with the adjusted audio and/or video. Continuing the example above, if the user recorded themselves in an angry tone (e.g., with a gruff voice and a furrowed brow), the context or intent of anger may be detected, and the audio file may be adjusted to include a growling sound. Thus, the second preview would look like a unicorn saying the words that the user said, but the user's voice may be adjusted to sound like a growl, or to make the tone more baritone (e.g., lower). The user could then save the second preview or select it for transmission to another user (e.g., through a messaging application or the like). In some examples, the animoji video clips described above and below can be shared as .mov files. However, in other examples, the described techniques can be used in real time (e.g., with video messaging or the like).
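The angry-tone example can be sketched as a small decision step that turns extracted characteristics into an effect plan. Everything specific here is an assumption for illustration: the normalized `brow_furrow` and `voice_roughness` scores, the 0.6 threshold, the four-semitone shift, and the `"growl"` label are hypothetical. The only fixed fact used is the standard relation between a shift of n semitones and a frequency ratio of 2^(n/12).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EffectPlan:
    pitch_semitones: float = 0.0          # negative values lower the voice
    overlay_sound: Optional[str] = None   # name of a predetermined sound, if any

    @property
    def pitch_ratio(self) -> float:
        """Frequency ratio for the shift: n semitones -> 2 ** (n / 12)."""
        return 2.0 ** (self.pitch_semitones / 12.0)


def plan_effect(brow_furrow: float, voice_roughness: float,
                threshold: float = 0.6) -> EffectPlan:
    """Pick an audio effect from facial and voice characteristics.

    Both inputs are assumed to be normalized scores in [0, 1]. When the
    face and the voice both suggest anger, lower the pitch and add a growl;
    otherwise leave the audio unchanged.
    """
    if brow_furrow >= threshold and voice_roughness >= threshold:
        return EffectPlan(pitch_semitones=-4.0, overlay_sound="growl")
    return EffectPlan()
```

Requiring both the facial and the vocal cue is a design choice of this sketch; the text only states that the adjustment is based at least in part on such characteristics.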

FIG. 1 is a simplified block diagram illustrating an example flow 100 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording. In the example flow 100, there are two separate sessions: a recording session 102 and a playback session 104. In the recording session 102, a device 106 may capture video having an audio component of a user 108 at block 110. In some examples, the video and audio may be captured (e.g., collected) separately, using two different devices (e.g., a microphone and a camera). The capturing of video and audio may be triggered based at least in part on selection of a record affordance by the user 108. In some examples, the user 108 may say the word "hello" at block 112. Additionally, at block 112, the device 106 may continue to capture the video and/or audio components of the user's actions. At block 114, the device 106 may continue capturing the video and audio components, and in this example, the user 108 may say the word "bark." At block 114, the device 106 may also extract the spoken words from the audio information. However, in other examples, the spoken word extraction (or any audio feature extraction) may actually take place after the recording session 102 is complete. In other examples, the spoken word extraction (or any audio feature extraction) may actually take place in real time during preview block 124. The extraction (e.g., analysis of the audio) may also be performed in real time while the recording session 102 is still in progress. In any case, an avatar process executed by the device 106 may be able to identify, through the extraction, that the user said the word "bark," and may employ some logic to determine which audio effects to implement.

By way of example, recording session 102 may end when user 108 selects the record affordance again (e.g., indicating a desire to end the recording), when user 108 selects an end-recording affordance (e.g., the record affordance may act as an end-recording affordance while recording), or based at least in part on expiration of a time period (e.g., 10 seconds, 30 seconds, or the like). In some cases, this time period may be automatically predetermined, while in others it may be user-selected (e.g., selected from a list of options or entered in free form through a text entry interface). Once the recording is complete, user 108 may select a preview affordance, indicating that user 108 wishes to view a preview of the recording. One option may be to play the original recording without any visual or audio effects. However, another option may be to play a revised version of the video clip. Based at least in part on detection of the spoken word "bark," the avatar process may have revised the audio and/or video of the video clip.

At block 116, device 106 may present an avatar (also called a puppet and/or animoji) 118 on a screen. Device 106 may also be configured with a speaker 120 that can play audio associated with the video clip. In this example, block 116 corresponds to the same point in time as block 110, at which user 108 may have opened their mouth but not yet spoken. As such, avatar 118 may be presented with its mouth open, but no audio is yet presented from speaker 120. At block 122, which corresponds to block 112 where user 108 said "hello," the avatar process may present avatar 118 with an avatar-specific voice. In other words, at block 122, a predefined dog voice may be used to say the word "hello." The word "hello," spoken in the dog voice, may be presented by speaker 120. As described in more detail below, a variety of different animal (and other character) avatars are available for selection by user 108. In some examples, each avatar may be associated with a particular predefined voice that best fits that avatar. For example, a dog may have a dog voice, a cat may have a cat voice, a pig may have a pig voice, and a robot may have a robot voice. These avatar-specific voices may be pre-recorded, or they may be associated with particular frequency or audio transformations, carried out by executing mathematical operations on the original sound, so that any user's voice can be transformed to sound like the dog voice. However, each user's dog voice may sound different based at least in part on the particular audio transformation performed.

At block 124, the avatar process may replace the spoken word (e.g., "bark") with an avatar-specific word. In this example, the sound of a dog bark (e.g., a recorded or simulated dog bark) may be inserted into the audio data (e.g., in place of the word "bark") so that, when played back during presentation of the video clip, a "woof" is presented by speaker 120. In some examples, at 124, different avatar-specific words will be presented based at least in part on different avatar selections, while in other examples the same avatar-specific word may be presented regardless of the avatar selection. For example, if user 108 said "bark," a "woof" may be presented when the dog avatar is selected. However, in this same case, if user 108 later selects the cat avatar for the same flow, there are several options for revising the audio. In one example, the process may convert "bark" to "woof" even though it would not be fitting for a cat to "woof." In a different example, the process may convert "bark" to a recorded or simulated "meow," based at least in part on the selection of the cat avatar. In yet another example, the process may ignore "bark" for avatars other than the dog avatar. As such, there may be a second level of audio feature analysis performed even after the extraction at 114. Video and audio features may also influence the processing of avatar-specific utterances. For example, the level, pitch, and intonation with which the user says "bark" may be detected as part of the audio feature extraction, and this may direct the system to select a particular "woof" sample, or to transform such a sample, before and/or during the preview process.
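The three options for handling a spoken keyword across avatar selections can be sketched as a small lookup. This is only an illustration under stated assumptions: the `Policy` names, the two tables, and the function `replacement_for` are hypothetical and do not appear in the disclosure, which does not prescribe any particular data structure.

```python
from enum import Enum
from typing import Optional


class Policy(Enum):
    """Hypothetical names for the three options described above."""
    SAME_SOUND_FOR_ALL = "same"      # "bark" always becomes "woof"
    AVATAR_SPECIFIC = "specific"     # "bark" becomes the selected avatar's own sound
    IGNORE_IF_MISMATCH = "ignore"    # "bark" is replaced only for the dog avatar


# Illustrative tables (assumptions, not part of the disclosure).
KEYWORD_OWNER = {"bark": "dog"}               # which avatar a keyword belongs to
AVATAR_SOUND = {"dog": "woof", "cat": "meow"}  # each avatar's own sound


def replacement_for(word: str, avatar: str, policy: Policy) -> Optional[str]:
    """Return the sound to insert in place of `word`, or None to keep the word."""
    owner = KEYWORD_OWNER.get(word)
    if owner is None:
        return None  # not a keyword, so the audio is left untouched
    if policy is Policy.SAME_SOUND_FOR_ALL:
        return AVATAR_SOUND[owner]
    if policy is Policy.AVATAR_SPECIFIC:
        return AVATAR_SOUND.get(avatar)  # None if the avatar has no sound defined
    # IGNORE_IF_MISMATCH: replace only when the keyword's avatar is selected
    return AVATAR_SOUND[owner] if avatar == owner else None
```

With the cat avatar selected, `replacement_for("bark", "cat", policy)` yields "woof", "meow", or `None` for the three policies in turn, matching the three examples in the paragraph above.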

FIG. 2 is another simplified block diagram illustrating an example flow 200 for providing audio and/or video effects based at least in part on audio and/or video features detected in a user's recording. In example flow 200, much like example flow 100 of FIG. 1, there are two separate sessions: a recording session 202 and a playback session 204. In recording session 202, device 206 may capture video having an audio component of user 208 at block 210. The capturing of video and audio may be triggered based at least in part on selection of a record affordance by user 208. In some examples, user 208 may say the word "hello" at block 212. Additionally, at block 212, device 206 may continue to capture the video and/or audio components of the user's actions. At block 214, device 206 may continue capturing the video and audio components, and in this example, user 208 may hold their mouth open without saying anything. At block 214, device 206 may also extract facial expressions from the video. However, in other examples, the facial feature extraction (or any video feature extraction) may actually take place after recording session 202 is complete. Alternatively, the extraction (e.g., analysis of the video) may be performed in real time while recording session 202 is still in progress. In any case, the avatar process executed by device 206 may identify through the extraction that the user opened their mouth for a moment (e.g., without saying anything) and may employ some logic to determine which audio and/or video effects to implement. In some examples, the determination that the user held their mouth open without saying anything may require extraction and analysis of both the audio and the video. For example, extraction of the facial feature characteristics (e.g., an open mouth) may not be sufficient, and the process may also need to detect that user 208 said nothing during the same period of the recording. Video and audio features may also influence the processing of avatar-specific utterances. For example, the duration for which the mouth is open, the eyes are open, and so on may direct the system to select a particular "woof" sample, or to transform such a sample, before and/or during the preview process. One such transformation changes the level and/or duration of the woof to match the detected opening and closing of the user's mouth.

By way of example, recording session 202 may end when user 208 selects the record affordance again (e.g., indicating a desire to end the recording), when user 208 selects an end-recording affordance (e.g., the record affordance may act as an end-recording affordance while recording), or based at least in part on expiration of a time period (e.g., 20 seconds, 30 seconds, or the like). Once the recording has ended, user 208 may select a preview affordance, indicating that user 208 wishes to view a preview of the recording. One option may be to play the original recording without any visual or audio effects. However, another option may be to play a revised version of the recording. Based at least in part on detection of the facial expression (e.g., the open mouth), the avatar process may have revised the audio and/or video of the video clip.

At block 216, device 206 may present an avatar (also called a puppet and/or animoji) 218 on a screen of device 206. Device 206 may also be configured with a speaker 220 that can play audio associated with the video clip. In this example, block 216 corresponds to the same point in time as block 210, at which user 208 may not yet have spoken. As such, avatar 218 may be presented with its mouth open, but no audio is yet presented from speaker 220. At block 222, which corresponds to block 212 where user 208 said "hello," the avatar process may present avatar 218 with an avatar-specific voice (as described above).

At block 224, the avatar process may replace the silence identified at block 214 with an avatar-specific word. In this example, the sound of a dog bark (e.g., a recorded or simulated dog bark) may be inserted into the audio data (e.g., in place of the silence) so that, when played back during presentation of the video clip, a "woof" is presented by speaker 220. In some examples, at 224, different avatar-specific words will be presented based at least in part on different avatar selections, while in other examples the same avatar-specific word may be presented regardless of the avatar selection. For example, if user 208 held their mouth open, a "woof" may be presented when the dog avatar is selected, a "meow" sound may be presented for the cat avatar, and so on. In some cases, each avatar may have a predefined sound to be played when it is detected that user 208 has held their mouth open for some amount of time (e.g., half a second, a full second, or the like) without speaking. However, in some examples, the process may ignore the detection of an open mouth for avatars that have no predefined effect for that facial feature. Additionally, there may be a second level of audio feature analysis performed even after the extraction at 214. For example, if the process determines (e.g., based on the detection of the open mouth) that a "woof" is to be inserted for the dog avatar, the process may also detect how many "woof" sounds to insert (e.g., if the user held their mouth open for twice the length of time used to indicate one bark), or whether it is not possible to insert the number of barks requested (e.g., in the scenario of FIG. 1, where the user would say "bark" to indicate that a "woof" sound should be inserted). Thus, based on the two examples above, it should be evident that user 208 can control the effects of the playback (e.g., the recorded avatar message) with their facial and vocal expressions. Further, while not shown explicitly in either FIG. 1 or FIG. 2, the user device may be configured with software for executing the avatar process (e.g., capturing the A/V information, extracting features, analyzing the data, implementing the logic, revising the audio and/or video files, and rendering the previews) as well as software for executing an application (e.g., an avatar application with its own UI) that enables the user to build avatar messages and subsequently send them to other user devices.
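The open-mouth case combines a video feature (mouth open) with an audio feature (nothing spoken) and then decides how many sounds to insert. The sketch below shows one way this logic might look. The `Frame` fields, the 0.5-second minimum hold, and the rule of one sound per 0.5 seconds are assumptions chosen for illustration; the disclosure gives 0.5 seconds and 1 second only as example thresholds.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """One analysis frame; field names are hypothetical."""
    time_s: float      # timestamp of the frame in seconds
    mouth_open: bool   # result of facial feature extraction
    speaking: bool     # result of audio feature extraction


def silent_open_mouth_spans(frames: List[Frame],
                            min_hold_s: float = 0.5) -> List[Tuple[float, float]]:
    """Return (start, end) spans where the mouth is open and nothing is said."""
    spans = []
    start = None
    prev_time = None
    for f in frames:
        active = f.mouth_open and not f.speaking  # both features must agree
        if active and start is None:
            start = f.time_s
        elif not active and start is not None:
            if f.time_s - start >= min_hold_s:
                spans.append((start, f.time_s))
            start = None
        prev_time = f.time_s
    # Close a span that is still open when the recording ends.
    if start is not None and prev_time - start >= min_hold_s:
        spans.append((start, prev_time))
    return spans


def sounds_to_insert(span: Tuple[float, float], single_sound_s: float = 0.5) -> int:
    """One sound per `single_sound_s` of held-open mouth, and at least one."""
    start, end = span
    return max(1, int((end - start) // single_sound_s))
```

Under these assumptions, a mouth held open for twice the single-sound duration produces two sounds, which mirrors the "twice the length of time" example in the text.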

FIG. 3 is a simplified block diagram 300 illustrating components (e.g., software modules) utilized by the avatar process described above and below. In some examples, more or fewer modules may be utilized to implement the provision of audio and/or video effects based at least in part on audio and/or video features detected in a user's recording. In some examples, device 302 may be configured with a camera 304, a microphone 306, and a display screen for presenting a UI and the avatar previews (e.g., the initial preview before recording as well as the preview of the recording before sending). In some examples, the avatar process is configured with an avatar engine 308 and a voice engine 310. Avatar engine 308 may manage the list of avatars, process the video features (e.g., facial feature characteristics), revise the video information, communicate with voice engine 310 when appropriate, and render video of the avatar 312 when all processing is complete and effects have been implemented (or discarded). Revising the video information may include adjusting or otherwise editing metadata associated with the video file. In this way, when the video metadata (adjusted or not) is used to render the puppets, the facial features can be mapped to the puppet. In some examples, voice engine 310 may store the audio information, perform the logic for determining which effects to implement, revise the audio information, and provide revised audio 314 when all processing is complete and effects have been implemented (or discarded).

In some examples, once the user chooses to record a new avatar video clip, video features 316 may be captured by camera 304 and audio features 318 may be captured by microphone 306. In some cases, as many as 50 (or more) facial features may be detected within video features 316. Example video features include, but are not limited to, the duration of expressions, an open mouth, frowns, smiles, eyebrows raised or furrowed, and the like. Additionally, video features 316 may include only metadata that identifies each of the facial features (e.g., data points that indicate which locations on the user's face moved or which positions are where). Further, video features 316 may be passed to avatar engine 308 and voice engine 310. At avatar engine 308, the metadata associated with video features 316 may be stored and analyzed. In some examples, avatar engine 308 may perform the feature extraction from the video file prior to storing the metadata. However, in other examples, the feature extraction may be performed before video features 316 are sent to the avatar engine (in which case video features 316 would be the metadata itself). At voice engine 310, video features 316 may be compared with audio features 318 when it is helpful to match which audio features correspond to which video features (e.g., to see whether certain audio and video features occur at the same time).

In some cases, the audio features are also passed to voice engine 310 for storage. Example audio features include, but are not limited to, level, pitch, and dynamics (e.g., changes in level, pitch, voicing, formants, duration, etc.). Raw audio 320 includes the unprocessed audio file as it was captured. Raw audio 320 may be passed to voice engine 310 for further processing and potential (e.g., eventual) revision, and it may also be stored separately so that the original audio can be used if desired. Raw audio 320 may also be passed to voice recognition module 322. Voice recognition module 322 may be used to spot words and to identify the user's intent from their voice. For example, voice recognition module 322 may determine when the user is angry, sad, happy, or the like. Additionally, when the user says a key word (e.g., "bark," as described above), voice recognition module 322 will detect it. The information detected and/or collected by voice recognition module 322 may then be passed to voice engine 310 for further logic and/or processing. As noted, in some examples, audio features are extracted in real time during the preview itself. These audio features may be avatar-specific, generated only when the associated avatar is being previewed, or avatar-agnostic, generated for all avatars. The audio signal may also be adjusted based in part on these real-time audio feature extractions together with pre-stored extracted video features that were created during or after the recording process but before the preview. Additionally, some feature extraction may be performed by voice engine 310 during rendering at 336. Pre-stored sounds 338 may be used by voice engine 310, as appropriate, to fill in blanks or to replace other sounds that were extracted.

In some examples, voice engine 310 decides what to do with the information extracted by voice recognition module 322. In some examples, voice engine 310 may pass the information from voice recognition module 322 to feature module 324 to determine which features correspond to the data extracted by voice recognition module 322. For example, feature module 324 may indicate (e.g., based on a set of rules and/or logic) that a sad voice detected by voice recognition module 322 corresponds to a raising of the pitch of the voice or a slowing of the speed or cadence of the voice. In other words, feature module 324 may map the extracted audio features to particular voice features. Effect type module 326 may then map the particular voice features to the desired effect. Voice engine 310 may also be responsible for storing each particular voice for each possible avatar. For example, there may be standard or hardcoded voices for each avatar. If no other changes are made and the user selects a particular avatar, voice engine 310 may select the appropriate standard voice for use with playback. In this case, modified audio 314 may simply be raw audio 320 transformed into the appropriate avatar voice based on the selected avatar. As the user scrolls through the avatars and selects different ones, voice engine 310 may modify raw audio 320 on the fly so that it sounds like the newly selected avatar. Accordingly, avatar type 328 needs to be provided to voice engine 310 to make this change. However, if an effect is to be provided (e.g., the pitch, tone, or actual words are to be changed within the audio file), voice engine 310 may revise raw audio file 320 and provide modified audio 314. In some examples, the user will be given the option, at on/off 330, to use the original audio file. If the user selects "off" (e.g., effects off), raw audio 320 may be combined with video of avatar 312 (e.g., corresponding to the unchanged video) to produce A/V output 332. A/V output 332 may be provided to the avatar application presented on the UI of device 302.
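The chain from detected mood to voice features to effect parameters, together with the on/off choice, can be sketched as two rule tables and one planning function. Everything named here is an assumption made for illustration: the table contents, the parameter names `pitch_semitones` and `speed`, and the function `plan_audio` are hypothetical stand-ins for feature module 324, effect type module 326, and on/off 330, none of which the disclosure specifies at this level of detail.

```python
from typing import Dict, List, Optional

# Hypothetical rule table standing in for feature module 324:
# a detected mood maps to one or more voice features.
MOOD_TO_FEATURES: Dict[str, List[str]] = {
    "sad": ["raise_pitch", "slow_cadence"],
    "angry": ["lower_pitch"],
}

# Hypothetical rule table standing in for effect type module 326:
# a voice feature maps to concrete effect parameters (values are illustrative).
FEATURE_TO_EFFECT: Dict[str, Dict[str, float]] = {
    "raise_pitch": {"pitch_semitones": 2.0},
    "lower_pitch": {"pitch_semitones": -3.0},
    "slow_cadence": {"speed": 0.8},
}


def plan_audio(mood: Optional[str], avatar: str, effects_on: bool) -> dict:
    """Decide which audio to pair with the avatar video and how to alter it."""
    if not effects_on:
        # Effects off: the raw audio is used unchanged.
        return {"source": "raw", "voice": None, "params": {}}
    params: Dict[str, float] = {}
    for feature in MOOD_TO_FEATURES.get(mood or "", []):
        params.update(FEATURE_TO_EFFECT[feature])
    # With no mood-driven effect, the result is just the avatar's standard voice.
    return {"source": "modified", "voice": avatar, "params": params}
```

Keeping the two mappings as separate tables follows the two-module split in the text, so the mood rules and the effect parameters could be changed independently.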

Avatar engine 308 may be responsible for providing the initial avatar image based at least in part on the selection of avatar type 328. Additionally, avatar engine 308 is responsible for mapping video features 316 to the appropriate facial markers of each avatar. For example, if video features 316 indicate that the user is smiling, the metadata that indicates a smile may be mapped to the mouth area of the selected avatar so that the avatar appears to be smiling in video of avatar 312. Additionally, avatar engine 308 may receive timing changes 334 from the voice engine, as appropriate. For example, if voice engine 310 determines (e.g., based on feature module 324 and/or effect type 326 and/or the avatar type) that the voice effect is to make the audio more of a whispering voice, and modifies the voice to a whisper, this effect change may include slowing down the voice itself, in addition to a reduced level and other formant and pitch changes. Accordingly, the voice engine may produce modified audio that plays back more slowly than the original audio file for the audio clip. In this scenario, voice engine 310 would need to instruct avatar engine 308, via timing changes 334, so that the video file can be slowed down appropriately; otherwise, the video and audio would not be synchronized.
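The need for timing changes 334 follows from simple arithmetic: if the audio is slowed by some factor, every video timestamp has to be stretched by the same factor to stay aligned. A minimal sketch follows, assuming that a timing change can be expressed as a single uniform speed factor. The function name `retime_video` and that uniform-factor model are assumptions, since the disclosure does not say how timing changes are encoded.

```python
from typing import List


def retime_video(frame_times_s: List[float], audio_speed: float) -> List[float]:
    """Stretch video frame timestamps to match retimed audio.

    `audio_speed` is the playback speed of the modified audio relative to the
    original (0.5 means half speed, so the clip lasts twice as long).
    """
    if audio_speed <= 0:
        raise ValueError("audio_speed must be positive")
    return [t / audio_speed for t in frame_times_s]
```

For instance, with audio slowed to half speed, a frame originally shown at 1.0 second would be shown at 2.0 seconds. A real effect could vary the speed over time, in which case a per-segment mapping would be needed instead of one factor.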

As noted, the user may use the avatar application of device 302 to select different avatars. In some examples, the voice effect may change based at least in part on this selection. However, in other examples, the user may be given the opportunity to select a different voice for a given avatar (e.g., the cat voice for the dog avatar, etc.). This type of free-form voice effect change may be carried out by the user through a selection on the UI or, in some cases, through voice activation or facial movement. For example, a certain facial expression may trigger voice engine 310 to change the voice effect for a given avatar. Further, in some examples, voice engine 310 may be configured to make children's voices sound higher pitched or, alternatively, to decide not to raise the pitch of a child's voice, because raw audio 320 for a child's voice may already be high pitched and raising it further might sound inappropriate. Making such a user-specific determination about an effect may be driven in part by the extracted audio features, which in this case may include pitch values and ranges throughout the recording.
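The user-specific decision about raising pitch can be illustrated with a check on the pitch values extracted from the recording. This is a sketch only: the function name `should_raise_pitch`, the use of the median, the convention that 0 marks an unvoiced frame, and the 250 Hz ceiling are all assumptions picked for illustration and are not values from the disclosure.

```python
from statistics import median
from typing import Sequence


def should_raise_pitch(pitch_track_hz: Sequence[float],
                       ceiling_hz: float = 250.0) -> bool:
    """Return True if raising the pitch is unlikely to sound unnatural.

    `pitch_track_hz` holds per-frame pitch estimates, with 0 marking unvoiced
    frames (an assumed convention). The effect is skipped when the speaker's
    typical pitch is already at or above `ceiling_hz`.
    """
    voiced = [p for p in pitch_track_hz if p > 0]
    if not voiced:
        return False  # no pitch information, so leave the audio alone
    return median(voiced) < ceiling_hz
```

The median is used here rather than the mean so that a few stray pitch estimates do not flip the decision. A fuller version might also look at the pitch range, which the text mentions as an available feature.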

In some examples, voice recognition module 322 may include a recognition engine, a word spotter, a pitch analyzer, and/or a formant analyzer. The analysis performed by voice recognition module 322 may be able to identify whether the user is upset, angry, happy, and so on. Additionally, voice recognition module 322 may be able to identify the context and/or intonation of the user's voice, as well as change the intention of the wording and/or determine a profile (e.g., a virtual identity) of the user.

In some examples, avatar process 300 may be configured to package/render the video clip by combining the video of avatar 312 with either modified audio 314 or raw audio 320 into A/V output 332. To package the two, voice engine 310 only needs to know an ID for the metadata associated with the video of avatar 312 (e.g., it does not actually need the video of avatar 312, only the ID of the metadata). A message within a messaging application (e.g., the avatar application) can be sent to other computing devices, where the message includes A/V output 332. When the user selects a "send" affordance in the UI, the last video clip to be previewed can be sent. For example, if the user previews their video clip with a dog avatar and then switches to a cat avatar for preview, the cat avatar video is sent when the user selects "send." Additionally, the state of the last preview can be stored and used later. For example, if the last message (e.g., avatar video clip) sent used a particular effect, the first preview of the next message being generated can use that particular effect.

The logic implemented by voice engine 310 and/or avatar engine 308 can check for certain cues and/or features, and then revise the audio and/or video files to implement the desired effect. One example feature/effect pair involves detecting that the user has opened their mouth and paused for a moment. In this example, both the facial feature characteristics (e.g., mouth open) and the audio feature characteristics (e.g., silence) need to occur at the same time for the desired effect to be implemented. For this feature/effect pair, the desired effect is to revise the audio and video so that the avatar appears to make an avatar/animal-specific sound. For example, a dog makes a bark sound, a cat makes a meow sound, and a monkey, horse, unicorn, etc. makes the sound appropriate for that character/animal. Another example feature/effect pair lowers the audio pitch and/or tone when a frown is detected. In this example, only the video feature characteristics need to be detected. However, in some examples, this effect could be implemented based at least in part on voice recognition module 322 detecting sadness in the user's voice. In that case, video features 316 would not be needed at all. Another example feature/effect pair is whispering, which causes the audio and video to slow down, the tone to be lowered, and/or variations to be reduced. In some cases, video changes can lead to modifications of the audio, while in other cases, audio changes can lead to modifications of the video.
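The feature/effect pairs described above can be read as a small rule table. The following Python sketch is only an illustration of that reading: the `Frame` fields and the effect names are hypothetical and are not taken from the disclosure, which does not specify how the features are encoded.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Frame:
    """Features for one moment of a recording (hypothetical field names)."""
    mouth_open: bool   # facial feature
    silent: bool       # audio feature
    frowning: bool     # facial feature
    whispering: bool   # audio feature


def effects_for(frame: Frame) -> list[str]:
    """Map co-occurring features to effect names (hypothetical labels)."""
    effects = []
    # The animal-sound effect needs BOTH a facial and an audio feature.
    if frame.mouth_open and frame.silent:
        effects.append("animal_sound")
    # Lowering pitch needs only a video feature (a frown).
    if frame.frowning:
        effects.append("lower_pitch")
    # Whispering slows the audio and video down.
    if frame.whispering:
        effects.append("slow_down")
    return effects
```

An open mouth while the user is still speaking triggers nothing in this sketch, which mirrors the requirement that the two characteristics occur simultaneously.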

As noted above, in some examples, avatar engine 308 may act as the feature extractor, in which case video features 316 and audio features 318 may not exist prior to being sent to avatar engine 308. Instead, raw audio 320 and metadata associated with the raw video may be passed to avatar engine 308, where avatar engine 308 may extract the audio feature characteristics and the video (e.g., facial) feature characteristics. In other words, while not drawn this way in FIG. 3, parts of avatar engine 308 may actually exist within camera 304. Additionally, in some examples, metadata associated with video features 316 can be stored in a secure container, and when voice engine 310 is running, it can read the metadata from the container.

In some instances, because the preview video clip of the avatar is not displayed in real time (e.g., it is rendered and displayed after the video is recorded, and sometimes only in response to selection of a play affordance), the audio and video information can be processed offline (e.g., not in real time). As such, avatar engine 308 and voice engine 310 can read ahead in the audio and video information and make context decisions up front. Voice engine 310 can then revise the audio file accordingly. This ability to read ahead and make decisions offline can greatly increase the efficiency of the system, especially for longer recordings. Additionally, it enables a second stage of analysis in which additional logic can be processed. Thus, the entire audio file can be analyzed before any final decisions are made. For example, if the user says "bark" twice in a row, but the two words are spoken too closely together, the pre-recorded actual "woof" sound may not fit in the time it took the user to say "bark, bark." In this case, voice engine 310 can take the information from voice recognition 322 and decide to ignore the second "bark," because it will not be possible to include both "woof" sounds in the audio file.

As noted above, when the audio file and the video are packaged together to make A/V output 332, the voice engine does not actually need to access the video of avatar 312. Instead, the video file (e.g., a .mov format file or the like) is created as the video is being played, by accessing an array of features (e.g., floating-point values) that were written to the metadata file. However, all permutations/adjustments to the audio and video files can be done in advance, and some can even be done in real time as the audio and video are extracted. Additionally, in some examples, each modified video clip can be saved temporarily (e.g., cached), so that if the user reselects an avatar that has already been previewed, the processing to generate/render that particular preview does not need to be duplicated. As opposed to re-rendering the revised video clip each time the same avatar is selected during the preview section, caching the rendered video clips can yield large savings in processor power and instructions per second (IPS), especially for longer recordings and/or recordings with a large number of effects.
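The caching behavior described above can be sketched as a memoized lookup keyed by avatar. This is a minimal illustration only: `PreviewCache` and the `render` callable are hypothetical names, and the disclosure does not say how or where the rendered clips are stored.

```python
class PreviewCache:
    """Keep each rendered preview so that reselecting an avatar is free."""

    def __init__(self, render):
        # `render` is a hypothetical callable: avatar_id -> rendered clip.
        self._render = render
        self._clips = {}

    def get(self, avatar_id):
        # Render only on the first request for this avatar.
        if avatar_id not in self._clips:
            self._clips[avatar_id] = self._render(avatar_id)
        return self._clips[avatar_id]
```

With this sketch, previewing a dog avatar, then a cat avatar, then the dog avatar again invokes `render` twice rather than three times.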

Additionally, in some examples, noise suppression algorithms can be employed to handle cases where the sound captured by microphone 306 includes sounds other than the user's voice, for example, when the user is in a windy area or a loud room (e.g., a restaurant or bar). In these examples, a noise suppression algorithm could lower the decibel output of certain parts of the audio recording. Alternatively or additionally, different voices could be separated, and/or only audio coming from certain angles of view (e.g., the angle of the user's face) could be collected, while other voices are ignored or suppressed. In other cases, if avatar process 300 determines that the noise levels are too high or will be difficult to process, process 300 could disable the recording option.

FIG. 4 illustrates an example flow diagram showing process 400 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, computing device 106 of FIG. 1 or another similar user device (e.g., one utilizing at least avatar process 300 of FIG. 3) may perform process 400 of FIG. 4.

At block 402, computing device 106 may capture video having an audio component. In some examples, the video and audio may be captured by two different hardware components (e.g., a camera may capture the video information while a microphone captures the audio information). However, in some instances, a single hardware component may be configured to capture both audio and video. In any event, the video and audio information may be associated with one another (e.g., by sharing an ID, a timestamp, or the like). As such, the video may have an audio component (e.g., they are part of the same file), or the video may be linked with an audio component (e.g., two files that are associated together).

At block 404, computing device 106 may extract facial features and audio features from the captured video and audio information, respectively. In some cases, the facial feature information may be extracted via avatar engine 308 and stored as metadata. The metadata can be used to map each facial feature to a particular puppet or to any animation or virtual face. Thus, the actual video file does not need to be stored, which yields memory storage efficiency and significant savings. Regarding audio feature extraction, a voice recognition algorithm can be utilized to extract different voice features, for example, words, phrases, pitch, speed, and the like.

At block 406, computing device 106 may detect context from the extracted features. For example, context may include a user's intent, mood, setting, location, background items, ideas, and the like. The context can be important when employing logic to determine what effects to apply. In some cases, the context can be combined with detected spoken words to determine whether and/or how to adjust the audio file and/or the video file. In one example, a user may furrow their eyebrows and speak slowly. The furrowing of the eyebrows is a video feature that could have been extracted at block 404, and the slow speech is an audio feature that could have been extracted at block 404. Individually, these two features might mean something different; combined, however, they allow the avatar process to determine that the user is concerned about something. In this case, the context of the message might be that a parent is speaking to a child, or that a friend is speaking to another friend about a serious or concerning matter.
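As a minimal sketch of the context detection at block 406, the example of furrowed eyebrows combined with slow speech can be written as a single conjunction. The function name, the returned labels, and the 100 words-per-minute threshold are assumptions made for illustration; the disclosure gives no numeric criteria.

```python
def detect_context(brow_furrowed: bool,
                   words_per_minute: float,
                   slow_wpm: float = 100.0) -> str:
    """Combine one video feature and one audio feature into a context label.

    `slow_wpm` is an assumed threshold, not a value from the disclosure.
    """
    speaking_slowly = words_per_minute < slow_wpm
    # Neither feature alone implies concern; only the combination does.
    if brow_furrowed and speaking_slowly:
        return "worried"
    return "neutral"
```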

At block 408, computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context. As noted above, one context may be concern. As such, particular video and/or audio features may be employed for this effect. For example, the voice file may be adjusted to sound more somber or to be slowed down. In other examples, the avatar-specific voice might be replaced with a version of the original (e.g., raw) audio to convey the seriousness of the message. Various other effects can be employed for various other contexts. In other examples, the context may be animal noises (e.g., based on the user saying "bark" or "meow" or the like). In this case, the determined effect would be to replace the spoken word "bark" with the sound of a dog barking.

At block 410, computing device 106 may perform additional logic for additional effects. For example, if the user attempted to achieve the bark effect by saying "bark" twice in a row, additional logic may need to be utilized to determine whether the additional bark is technically feasible. As an example, if the audio clip of the bark that is used to replace the spoken word in the raw audio information is 0.5 seconds long, but the user says "bark" twice in a 0.7 second span, the additional logic can determine that two bark sounds cannot fit in the 0.7 seconds available. Thus, the audio and video files may need to be extended to fit both bark sounds, the bark sound may need to be shortened (e.g., by processing the stored bark sound), or the second spoken word "bark" may need to be ignored.
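The feasibility check at block 410 reduces to comparing the time needed for the replacement sounds with the time available. The sketch below works through the 0.5 second clip and 0.7 second span from the text and shows the three listed remedies; the function name, the `strategy` parameter, and the returned dictionary layout are hypothetical.

```python
def resolve(word_count: int, window_s: float, clip_s: float,
            strategy: str = "ignore") -> dict:
    """Decide how `word_count` sound clips of `clip_s` seconds fit `window_s`."""
    needed_s = word_count * clip_s
    if needed_s <= window_s:
        # Everything fits; no remedy is required.
        return {"clips": word_count, "clip_s": clip_s, "window_s": window_s}
    if strategy == "extend":
        # Lengthen the audio and video so every clip fits.
        return {"clips": word_count, "clip_s": clip_s, "window_s": needed_s}
    if strategy == "shorten":
        # Time-compress each stored clip to share the window equally.
        return {"clips": word_count, "clip_s": window_s / word_count,
                "window_s": window_s}
    if strategy == "ignore":
        # Keep only as many whole clips as fit; drop the remaining words.
        return {"clips": int(window_s // clip_s), "clip_s": clip_s,
                "window_s": window_s}
    raise ValueError(f"unknown strategy: {strategy}")
```

For the example in the text, two barks need 1.0 second but only 0.7 seconds are available, so "ignore" keeps one bark, "extend" grows the span to 1.0 second, and "shorten" compresses each bark to 0.35 seconds.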

At block 412, computing device 106 may revise the audio and/or video information based at least in part on the determined effects and/or the additional effects. In some examples, only one set of effects may be used. In either case, the raw audio file may be adjusted (e.g., revised) to form a new audio file, with sounds added and/or removed. For example, in the "bark" use case, the spoken word "bark" will be removed from the audio file, and a new sound representing an actual dog barking will be inserted. The new file can be saved with a different ID or with an appended ID (e.g., the raw audio ID with a .v2 identifier to indicate that it is not the original). Additionally, the raw audio file will be saved separately so that it can be reused for additional avatars and/or if the user decides not to use the determined effects.
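A minimal sketch of the revision at block 412, assuming the audio is held as a plain list of samples and that the start and end indices of the spoken word are already known. Both helper names are hypothetical; only the ".v2" suffix comes from the text.

```python
def replace_span(samples: list, start: int, end: int, sound: list) -> list:
    """Return a NEW list with samples[start:end] replaced by `sound`.

    The input list is left untouched, so the raw audio remains available.
    """
    return samples[:start] + sound + samples[end:]


def revised_id(raw_id: str) -> str:
    """Append the .v2 marker mentioned in the text to the raw audio ID."""
    return f"{raw_id}.v2"
```

Because `replace_span` builds a new list instead of editing in place, the raw samples can still be reused for another avatar or played back without effects.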

At block 414, computing device 106 may receive a selection of an avatar from the user. The user may select one of a plurality of different avatars through the UI of the avatar application being executed by computing device 106. The avatars may be selected via a scroll wheel, a drop-down menu, or an icon menu (e.g., where each avatar is visible on the screen in its own position).

At block 416, computing device 106 may present the revised video with the revised audio based at least in part on the selected avatar. In this example, each adjusted video clip (e.g., a final clip for an avatar with adjusted audio and/or adjusted video) may be generated for each respective avatar before the user selects an avatar. This way, the processing has already been completed, and the adjusted video clip is ready to be presented immediately upon selection of the avatar. While this may require additional IPS prior to avatar selection, it speeds up the presentation. Additionally, the processing of each adjusted video clip can be performed while the user is reviewing the first preview (e.g., the preview that corresponds to the first/default avatar presented in the UI).

FIG. 5 illustrates an example flow diagram showing process 500 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, computing device 106 of FIG. 1 or another similar user device (e.g., one utilizing at least avatar process 300 of FIG. 3) may perform process 500 of FIG. 5.

At block 502, computing device 106 may capture video having an audio component. As in block 402 of FIG. 4, the video and audio may be captured by two different hardware components (e.g., a camera may capture the video information while a microphone captures the audio information). As noted, the video may have an audio component (e.g., they are part of the same file), or the video may be linked with an audio component (e.g., two files that are associated together).

At block 504, computing device 106 may extract facial features and audio features from the captured video and audio information, respectively. As above, the facial feature information may be extracted via avatar engine 308 and stored as metadata. The metadata can be used to map each facial feature to a particular puppet or to any animation or virtual face. Thus, the actual video file does not need to be stored, which yields memory storage efficiency and significant savings. Regarding audio feature extraction, a voice recognition algorithm can be utilized to extract different voice features, for example, words, phrases, pitch, speed, and the like. Additionally, in some examples, avatar engine 308 and/or voice engine 310 may perform the audio feature extraction.

At block 506, computing device 106 may detect context from the extracted features. For example, context may include a user's intent, mood, setting, location, ideas, identity, and the like. The context can be important when employing logic to determine what effects to apply. In some cases, the context can be combined with spoken words to determine whether and/or how to adjust the audio file and/or the video file. In one example, a user's age may be detected as the context (e.g., child, adult, etc.) based at least in part on facial and/or voice features. For example, a child's face may have particular features that can be identified (e.g., large eyes, a small nose, a relatively small head, etc.). As such, a child context may be detected.

At block 508, computing device 106 may receive a selection of an avatar from the user. The user may select one of a plurality of different avatars through the UI of the avatar application being executed by computing device 106. The avatars may be selected via a scroll wheel, a drop-down menu, or an icon menu (e.g., where each avatar is visible on the screen in its own position).

At block 510, computing device 106 may determine effects for rendering the audio and/or video files based at least in part on the context and the selected avatar. In this example, the effects for each avatar may be generated upon selection of that avatar, as opposed to all at once. In some instances, this enables significant processor and memory savings, because only one set of effects and one avatar rendering will be performed at a time. These savings are realized especially when the user does not select multiple avatars to preview.

At block 512, computing device 106 may perform additional logic for additional effects, similar to that described above with respect to block 410 of FIG. 4. At block 514, computing device 106 may revise the audio and/or video information based at least in part on the determined effects and/or additional effects for the selected avatar, similar to that described above with respect to block 412 of FIG. 4. At block 516, computing device 106 may present the revised video with the revised audio based at least in part on the selected avatar, similar to that described above with respect to block 416 of FIG. 4.

In some examples, avatar process 300 may determine whether to perform flow 400 or flow 500 based at least in part on historical information. For example, if the user generally uses the same avatar every time, flow 500 will be more efficient. However, if the user regularly switches between avatars and previews multiple different avatars per video clip, then following flow 400 may be more efficient.
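The choice between the two flows can be sketched as a simple heuristic over past behavior. In this illustration the history is a list holding the number of distinct avatars previewed for each past clip, and the cut-off of 1.5 is an assumed value; the disclosure does not define how the historical information is represented or weighed.

```python
def choose_flow(avatars_previewed_per_clip: list, threshold: float = 1.5) -> int:
    """Return 400 (pre-render every avatar) or 500 (render on selection).

    `threshold` is an assumed cut-off on the average number of avatars
    previewed per clip, not a value from the disclosure.
    """
    if not avatars_previewed_per_clip:
        # With no history, default to the cheaper on-demand flow.
        return 500
    average = sum(avatars_previewed_per_clip) / len(avatars_previewed_per_clip)
    return 400 if average > threshold else 500
```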

FIG. 6 illustrates example UI 600 for enabling a user to utilize the avatar application (e.g., corresponding to avatar application affordance 602). In some examples, UI 600 may look different until avatar application affordance 602 is selected (e.g., it may appear as a standard text (e.g., short messaging service (SMS)) messaging application). As noted, the avatar application can communicate with the avatar process (e.g., avatar process 300 of FIG. 3) to make requests for capturing, processing (e.g., extracting features, running logic, etc.), and adjusting audio and/or video. For example, when the user selects a record affordance (e.g., record/send video clip affordance 604), the avatar application may make an application programming interface (API) call to the avatar process to begin capturing video and audio information using the appropriate hardware components. In some examples, record/send video clip affordance 604 may be represented as a red circle (or a plain circle without the line shown in FIG. 6) before the recording session begins. This way, the affordance looks more like a standard record button. During the recording session, the appearance of record/send video clip affordance 604 may change to look like a clock countdown or another representation of a timer (e.g., if the length of video clip recordings is limited). However, in other examples, record/send video clip affordance 604 may only change colors to indicate that the avatar application is recording. If there is no timer or limit on the length of the recording, the user may need to select record/send video clip affordance 604 again to end the recording.

In some examples, the user may use the avatar selection affordance 606 to select an avatar. This can be done before and/or after recording the avatar video clip. When an avatar is selected before recording, the initial preview of the user's movements and facial characteristics will be presented as the selected avatar. Additionally, the recording will be performed while presenting a live (e.g., real-time) preview of the recording, with the user's face represented by the selected avatar. Once the recording is complete, a second preview (e.g., a replay of the actual recording) will be presented, again using the selected avatar. At this stage, however, the user can scroll through the avatar selection affordance 606 and select a new avatar with which to view the recording preview. In some cases, upon selection of a new avatar, the UI will begin previewing the recording using the selected avatar. The new preview can be presented with the audio/video effects or as originally recorded. As noted, the determination of whether to present the version with effects or the original may be based at least in part on the most recent playback method used. For example, if the most recent playback used the effects, the first playback after a new avatar selection may use the effects. If the most recent playback did not use the effects, however, the first playback after a new avatar selection may not use them. In some examples, the user can replay the video clip with effects by selecting the effects preview affordance 608, or without effects by selecting the original preview affordance 610. Once satisfied with the video clip (e.g., the message), the user can use the record/send video clip affordance 604 to send the avatar video in a message to another computing device. The video clip will be sent using the format corresponding to the most recent preview (e.g., with or without effects). At any time, if the user desires, the delete video clip affordance 612 may be selected to delete the avatar video and either start over or exit the avatar and/or messaging applications.
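The rule that a newly selected avatar is first previewed in the most recently used playback mode, and that the sent clip follows the most recent preview, can be sketched as a small state holder. This is a minimal sketch under stated assumptions: the class name, its fields, the returned dictionaries, and the initial default of playing with effects are all hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass
class PreviewState:
    """Hypothetical UI state for previewing a recorded avatar clip."""
    avatar: str
    last_playback_used_effects: bool = True  # assumed initial default

    def play(self, with_effects: bool) -> dict:
        """Play the clip (affordance 608 = with effects, 610 = original)
        and remember which mode was used."""
        self.last_playback_used_effects = with_effects
        return {"avatar": self.avatar, "effects": with_effects}

    def select_avatar(self, new_avatar: str) -> dict:
        """Switch avatars via affordance 606; the first playback reuses
        the most recent playback mode."""
        self.avatar = new_avatar
        return self.play(self.last_playback_used_effects)

    def send(self) -> dict:
        """The sent clip uses the format of the most recent preview."""
        return {"avatar": self.avatar,
                "effects": self.last_playback_used_effects}
```

Keeping the last playback mode as the only piece of remembered state means that avatar selection and sending both read from one place, which mirrors how the paragraph ties both behaviors to the most recent preview.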

FIG. 7 illustrates an example flow diagram showing a process (e.g., a computer-implemented method) 700 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, the computing device 106 of FIG. 1, or another similar user device (e.g., one utilizing at least an avatar application similar to that shown in FIG. 6 and the avatar process 300 of FIG. 3), may perform the process 700 of FIG. 7.

At block 702, the computing device 106 may display a virtual avatar generation interface. The virtual avatar generation interface may look similar to the UI illustrated in FIG. 6. However, any UI configured to enable the same features described herein can be used.

At block 704, the computing device 106 may display first preview content of a virtual avatar. In some examples, the first preview content may be a real-time representation of the user's face, including movement and facial expressions. However, the first preview presents an avatar (e.g., a cartoon character or a digital/virtual puppet) to represent the user's face instead of an image of the user's face. This first preview may be video only, or at least a rendering of the avatar without sound. In some examples, this first preview is not recorded and can be used for as long as the user desires, with no limit other than the battery power or memory space of the computing device 106.

At block 706, the computing device 106 may detect selection of an input (e.g., the record/send video clip affordance 604 of FIG. 6) in the virtual avatar generation interface. This selection may be made while the UI is displaying the first preview content.

At block 708, the computing device 106 may begin capturing video and audio signals based at least in part on the input detected at block 706. As described, the video and audio signals may be captured by appropriate hardware components, either by one such component or by a combination of them.

At block 710, the computing device 106 may extract audio feature characteristics and facial feature characteristics as described in detail above. As noted, the extraction may be performed by particular modules of the avatar process 300 of FIG. 3, or by other extraction and/or analysis components of the avatar application and/or the computing device 106.

At block 712, the computing device 106 may generate an adjusted audio signal based at least in part on the facial feature characteristics and the audio feature characteristics. For example, the audio file captured at block 708 may be permanently (or temporarily) revised (e.g., adjusted) to include new sounds, new words, and the like, and/or to have its original pitch, tone, volume, and the like adjusted. These adjustments can be made based at least in part on context detected through analysis of the facial feature characteristics and the audio feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or on particular movements, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.
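One way to picture the adjustment step at block 712 is as a rule-based planner that turns the selected avatar type and the extracted feature events into a list of audio edits. The following is only a sketch under stated assumptions: the lookup tables, event labels, avatar names, the half-second window, and the tuple format are invented for illustration and are not the disclosed implementation.

```python
# Hypothetical per-avatar pitch offsets, in semitones.
AVATAR_PITCH = {"dog": 2.0, "robot": -4.0}

# Hypothetical sounds inserted when an avatar shows a given expression.
AVATAR_SOUNDS = {("dog", "mouth_open"): "bark"}


def plan_adjustments(avatar_type, facial_events, voice_events, window=0.5):
    """Plan audio edits from extracted features.

    facial_events and voice_events are lists of (time_seconds, label).
    Returns a list of (kind, value, time_seconds) tuples.
    """
    plan = []
    # Adjustment based on the type of avatar selected.
    if avatar_type in AVATAR_PITCH:
        plan.append(("pitch_shift", AVATAR_PITCH[avatar_type], 0.0))

    # Context from both feature sets: insert a sound only where the
    # expression occurs and no speech was detected within the window.
    speech_times = [t for t, label in voice_events if label == "speech"]
    for t, expression in facial_events:
        sound = AVATAR_SOUNDS.get((avatar_type, expression))
        silent = all(abs(t - s) > window for s in speech_times)
        if sound and silent:
            plan.append(("insert_sound", sound, t))
    return plan
```

Returning a plan rather than editing samples directly keeps the original audio intact, which fits the text's note that the revision may be temporary as well as permanent.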

At block 714, the computing device 106 may generate second preview content of the virtual avatar in the UI according to the adjusted audio signal. The generated second preview content may be based at least in part on the currently selected avatar or some default avatar. Once the second preview content is generated, the computing device 106 may present the second preview content in the UI at block 716.

FIG. 8 illustrates an example flow diagram showing a process (e.g., instructions stored on a computer-readable memory that can be executed) 800 for implementing various audio and/or video effects based at least in part on audio and/or video features, according to at least a few embodiments. In some examples, the computing device 106 of FIG. 1, or another similar user device (e.g., one utilizing at least an avatar application similar to that shown in FIG. 6 and the avatar process 300 of FIG. 3), may perform the process 800 of FIG. 8.

At block 802, the computing device 106 may detect a request to generate an avatar video clip of a virtual avatar. In some examples, the request may be based at least in part on a user's selection of the record/send video clip affordance 604 of FIG. 6.

At block 804, the computing device 106 may capture a video signal associated with a face in the field of view of a camera. At block 806, the computing device 106 may capture an audio signal corresponding to the video signal (e.g., coming from the face being captured by the camera).

At block 808, the computing device 106 may extract voice feature characteristics from the audio signal, and at block 810, the computing device 106 may extract facial feature characteristics from the video signal.

At block 812, the computing device 106 may detect a request to preview the avatar video clip. This request may be based at least in part on a user's selection of a new avatar via the avatar selection affordance 606 of FIG. 6, or at least in part on a user's selection of the effects preview affordance 608 of FIG. 6.

At block 814, the computing device 106 may generate an adjusted audio signal based at least in part on the facial feature characteristics and the voice feature characteristics. For example, the audio file captured at block 806 may be revised (e.g., adjusted) to include new sounds, new words, and the like, and/or to have its original pitch, tone, volume, and the like adjusted. These adjustments can be made based at least in part on context detected through analysis of the facial feature characteristics and the voice feature characteristics. Additionally, the adjustments can be made based on the type of avatar selected and/or on particular movements, facial expressions, words, phrases, or actions performed by the user (e.g., expressed by the user's face) during the recording session.

At block 816, the computing device 106 may generate a preview of the virtual avatar in the UI according to the adjusted audio signal. The generated preview may be based at least in part on the currently selected avatar or some default avatar. Once the preview is generated, the computing device 106 may also present the second preview content in the UI at block 816.
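The ordering of blocks 802 through 816, in which features are extracted when the clip is captured and the adjusted audio is generated only when a preview is requested, can be sketched as follows. The class, its method names, and the injected callables are assumptions made for illustration; the actual extraction and adjustment logic is left to whatever the caller supplies.

```python
class AvatarClipSession:
    """Hypothetical outline of process 800 (blocks 802 through 816)."""

    def __init__(self, extract_voice, extract_face, adjust_audio):
        # The three callables stand in for extraction and adjustment
        # components; their signatures are assumed for this sketch.
        self._extract_voice = extract_voice
        self._extract_face = extract_face
        self._adjust_audio = adjust_audio
        self.audio = None
        self.voice_features = None
        self.face_features = None

    def record(self, video_frames, audio_samples):
        """Blocks 802-810: capture the signals and extract features."""
        self.audio = audio_samples
        self.voice_features = self._extract_voice(audio_samples)
        self.face_features = self._extract_face(video_frames)

    def preview(self, avatar="default"):
        """Blocks 812-816: on a preview request, adjust the audio and
        build the preview for the selected (or default) avatar."""
        if self.audio is None:
            raise RuntimeError("nothing has been recorded yet")
        adjusted = self._adjust_audio(
            self.audio, self.face_features, self.voice_features, avatar)
        return {"avatar": avatar, "audio": adjusted,
                "face": self.face_features}
```

Because the extracted features are stored on the session, a later request to preview with a different avatar can reuse them without capturing again.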

FIG. 9 is a simplified block diagram illustrating an example architecture 900 for implementing the features described herein, according to at least one embodiment. In some examples, a computing device 902 (e.g., the computing device 106 of FIG. 1) having the example architecture 900 may be configured to present relevant UIs, capture audio and video information, extract relevant data, perform logic, revise the audio and video information, and present Animoji videos.

The computing device 902 may be configured to execute or otherwise manage applications or instructions for performing the described techniques, such as, but not limited to, providing a user interface (e.g., the user interface 600 of FIG. 6) for recording, previewing, and/or sending virtual avatar video clips. The computing device 902 may receive inputs from a user at the user interface (e.g., utilizing I/O device(s) 904 such as a touch screen), capture information, process the information, and then present the video clips as previews, also utilizing the I/O device(s) 904 (e.g., a speaker of the computing device 902). The computing device 902 may be configured to revise audio and/or video files based at least in part on facial features extracted from the captured video and/or voice features extracted from the captured audio.

The computing device 902 may be any type of computing device such as, but not limited to, a mobile phone (e.g., a smartphone), a tablet computer, a personal digital assistant (PDA), a laptop computer, a desktop computer, a thin-client device, a smart watch, a wireless headset, or the like.

In one illustrative configuration, the computing device 902 may include at least one memory 914 and one or more processing units (or processor(s)) 916. The processor(s) 916 may be implemented as appropriate in hardware, computer-executable instructions, or combinations thereof. Computer-executable instruction or firmware implementations of the processor(s) 916 may include computer-executable or machine-executable instructions written in any suitable programming language to perform the various functions described.

The memory 914 may store program instructions that are loadable and executable on the processor(s) 916, as well as data generated during the execution of these programs. Depending on the configuration and type of the computing device 902, the memory 914 may be volatile (e.g., random access memory (RAM)) and/or non-volatile (e.g., read-only memory (ROM), flash memory, etc.). The computing device 902 may also include additional removable storage and/or non-removable storage 926 including, but not limited to, magnetic storage, optical disks, and/or tape storage. The disk drives and their associated non-transitory computer-readable media may provide non-volatile storage of computer-readable instructions, data structures, program modules, and other data for the computing device. In some implementations, the memory 914 may include multiple different types of memory, such as static random access memory (SRAM), dynamic random access memory (DRAM), or ROM. While the volatile memory described herein may be referred to as RAM, any volatile memory that would not maintain the data stored therein once unplugged from a host and/or power source would be appropriate.

The memory 914 and the additional storage 926, both removable and non-removable, are all examples of non-transitory computer-readable storage media. For example, non-transitory computer-readable storage media may include volatile or non-volatile, removable or non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. The memory 914 and the additional storage 926 are both examples of non-transitory computer storage media. Additional types of computer storage media that may be present in the computing device 902 may include, but are not limited to, phase-change RAM (PRAM), SRAM, DRAM, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital video disc (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the computing device 902. Combinations of any of the above should also be included within the scope of non-transitory computer-readable storage media.

Alternatively, computer-readable communication media may include computer-readable instructions, program modules, or other data transmitted within a data signal, such as a carrier wave, or other transmission. As used herein, however, computer-readable storage media do not include computer-readable communication media.

The computing device 902 may also contain communications connection(s) 928 that allow the computing device 902 to communicate with a data store, another computing device or server, user terminals, and/or other devices via one or more networks. Such networks may include any one or a combination of many different types of networks, such as cable networks, the Internet, wireless networks, cellular networks, satellite networks, other private and/or public networks, or any combination thereof. The computing device 902 may also include I/O device(s) 904, such as a touch input device, a keyboard, a mouse, a pen, a voice input device, a display, a speaker, a printer, and the like.

Turning to the contents of the memory 914 in more detail, the memory 914 may include an operating system 932 and/or one or more application programs or services for implementing the features disclosed herein, including a user interface module 934, an avatar control module 936, an avatar application module 938, and a messaging module 940. The memory 914 may also be configured to store one or more audio and video files to be used to produce audio and video output. In this way, the computing device 902 can perform all of the operations described herein.

In some examples, the user interface module 934 may be configured to manage the user interface of the computing device 902. For example, the user interface module 934 may present any number of various UIs requested by the computing device 902. In particular, the user interface module 934 may be configured to present the UI 600 of FIG. 6, which enables implementation of the features described herein, including communication with the avatar process 300 of FIG. 3. As described above, the avatar process 300 is responsible for capturing video and audio information, extracting the appropriate facial feature and voice feature information, and revising the video and audio information prior to presentation of the generated avatar video clips.

In some examples, the avatar control module 936 is configured to implement (e.g., execute instructions for implementing) the avatar process 300, while the avatar application module 938 is configured to implement the user-facing application. As noted above, the avatar application module 938 may utilize one or more APIs for requesting and/or providing information to the avatar control module 936.

In some embodiments, the messaging module 940 may implement any standalone or add-on messaging application that can communicate with the avatar control module 936 and/or the avatar application module 938. In some examples, the messaging module 940 may be fully integrated with the avatar application module 938 (e.g., as seen in the UI 600 of FIG. 6), such that the avatar application appears to be part of the messaging application. In other examples, however, the messaging module 940 may call into the avatar application module 938 when a user requests to generate an avatar video clip, and the avatar application module 938 may open a new application that is integrated with the messaging module 940.

The computing device 902 may also be equipped with a camera and a microphone, as shown at least in FIG. 3, and the processors 916 may be configured to execute instructions to display a first preview of a virtual avatar. In some examples, while displaying the first preview of the virtual avatar, an input may be detected via a virtual avatar generation interface presented by the user interface module 934. In some instances, in response to detecting the input in the virtual avatar generation interface, the avatar control module 936 may initiate a capture session that includes: capturing, via the camera, a video signal associated with a face in the field of view of the camera; capturing, via the microphone, an audio signal associated with the captured video signal; extracting audio feature characteristics from the captured audio signal; and extracting facial feature characteristics associated with the face from the captured video signal. Additionally, in response to detecting expiration of the capture session, the avatar control module 936 may generate an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics, and may display a second preview of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal.

Illustrative methods, computer-readable media, and systems for providing various techniques for adjusting audio and/or video content based at least in part on voice and/or facial feature characteristics are described above. Some or all of these systems and methods may, but need not, be implemented at least partially by architectures such as those shown at least in FIGS. 1 through 9 above. While many of the embodiments are described above with reference to messaging applications, it should be understood that any of the above techniques can be used within any type of application, including real-time video playback or real-time video messaging applications. For purposes of explanation, specific configurations and details are set forth in order to provide a thorough understanding of the examples. However, it should also be apparent to one skilled in the art that some examples may be practiced without the specific details. Furthermore, well-known features were sometimes omitted or simplified in order not to obscure the example being described.

The various embodiments can further be implemented in a wide variety of operating environments, which in some cases can include one or more user computers, computing devices, or processing devices that can be used to operate any of a number of applications. User or client devices can include any of a number of general-purpose personal computers, such as desktop or laptop computers running a standard operating system, as well as cellular, wireless, and handheld devices running mobile software and capable of supporting a number of networking and messaging protocols. Such a system can also include a number of workstations running any of a variety of commercially available operating systems and other known applications for purposes such as development and database management. These devices can also include other electronic devices, such as dummy terminals, thin clients, gaming systems, and other devices capable of communicating via a network.

Most embodiments utilize at least one network that would be familiar to those skilled in the art for supporting communications using any of a variety of commercially available protocols, such as TCP/IP, OSI, FTP, UPnP, NFS, CIFS, and AppleTalk. The network can be, for example, a local area network, a wide-area network, a virtual private network, the Internet, an intranet, an extranet, a public switched telephone network, an infrared network, a wireless network, and any combination thereof.

In embodiments utilizing a network server, the network server can run any of a variety of server or mid-tier applications, including HTTP servers, FTP servers, CGI servers, data servers, Java servers, and business application servers. The server(s) may also be capable of executing programs or scripts in response to requests from user devices, such as by executing one or more applications that may be implemented as one or more scripts or programs written in any programming language, such as Java®, C, C#, or C++, or any scripting language, such as Perl, Python, or TCL, as well as combinations thereof. The server(s) may also include database servers, including without limitation those commercially available from Oracle®, Microsoft®, Sybase®, and IBM®.

The environment can include a variety of data stores and other memory and storage media as discussed above. These can reside in a variety of locations, such as on a storage medium local to (and/or resident in) one or more of the computers, or remote from any or all of the computers across the network. In a particular set of embodiments, the information may reside in a storage-area network (SAN) familiar to those skilled in the art. Similarly, any necessary files for performing the functions attributed to the computers, servers, or other network devices may be stored locally and/or remotely, as appropriate. Where a system includes computerized devices, each such device can include hardware elements that may be electrically coupled via a bus, the elements including, for example, at least one central processing unit (CPU), at least one input device (e.g., a mouse, keyboard, controller, touch screen, or keypad), and at least one output device (e.g., a display device, printer, or speaker). Such a system may also include one or more storage devices, such as disk drives, optical storage devices, and solid-state storage devices such as RAM or ROM, as well as removable media devices, memory cards, flash cards, and the like.

Such devices may also include a computer-readable storage media reader, a communications device (e.g., a modem, a network card (wireless or wired), an infrared communication device, etc.), and working memory as described above. The computer-readable storage media reader may be connected with, or configured to receive, a non-transitory computer-readable storage medium representing remote, local, fixed, and/or removable storage devices, as well as storage media for temporarily and/or more permanently containing, storing, transmitting, and retrieving computer-readable information. The system and the various devices will also typically include a number of software applications, modules, services, or other elements located within at least one working memory device, including an operating system and application programs, such as a client application or browser. It should be appreciated that alternative embodiments may have numerous variations from those described above. For example, customized hardware may also be used, and/or particular elements may be implemented in hardware, software (including portable software, such as applets), or both. Further, connections to other computing devices, such as network input/output devices, may be employed.

Non-transitory storage media and computer-readable storage media for containing code, or portions of code, may include any appropriate media known or used in the art (excluding transitory media such as carrier waves), such as, but not limited to, volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data, including RAM, ROM, Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, CD-ROM, DVD or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a system device. Based on the disclosure and teachings provided herein, a person of ordinary skill in the art will appreciate other ways and/or methods to implement the various embodiments. However, as noted above, computer-readable storage media do not include transitory media such as carrier waves or the like.

The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope of the present disclosure as set forth in the claims.

Other variations are within the spirit of the present disclosure. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the present disclosure to the specific form or forms disclosed; on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope of the present disclosure, as defined in the appended claims.

The use of the terms "a," "an," and "the" and similar referents in the context of describing the disclosed embodiments (especially in the context of the following claims) is to be construed to cover both the singular and the plural, unless otherwise indicated herein or clearly contradicted by context. The terms "comprising," "having," "including," and "containing" are to be construed as open-ended terms (i.e., meaning "including, but not limited to") unless otherwise noted. The term "connected" is to be construed as partly or wholly contained within, attached to, or joined together, even if there is something intervening. The phrase "based on" should be understood to be open-ended and not limiting in any way, and is intended to be interpreted or otherwise read as "based at least in part on," where appropriate. Recitation of ranges of values herein is merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. All methods described herein can be performed in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., "such as") provided herein, is intended merely to better illuminate embodiments of the present disclosure and does not pose a limitation on the scope of the present disclosure unless otherwise claimed. No language in the specification should be construed as indicating any non-claimed element as essential to the practice of the present disclosure.

Disjunctive language such as the phrase "at least one of X, Y, or Z," unless specifically stated otherwise, is otherwise understood within the context as used in general to present that an item, term, etc., may be either X, Y, or Z, or any combination thereof (e.g., X, Y, and/or Z). Thus, such disjunctive language is not generally intended to, and should not, imply that certain embodiments require at least one of X, at least one of Y, or at least one of Z to each be present. Additionally, conjunctive language such as the phrase "at least one of X, Y, and Z" should also be understood to mean X, Y, Z, or any combination thereof, including "X, Y, and/or Z."

Preferred embodiments of the present disclosure are described herein, including the best mode known to the inventors for carrying out the present disclosure. Variations of those preferred embodiments may become apparent to those of ordinary skill in the art upon reading the foregoing description. The inventors expect skilled artisans to employ such variations as appropriate, and the inventors intend for the present disclosure to be practiced otherwise than as specifically described herein. Accordingly, the present disclosure includes all modifications and equivalents of the subject matter recited in the claims appended hereto, as permitted by applicable law. Moreover, any combination of the above-described elements in all possible variations thereof is encompassed by the present disclosure unless otherwise indicated herein or otherwise clearly contradicted by context.

All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated to be incorporated by reference and were set forth in its entirety herein.

Claims

A method, comprising:
at an electronic device having at least a camera and a microphone:
displaying a virtual avatar generation interface;
displaying first preview content of a virtual avatar in the virtual avatar generation interface, the first preview content of the virtual avatar corresponding to real-time preview video frames of a user headshot within a field of view of the camera and to associated changes in the appearance of the headshot;
detecting an input in the virtual avatar generation interface while displaying the first preview content of the virtual avatar;
in response to detecting the input in the virtual avatar generation interface:
capturing, via the camera, a video signal associated with the user headshot during a recording session;
capturing, via the microphone, a user audio signal during the recording session;
extracting audio feature characteristics from the captured user audio signal; and
extracting facial feature characteristics associated with the face from the captured video signal; and
in response to detecting expiration of the recording session:
generating an adjusted audio signal from the captured audio signal based at least in part on the facial feature characteristics and the audio feature characteristics;
generating second preview content of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal; and
presenting the second preview content in the virtual avatar generation interface.
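
By way of illustration only, the two-phase flow recited above (capture and extraction during the recording session, then audio adjustment once the session expires) might be sketched as follows. The class `RecordingSession`, the feature name `mouth_open`, the 1.5 gain, and the 0.5 threshold are all hypothetical and are not taken from the disclosure; the adjustment rule is an assumed toy rule, not the claimed technique itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RecordingSession:
    """Hypothetical container for data gathered during one recording session."""
    facial_features: List[Dict[str, float]] = field(default_factory=list)
    audio_samples: List[float] = field(default_factory=list)

    def capture(self, frame_features: Dict[str, float], audio_chunk: List[float]) -> None:
        # On a real device these values would come from the camera and microphone.
        self.facial_features.append(frame_features)
        self.audio_samples.extend(audio_chunk)


def extract_audio_features(samples: List[float]) -> Dict[str, float]:
    """Toy audio feature extraction: only the peak absolute level."""
    return {"peak_level": max((abs(s) for s in samples), default=0.0)}


def generate_adjusted_audio(session: RecordingSession) -> List[float]:
    """Run after the session expires; uses facial and audio features together."""
    samples = session.audio_samples
    if not session.facial_features:
        return list(samples)
    audio_features = extract_audio_features(samples)
    mean_open = sum(f.get("mouth_open", 0.0) for f in session.facial_features)
    mean_open /= len(session.facial_features)
    # Assumed rule: a wide-open mouth selects a louder voice.
    gain = 1.5 if mean_open > 0.5 else 1.0
    # The audio feature caps the gain so the boosted signal stays within [-1, 1].
    peak = audio_features["peak_level"]
    if peak > 0.0:
        gain = min(gain, 1.0 / peak)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]
```

In this sketch the adjusted signal would then be paired with the stored facial features to render the second preview; rendering itself is outside the scope of the example.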

The method of claim 1, further comprising storing facial feature metadata associated with the facial feature characteristics extracted from the video signal, and storing audio metadata associated with the audio feature characteristics extracted from the audio signal.

The method of claim 2, further comprising generating adjusted facial feature metadata from the facial feature metadata based at least in part on the facial feature characteristics and the audio feature characteristics.

The method of claim 3, wherein the second preview content of the virtual avatar is further displayed according to the adjusted facial feature metadata.

An electronic device, comprising:
a camera;
a microphone; and
one or more processors in communication with the camera and the microphone, the one or more processors configured to:
while displaying a first preview of a virtual avatar, detect an input in a virtual avatar generation interface;
in response to detecting the input in the virtual avatar generation interface, initiate a capture session comprising:
capturing, via the camera, a video signal associated with a face within a field of view of the camera;
capturing, via the microphone, an audio signal associated with the captured video signal;
extracting audio feature characteristics from the captured audio signal; and
extracting facial feature characteristics associated with the face from the captured video signal; and
in response to detecting expiration of the capture session:
generate an adjusted audio signal based at least in part on the audio feature characteristics and the facial feature characteristics; and
display a second preview of the virtual avatar in the virtual avatar generation interface according to the facial feature characteristics and the adjusted audio signal.

The electronic device of claim 5, wherein the audio signal is further adjusted based at least in part on a type of the virtual avatar.

The electronic device of claim 6, wherein the type of the virtual avatar is received based at least in part on an avatar type selection affordance presented in the virtual avatar generation interface.

The electronic device of claim 6, wherein the type of the virtual avatar comprises an animal type, and the adjusted audio signal is generated based at least in part on a predetermined sound associated with the animal type.

The electronic device of claim 5, wherein the one or more processors are further configured to determine whether a portion of the audio signal corresponds to the face within the field of view.

The electronic device of claim 9, wherein the one or more processors are further configured to store the portion of the audio signal for use in generating the adjusted audio signal upon determining that the portion of the audio signal corresponds to the face.

The electronic device of claim 9, wherein the one or more processors are further configured to discard at least the portion of the audio signal upon determining that the portion of the audio signal does not correspond to the face.
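
As a non-limiting sketch, the store-or-discard decision recited in the three preceding claims could look like the following. The claims do not state how correspondence between an audio portion and the face is determined; the mouth-movement heuristic, the function name `filter_audio_portions`, and the 0.2 threshold are assumptions made purely for illustration.

```python
from typing import List, Sequence


def filter_audio_portions(
    portions: Sequence[List[float]],
    mouth_open_per_portion: Sequence[float],
    threshold: float = 0.2,
) -> List[List[float]]:
    """Keep the audio portions judged to correspond to the face; drop the rest.

    Assumed heuristic: a portion corresponds to the face when the mouth
    opening measured over the same time span reaches the threshold.
    """
    if len(portions) != len(mouth_open_per_portion):
        raise ValueError("each audio portion needs one mouth-opening value")
    kept = []
    for portion, mouth_open in zip(portions, mouth_open_per_portion):
        if mouth_open >= threshold:
            kept.append(portion)  # stored for use in generating the adjusted audio
        # otherwise the portion is discarded (e.g., background speech)
    return kept
```

For example, with three portions and mouth-opening values of 0.5, 0.0, and 0.3, only the first and third portions would be retained.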

The electronic device of claim 5, wherein the audio feature characteristics include speech features associated with the face within the field of view.

The electronic device of claim 5, wherein the one or more processors are further configured to store facial feature metadata associated with the facial feature characteristics extracted from the video signal.

The electronic device of claim 13, wherein the one or more processors are further configured to generate adjusted facial metadata based at least in part on the facial feature characteristics and the audio feature characteristics.

The electronic device of claim 14, wherein the second preview of the virtual avatar is generated according to the adjusted facial metadata and the adjusted audio signal.

A computer-readable storage medium storing computer-executable instructions that, when executed by one or more processors, configure the one or more processors to perform operations comprising:
in response to detecting a request to generate an avatar video clip of a virtual avatar:
capturing, via a camera of an electronic device, a video signal associated with a face within a field of view of the camera;
capturing an audio signal via a microphone of the electronic device;
extracting speech feature characteristics from the captured audio signal; and
extracting facial feature characteristics associated with the face from the captured video signal; and
in response to detecting a request to preview the avatar video clip:
generating an adjusted audio signal based at least in part on the facial feature characteristics and the speech feature characteristics; and
displaying a preview of the video clip of the virtual avatar using the adjusted audio signal.

The computer-readable storage medium of claim 16, wherein the audio signal is adjusted based at least in part on facial expressions identified in the facial feature characteristics associated with the face.

The computer-readable storage medium of claim 16, wherein the adjusted audio signal is further adjusted by inserting one or more pre-stored audio samples.
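
Purely as an illustration, inserting a pre-stored audio sample into an adjusted audio signal can be sketched as a splice at a chosen sample index. The function name `insert_prestored_sample` and the choice of insertion point are hypothetical; the disclosure does not prescribe where or how the samples are inserted.

```python
from typing import List


def insert_prestored_sample(
    signal: List[float], sample: List[float], at_index: int
) -> List[float]:
    """Return a new signal with `sample` spliced in before position `at_index`."""
    if not 0 <= at_index <= len(signal):
        raise ValueError("at_index must lie within the signal")
    # The original signal is left unmodified; a new list is returned.
    return signal[:at_index] + sample + signal[at_index:]
```

A variant could instead mix the stored sample over the existing audio rather than lengthening the signal; the splice shown here is only the simplest option.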

The computer-readable storage medium of claim 16, wherein the audio signal is adjusted based at least in part on a level, pitch, duration, variable playback rate, speech spectral-formant positions, speech spectral-formant levels, instantaneous playback rate, or changes in speech associated with the face.
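
Two of the listed adjustments, level and playback rate, are simple enough to sketch directly. The following is a minimal example under stated assumptions: the function names are hypothetical, and the naive linear-interpolation resampler changes pitch and duration together, which is only one of many possible ways to vary playback rate and is not asserted to be the approach used in the disclosure.

```python
from typing import List


def adjust_level(samples: List[float], gain: float) -> List[float]:
    """Scale every sample by a constant gain."""
    return [s * gain for s in samples]


def change_playback_rate(samples: List[float], rate: float) -> List[float]:
    """Resample by linear interpolation.

    rate > 1 shortens the signal and raises its pitch;
    rate < 1 lengthens the signal and lowers its pitch.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not samples:
        return []
    n_out = max(1, int(len(samples) / rate))
    out = []
    for i in range(n_out):
        pos = i * rate                        # fractional read position in the input
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)    # clamp at the final sample
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out
```

Adjusting pitch independently of duration, or shifting formant positions and levels, would require a more involved technique (for example, a phase vocoder) that is beyond this sketch.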

The computer-readable storage medium of claim 16, wherein the operations further comprise transmitting the video clip of the virtual avatar to another electronic device.