KR20220034396A

KR20220034396A - Device, method and computer program for generating face video

Info

Publication number: KR20220034396A
Application number: KR1020200116702A
Authority: KR
Inventors: 김민철
Original assignee: 주식회사 케이티
Priority date: 2020-09-11
Filing date: 2020-09-11
Publication date: 2022-03-18

Abstract

A device for generating a face image comprises: an input unit that receives voice data and an original face image of a speaker; a voice feature information extraction unit that analyzes the voice data and extracts voice feature information; a face image extraction unit that extracts a face image through a face image storage unit based on the extracted voice feature information; a learning unit that trains a face generation model based on the voice data and the extracted face image; and a generation unit that generates a face image for unidentified voice data through the trained face generation model. The invention can thereby generate a realistic face image of the speaker with high accuracy.

Description

DEVICE, METHOD AND COMPUTER PROGRAM FOR GENERATING FACE VIDEO

The present invention relates to an apparatus, a method, and a computer program for generating a face image based on voice data.

In general, a method of generating a face image from voice data converts the voice data into a frequency image and generates the face image from the converted frequency image information alone, producing an arbitrary animated face rather than the face of the actual speaker.

Accordingly, with the prior art it is difficult to generate an accurate image of the speaker's actual face, or to infer information about the speaker, from voice data alone. In this regard, Korean Patent Application Publication No. 10-2019-0046371, a prior-art document, discloses an apparatus and method for generating facial expressions.

The conventional facial expression generating apparatus estimates the vowels contained in an input speaker's voice and, according to the estimated vowels, weights and combines a plurality of predefined standard expressions to generate a facial expression for a virtual character.

However, as described above, the conventional facial expression generating apparatus merely generates a facial expression for a virtual character, not an image of the speaker's actual face, by combining predefined standard expressions using only the vowel information contained in the speaker's voice. It therefore has difficulty generating the speaker's actual face image and information about the speaker.

Korean Patent Application Publication No. 10-2019-0046371 (published May 7, 2019)
Korean Registered Patent Publication No. 10-2096598 (registered March 27, 2020)

The present invention is intended to solve the above-described problems of the prior art, and aims to provide an apparatus, a method, and a computer program for generating a face image that can generate an actual face image of a speaker based on the result of analyzing voice data.

In addition, the present invention aims to provide an apparatus, a method, and a computer program for generating a face image that also analyze the utterance content contained in the voice data, so as not only to generate the speaker's actual face image but also to derive the speaker's psychological state and generate speaker information, on the basis of which a non-face-to-face speaker (a suspect) can be tracked (apprehended).

However, the technical problems to be addressed by the present embodiment are not limited to those described above, and other technical problems may exist.

As a means for achieving the above-described technical objects, one embodiment of the present invention may provide an apparatus for generating a face image based on voice data, the apparatus comprising: an input unit that receives voice data and an original face image of a speaker; a voice feature information extraction unit that analyzes the voice data to extract voice feature information; a face image extraction unit that extracts a face image through a face image storage unit based on the extracted voice feature information; a learning unit that trains a face generation model based on the voice data and the extracted face image; and a generation unit that generates a face image for unidentified voice data through the trained face generation model.

Another embodiment of the present invention may provide a method for generating a face image based on voice data, the method comprising: receiving voice data and an original face image of a speaker; analyzing the voice data to extract voice feature information; extracting a face image through a face image storage unit based on the extracted voice feature information; training a face generation model based on the voice data and the extracted face image; and generating a face image for unidentified voice data through the trained face generation model.

Yet another embodiment of the present invention may provide a computer program stored in a computer-readable recording medium and including a sequence of instructions for generating a face image based on voice data, wherein the computer program, when executed by a computing device, includes a sequence of instructions that cause the computing device to: receive voice data and an original face image of a speaker; analyze the voice data to extract voice feature information; extract a face image through a face image storage unit based on the extracted voice feature information; train a face generation model based on the voice data and the extracted face image; and generate a face image for unidentified voice data through the trained face generation model.

The above-described means for solving the problems are merely exemplary and should not be construed as limiting the present invention. In addition to the exemplary embodiments described above, additional embodiments may exist as described in the drawings and the detailed description.

According to any one of the above-described means, the present invention can provide an apparatus, a method, and a computer program for generating a face image that analyze collected voice data and generate physical information and emotion information about the speaker, thereby generating, with high accuracy, an actual face image of the speaker that reflects the speaker's facial expression.

In addition, the face image generated through the face generation model is compared with the speaker's original face image, the result of that comparison is in turn compared with the speaker's voice feature information, and the results are reflected in the training of the face generation model when the two analysis results agree. The present invention can thereby provide an apparatus, a method, and a computer program for generating a face image that generate a more accurate face image of the speaker through the face generation model.

In addition, by analyzing the utterance information contained in the collected voice data to determine the speaker's psychological state and generate speaker information, the present invention can provide an apparatus, a method, and a computer program for generating a face image that supply information with which a non-face-to-face speaker (a suspect) can be tracked (apprehended).

FIG. 1 is a block diagram of an apparatus for generating a face image according to an embodiment of the present invention.
FIG. 2 is an exemplary diagram illustrating a voice feature information extraction unit according to an embodiment of the present invention.
FIG. 3 is an exemplary diagram illustrating a face image extraction unit according to an embodiment of the present invention.
FIG. 4 is an exemplary diagram illustrating a learning unit according to an embodiment of the present invention.
FIG. 5 is an exemplary diagram illustrating a generation unit according to an embodiment of the present invention.
FIG. 6 is a flowchart of a method for generating a face image according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention pertains can readily practice them. The present invention may, however, be embodied in many different forms and is not limited to the embodiments described herein. To describe the present invention clearly, parts of the drawings unrelated to the description are omitted, and like reference numerals denote like parts throughout the specification.

Throughout the specification, when a part is said to be "connected" to another part, this includes not only the case where it is "directly connected" but also the case where it is "electrically connected" with another element interposed between them. In addition, when a part is said to "include" a component, this means that, unless specifically stated otherwise, it may further include other components rather than excluding them, and it should be understood that the presence or addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof is not precluded in advance.

In this specification, a "unit" includes a unit realized by hardware, a unit realized by software, and a unit realized using both. One unit may be realized using two or more pieces of hardware, and two or more units may be realized by a single piece of hardware.

In this specification, some of the operations or functions described as being performed by a terminal or device may instead be performed by a server connected to that terminal or device. Likewise, some of the operations or functions described as being performed by a server may be performed by a terminal or device connected to that server.

Hereinafter, an embodiment of the present invention is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of an apparatus for generating a face image according to an embodiment of the present invention. Referring to FIG. 1, the face image generating apparatus 100 may include an input unit 110, a voice feature information extraction unit 120, a face image extraction unit 130, a learning unit 140, and a generation unit 150. FIG. 1 shows, by way of example, components that can be controlled by the face image generating apparatus 100.

The components of the face image generating apparatus 100 of FIG. 1 are generally connected through a network. A network is a connection structure in which information can be exchanged between nodes such as terminals and servers, and includes a local area network (LAN), a wide area network (WAN), the Internet (WWW: World Wide Web), wired and wireless data communication networks, telephone networks, and wired and wireless television networks. Examples of wireless data communication networks include, but are not limited to, 3G, 4G, 5G, 3GPP (3rd Generation Partnership Project), LTE (Long Term Evolution), WiMAX (Worldwide Interoperability for Microwave Access), Wi-Fi, Bluetooth communication, infrared communication, ultrasonic communication, visible light communication (VLC), and LiFi.

The input unit 110 according to an embodiment of the present invention may receive voice data and an original face image of a speaker. For example, the input unit 110 may pass the received voice data of the speaker to the voice feature information extraction unit 120.

FIG. 2 is an exemplary diagram illustrating a voice feature information extraction unit according to an embodiment of the present invention. Referring to FIG. 2, the voice feature information extraction unit 120 may analyze the voice data 20 to extract voice feature information 230.

The voice feature information extraction unit 120 may convert the voice data 20 into a frequency image 210 and analyze the converted frequency image 210 to extract the voice feature information 230. For example, the voice feature information extraction unit 120 may convert the voice data 20 into Mel (mel-scaled power spectrogram), MFCC (mel-frequency cepstral coefficients), Chroma (chromagram computed from a power spectrogram), Contrast (spectral contrast), Tonnetz (tonal centroid features), or the like, and use the result as input to a voice analysis model 220. Mel, MFCC, Chroma, Contrast, and Tonnetz are only examples; other kinds of features may be used depending on the voice analysis model 220.
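As a rough illustration of the mel-scaled representation mentioned above, the following Python sketch converts one audio frame into band energies spaced evenly on the mel scale. It is a simplification and not the method of the embodiment: the function names (`hz_to_mel`, `mel_band_energies`), the rectangular non-overlapping bands, and the naive DFT are all assumptions made for illustration. A practical system would more likely rely on an audio library with triangular mel filters.

```python
import cmath
import math


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to the mel scale (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame: list) -> list:
    """Naive DFT power spectrum of one frame, for bins 0 .. N/2."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_band_energies(frame: list, sample_rate: float, n_bands: int) -> list:
    """Sum power-spectrum bins into n_bands bands equally wide on the mel scale.

    Assumption: rectangular bands are used instead of triangular mel filters.
    """
    spec = power_spectrum(frame)
    n = len(frame)
    max_mel = hz_to_mel(sample_rate / 2.0)
    edges_hz = [mel_to_hz(max_mel * i / n_bands) for i in range(n_bands + 1)]
    energies = [0.0] * n_bands
    for k, power in enumerate(spec):
        freq = k * sample_rate / n
        for b in range(n_bands):
            is_last = b == n_bands - 1
            in_band = edges_hz[b] <= freq < edges_hz[b + 1]
            # The Nyquist bin sits on the top edge, so it goes into the last band.
            if in_band or (is_last and freq >= edges_hz[b + 1]):
                energies[b] += power
                break
    return energies
```

Stacking such band-energy vectors over successive frames gives a two-dimensional array that can be treated as a frequency image.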

Referring to FIG. 2, the voice feature information extraction unit 120 may apply various types of voice analysis models 220 to information extracted from the voice data 20, such as the frequency band of the voice, the intensity of the voice, and the utterance time, to analyze the speaker's age, gender, race, utterance content, emotion, and the like, and output the voice feature information 230. The voice analysis models 220 shown in FIG. 2 are examples; voice analysis models 220 that analyze other information may be added or removed as needed. By analyzing the utterance content together with the intensity, pitch, frequency range, and the like of the voice contained in the voice data 20, and relying on the extracted voice feature information 230, the voice feature information extraction unit 120 according to an embodiment of the present invention allows a face image reflecting the speaker's facial expression to be generated more accurately.

The voice feature information 230 according to an embodiment of the present invention may include physical information 231 of the speaker, which includes information on at least one of the speaker's gender, age, and race. For example, based on the converted frequency image 210, the voice feature information extraction unit 120 may use voice analysis models 220 such as an age analysis model, a gender analysis model, and a race analysis model to analyze the pitch and intensity of the voice, and thereby extract information on the speaker's gender (e.g., male), age (e.g., twenties), and race (e.g., Asian).

The voice feature information 230 may also include emotion information 232 about the speaker. For example, based on the converted frequency image 210, the voice feature information extraction unit 120 may use voice analysis models 220 such as an emotion analysis model and an utterance content analysis model to analyze the pitch and intensity of the voice and the utterance content contained in the voice data 20, and thereby analyze the speaker's emotion (e.g., happiness). Specifically, as one example, when the voice data 20 contains the utterance "I was really happy today," the voice feature information extraction unit 120 may analyze not only the utterance content but also the pitch, intensity, and utterance time of the voice to determine whether the utterance is meant in the opposite sense (that is, ironically), and on that basis may analyze the speaker's emotion at the time of the utterance.
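The idea of checking the wording of an utterance against its delivery can be sketched as follows. This is a toy illustration only: the source does not specify how the comparison is made, so the two scores, the `margin` threshold, the 0.2 cut-offs, and the rule that delivery overrides wording when they contradict are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UtteranceScores:
    # Both scores are hypothetical model outputs in [-1.0, 1.0];
    # negative values stand for a negative emotion.
    content: float  # from an utterance content analysis model (what was said)
    prosody: float  # from pitch, intensity, and utterance time (how it was said)


def is_contrary(scores: UtteranceScores, margin: float = 0.5) -> bool:
    """Flag an utterance whose wording and delivery point in opposite directions."""
    opposite_signs = scores.content * scores.prosody < 0
    return opposite_signs and abs(scores.content - scores.prosody) >= margin


def estimate_emotion(scores: UtteranceScores) -> str:
    """Assumed rule: trust delivery over wording when the two contradict each other."""
    if is_contrary(scores):
        value = scores.prosody
    else:
        value = (scores.content + scores.prosody) / 2
    if value > 0.2:
        return "positive"
    if value < -0.2:
        return "negative"
    return "neutral"
```

Under these assumptions, "I was really happy today" spoken in a flat, low voice (for instance `UtteranceScores(content=0.8, prosody=-0.6)`) would be flagged as contrary and labeled negative.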

FIG. 3 is an exemplary diagram illustrating a face image extraction unit according to an embodiment of the present invention. Referring to FIG. 3, the face image extraction unit 130 may extract a face image 330 through a face image storage unit 320 based on the extracted voice feature information 310.

The face image extraction unit 130 may generate the face image 330 by synthesizing one or more pieces of facial feature information 321 in the face image storage unit 320 that match the voice feature information 310.

The face image storage unit 320 according to an embodiment of the present invention may contain face image information corresponding to skin color, face shape, and eye color appropriate to race; wrinkles appropriate to age; face shape and hairstyle appropriate to gender; and mouth shape and facial expression appropriate to emotion and utterance content.

For example, the face image extraction unit 130 may use the extracted physical information 311 and emotion information 312 of the speaker to list, identify, and retrieve a plurality of matching pieces of facial feature information 321 from the face image storage unit 320, and then synthesize them to generate the face image 330. Specifically, referring to FIG. 3, as one example, the face image extraction unit 130 may retrieve facial feature information 321 such as facial skin and wrinkles from the twenties face information in the face image storage unit 320 based on the age information (e.g., twenties) in the physical information 311, and may select facial feature information 321 such as a male nose, mouth, face size, and eye shape from the male face information in the face image storage unit 320 based on the gender information (e.g., male) in the physical information 311.

As another example, when the extracted physical information 311 indicates that the speaker is an Asian woman in her seventies, the face image extraction unit 130 may select, from the face image storage unit 320, facial feature information 321 matching the wrinkles, skin color, eye color, hair color, face shape, and the like of an Asian woman in her seventies.

In addition, referring to FIG. 3, as one example, when the extracted emotion information 312 of the speaker indicates sadness, the face image extraction unit 130 may use facial feature information 321 from the face image storage unit 320 that contains coordinate information for the eyes, nose, mouth, and face pose of an expression that can appear when a person is sad.

In this way, the face image extraction unit 130 may retrieve from the face image storage unit 320 the facial feature information 321 that matches the extracted physical information 311 and emotion information 312 of the speaker, and synthesize it to generate the face image 330.
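The retrieval step described above can be pictured as a keyed lookup. The sketch below is only an illustration under assumed data structures: `FACE_STORE`, its keys, and its placeholder values are hypothetical stand-ins for the face image storage unit 320, and the actual synthesis of an image from the retrieved features is not shown.

```python
# Hypothetical contents of the face image storage unit, keyed by (attribute, value).
FACE_STORE = {
    ("age", "20s"): {"skin": "smooth", "wrinkles": "none"},
    ("gender", "male"): {"nose": "male_nose_01", "mouth": "male_mouth_01"},
    ("race", "asian"): {"skin_tone": "tone_03", "eye_color": "dark_brown"},
    ("emotion", "sad"): {"expression": "sad_landmarks_01"},
}


def retrieve_facial_features(voice_features: dict) -> dict:
    """Collect every stored facial feature that matches an extracted voice feature."""
    matched = {}
    for attribute, value in voice_features.items():
        # Attributes with no stored counterpart simply contribute nothing.
        matched.update(FACE_STORE.get((attribute, value), {}))
    return matched
```

With this structure, a query carrying more voice attributes (for example age, gender, and emotion rather than age alone) returns more facial features to synthesize from.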

According to an embodiment of the present invention, the face image extraction unit 130 may retrieve facial feature information 321 from the face image storage unit 320 in proportion to the number of pieces of voice feature information 310 extracted from the voice data. The more voice feature information 310 is extracted from the voice data, the more facial feature information 321 for that voice data can be retrieved from the face image storage unit 320, so the speaker's face image can be synthesized more accurately from the retrieved facial feature information 321.

FIG. 4 is an exemplary diagram illustrating a learning unit according to an embodiment of the present invention. Referring to FIG. 4, the learning unit 140 may train a face generation model 420 based on the voice data and the extracted face image 412. The learning unit 140 may check whether the face image 430 generated through the face generation model 420 was generated properly and reflect the result of that check in the training of the face generation model 420, so that the face generation model 420 learns to generate a more accurate face of the speaker.

For example, the learning unit 140 may combine a frequency image 411 converted from the speaker's voice data with the face image 412 extracted by the face image extraction unit 130 based on the speaker's voice data, to produce an input image 410 for the face generation model 420. The learning unit 140 may feed the input image 410 to the face generation model 420 and train it to generate the face image 430. Here, the learning unit 140 may use a generative adversarial network (GAN) model, and the GAN may be replaced with any of several model variants. A GAN is a machine learning approach in which a generative model and a discriminative model compete so as to automatically produce images, video, audio, and the like that are close to real ones; it is a deep learning image generation model widely used for generating face images.
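The competition between the generative and discriminative models can be made concrete with the loss terms commonly used for GAN training. The sketch below shows the usual binary cross-entropy form with a non-saturating generator loss; the source does not state which GAN variant or loss the embodiment uses, so this is an assumed textbook formulation, and the function names are illustrative.

```python
import math


def discriminator_loss(d_real: list, d_fake: list) -> float:
    """Binary cross-entropy: real images should score near 1, generated ones near 0.

    d_real and d_fake hold the discriminator's probabilities (strictly between
    0 and 1) for original face images and generated face images respectively.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)


def generator_loss(d_fake: list) -> float:
    """Non-saturating generator loss: the generator wants its images to score near 1."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)
```

The two losses pull in opposite directions on the same scores: a generated image that the discriminator rates highly lowers the generator loss and raises the discriminator loss.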

The learning unit 140 may reflect the result of a comparison between the face image 430 generated by the face generation model 420 and the speaker's original face image 440. For example, the learning unit 140 may perform image analysis 450 on the face image 430 generated by the face generation model 420 and on the speaker's original face image 440.

Specifically, the learning unit 140 may match deep learning feature values extracted from the face image 430 generated by the face generation model 420 against those extracted from the speaker's original face image 440, and reflect the difference between the feature values in training through a loss function. The learning unit 140 may analyze the generated face image 430 using various types of image deep learning models, such as an age analysis model and a gender analysis model. Although FIG. 4 shows an age analysis model, a gender analysis model, a race analysis model, and an expression analysis model as examples of the models used in the image analysis 450, models may be added or removed in various ways depending on the voice feature information 460.
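One simple way to turn a difference between feature values into a loss is the mean squared difference between the two feature vectors. The sketch below assumes that form; the source says only that the difference is reflected through a loss function, so the choice of mean squared error and the name `feature_loss` are illustrative assumptions.

```python
def feature_loss(generated: list, original: list) -> float:
    """Mean squared difference between two equally long feature vectors.

    generated: feature values extracted from the generated face image.
    original:  feature values extracted from the speaker's original face image.
    """
    if not generated or len(generated) != len(original):
        raise ValueError("feature vectors must be non-empty and of equal length")
    return sum((g - o) ** 2 for g, o in zip(generated, original)) / len(generated)
```

The loss is zero when the two images yield identical features and grows as the generated face drifts from the original.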

The learning unit 140 may reflect, in the training of the face generation model 420, the result of the image analysis 450 performed on the face image 430 generated through the trained face generation model 420 and on the original face image 440.

The learning unit 140 may further reflect the result of a comparison between the face image 430 generated through the face generation model 420 and the speaker's voice feature information 460. For example, the learning unit 140 may compare the result of the image analysis 450, which compares the face image 430 generated through the face generation model 420 with the original face image 440, against the speaker's voice feature information 460. Specifically, the learning unit 140 may compare the result of the image analysis 450 with the speaker's voice feature information 460 and, when the two do not agree, may leave the result out of the training of the face generation model 420.

This is because an analysis can be considered highly accurate only when the two analysis results agree; training on a comparison with an erroneous analysis result could introduce errors into the training of the face generation model 420.
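The agreement check described above can be sketched as a gate placed in front of the training update. The dictionary representation of the two analysis results and the function names are assumptions for illustration; the source does not specify how agreement is measured.

```python
def analyses_agree(image_analysis: dict, voice_features: dict) -> bool:
    """True when every attribute present in both analyses has the same value."""
    shared = image_analysis.keys() & voice_features.keys()
    # With no attribute in common there is nothing to confirm, so do not agree.
    return bool(shared) and all(image_analysis[k] == voice_features[k] for k in shared)


def loss_for_training(loss: float, image_analysis: dict, voice_features: dict):
    """Return the loss if both analyses agree, otherwise None (skip this sample)."""
    return loss if analyses_agree(image_analysis, voice_features) else None
```

A training loop built on this sketch would apply a model update only when `loss_for_training` returns a number, so samples with conflicting analyses never influence the model.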

In this way, the learning unit 140 reflects both the result of the image analysis 450, which compares the face image 430 generated through the face generation model 420 with the original face image 440, and the speaker's voice feature information 460 in the training of the face generation model 420, thereby training the face generation model 420 to generate a more accurate face image 430 of the speaker.

FIG. 5 is an exemplary diagram illustrating the generation unit according to an embodiment of the present invention. Referring to FIG. 5, the generation unit 150 may generate a face image 520 for unidentified voice data through a pre-trained face generation model 510.

For example, the generation unit 150 may generate a face image 520 by analyzing the unidentified voice data, compare the generated face image 520 with the voice feature information 540 extracted from the unidentified voice data, and generate a more accurate face image 520 of the speaker based on the matching face information 530.

The generation unit 150 may generate speaker information 550 for the face image 520 using the result of a comparative analysis between the voice feature information 540 for the unidentified voice data and the face information 530 obtained by analyzing the face image 520 generated through the face generation model 510.

For example, the generation unit 150 may analyze the face image 520 generated through the face generation model 510 using an age analysis model, a gender analysis model, a race analysis model, an expression analysis model, and the like to obtain the face information 530, compare the face information 530 with the voice feature information 540, and, based on the matching information, generate speaker information 550 that includes the speaker's facial expression, emotional state, and the like.

Specifically, referring to FIG. 5, the speaker information 550 generated according to one embodiment is obtained by comparing the result of analyzing the face image 520 generated through the face generation model 510 with the age analysis model (e.g., twenties) against the age information (e.g., twenties) included in the voice feature information 540. When the age information "twenties" matches, the speaker information 550 for that face image 520 may include "twenties".
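The matching step in this example can be sketched as below. The function name `build_speaker_info`, the dictionary keys, and the label strings are assumptions made for illustration; the sketch only shows the idea of keeping an attribute in the speaker information when the face analysis and the voice analysis agree on it.

```python
def build_speaker_info(face_info: dict, voice_features: dict) -> dict:
    """Keep only the attributes on which face analysis and voice analysis agree."""
    return {
        key: value
        for key, value in face_info.items()
        if key in voice_features and voice_features[key] == value
    }


# Hypothetical analysis outputs: the age labels agree, the gender labels do not,
# and the expression has no counterpart in the voice features.
face_info = {"age": "20s", "gender": "male", "expression": "neutral"}
voice_features = {"age": "20s", "gender": "female"}

speaker_info = build_speaker_info(face_info, voice_features)
print(speaker_info)
```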

As another example, the generation unit 150 may generate a face image 520 through the face generation model 510 and compare the generated face image 520 with the voice feature information 540 to generate speaker information 550 that infers not only the speaker's face, expression, age, gender, and race but also, in particular, the truthfulness of the utterance and the speaker's psychological state at the time. It can therefore provide information that helps estimate the face of, and apprehend, offenders such as voice phishing perpetrators or kidnappers making threats.

Specifically, referring to FIG. 5, the speaker information 550 generated according to another embodiment is obtained by comparing the result of analyzing the face image 520 generated through the face generation model 510 with the expression analysis model (e.g., sadness) against the utterance content information (e.g., "I am happy") included in the voice feature information 540. The speaker information 550 for that face image 520 may then indicate that the utterance is a "lie".
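The comparison in this example can be sketched as a consistency check between an expression label and the polarity of the utterance. Everything named here is hypothetical: the `EXPRESSION_POLARITY` table, the label strings, and the assumption that an upstream step has already reduced the utterance content to a polarity. The source labels a mismatch such as sadness versus "I am happy" as a "lie"; the sketch reports it more neutrally as "inconsistent".

```python
# Hypothetical mapping from expression labels to a coarse emotional polarity.
EXPRESSION_POLARITY = {
    "happiness": "positive",
    "sadness": "negative",
    "anger": "negative",
    "neutral": "neutral",
}


def assess_utterance(expression: str, utterance_polarity: str) -> str:
    """Compare the facial expression with the polarity of what was said.

    Returns "consistent", "inconsistent", or "undetermined" when either
    side carries no usable signal (unknown or neutral).
    """
    face_polarity = EXPRESSION_POLARITY.get(expression)
    if face_polarity in (None, "neutral") or utterance_polarity == "neutral":
        return "undetermined"
    return "consistent" if face_polarity == utterance_polarity else "inconsistent"


# Expression analysis says "sadness"; the utterance "I am happy" is assumed
# to have been classified as positive by a separate step.
verdict = assess_utterance("sadness", "positive")
print(verdict)
```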

As described above, the face image generating apparatus 100 according to an embodiment of the present invention generates not only the speaker's actual face image for unidentified voice data but also information about the speaker, and can thereby provide information that helps track down (apprehend) the speaker (suspect).

For example, the face image generating apparatus 100 may analyze the speaker's age, gender, race, utterance content, emotion, and the like from the unidentified voice data alone, and generate a face image of the speaker matching that voice data based on the analysis result. For the generated face image, it may further analyze the truthfulness of the speaker's utterance and the speaker's psychological state at the time of the utterance, thereby providing speaker information that helps track down the speaker.

As another example, to apprehend a suspect in a voice phishing or kidnapping-threat crime, the face image generating apparatus 100 may analyze voice data using only the voice data input through a telephone call or a recording, and generate a face image of the suspect based on the result of that analysis. For the generated face image, it may further analyze the truthfulness of the suspect's utterance and the suspect's psychological state at the time of the utterance, thereby providing suspect information that helps apprehend the suspect.

FIG. 6 is a flowchart of a method for generating a face image according to an embodiment of the present invention. The method shown in FIG. 6 includes steps processed in time series according to the embodiments shown in FIGS. 1 to 5. Accordingly, even where omitted below, the description given for the embodiments shown in FIGS. 1 to 5 also applies to the method by which the face image generating apparatus 100 generates a face image based on voice data.

In step S610, the face image generating apparatus may receive voice data and an original face image of a speaker.

In step S620, the face image generating apparatus may extract voice feature information by analyzing the voice data.

In step S630, the face image generating apparatus may extract a face image through a face image storage unit based on the extracted voice feature information.

In step S640, the face image generating apparatus may train a face generation model based on the voice data and the extracted face image.

In step S650, the face image generating apparatus may generate a face image for unidentified voice data through the trained face generation model.

In the above description, steps S610 to S650 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. In addition, some steps may be omitted as needed, and the order of the steps may be changed.
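The flow of steps S610 to S650 can be sketched as a single function that wires the stages together. The function `run_pipeline`, its parameter names, and the stand-in callables in the usage example are all hypothetical; the actual feature extractor, face image storage lookup, and model training are not specified at this level, so they are passed in as callables.

```python
from typing import Any, Callable


def run_pipeline(
    voice_data: Any,
    original_face: Any,
    unidentified_voice: Any,
    *,
    extract_features: Callable[[Any], Any],
    retrieve_face: Callable[[Any], Any],
    train_model: Callable[[Any, Any, Any], Callable[[Any], Any]],
) -> Any:
    """Run the five steps in order and return the generated face image."""
    # S610: voice data and the original face image arrive as arguments.
    # S620: extract voice feature information from the voice data.
    features = extract_features(voice_data)
    # S630: extract a face image from storage based on those features.
    stored_face = retrieve_face(features)
    # S640: train the face generation model on the voice data and the extracted
    # face image (the original face image is passed along for comparison).
    model = train_model(voice_data, stored_face, original_face)
    # S650: generate a face image for the unidentified voice data.
    return model(unidentified_voice)


# Usage with trivial stand-ins so that the control flow can be followed end to end.
result = run_pipeline(
    "abc",
    "orig.png",
    "xyz",
    extract_features=lambda voice: {"length": len(voice)},
    retrieve_face=lambda feats: f"face-for-{feats['length']}",
    train_model=lambda voice, face, orig: (lambda new_voice: f"generated({face}, {new_voice})"),
)
print(result)
```

Passing the stages in as callables keeps the sketch neutral about how each stage is built, and it makes the point in the text concrete: a stage can be split, merged, or replaced without changing the overall sequence.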

The method by which the face image generating apparatus 100 described with reference to FIGS. 1 to 6 generates a face image based on voice data may also be implemented in the form of a computer program stored in a computer-readable recording medium and executed by a computer, or in the form of a recording medium containing computer-executable instructions. In addition, the method by which the face image generating apparatus 100 described with reference to FIGS. 1 to 5 generates a face image based on voice data may also be implemented in the form of a computer program stored in a computer-readable recording medium and executed by a computer.

A computer-readable recording medium may be any available medium that can be accessed by a computer, and includes volatile and nonvolatile media as well as removable and non-removable media. The computer-readable recording medium may also include a computer storage medium. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can readily be modified into other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and components described as distributed may likewise be implemented in a combined form.

The scope of the present invention is defined by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: face image generating apparatus
110: input unit
120: voice feature information extraction unit
130: face image extraction unit
140: learning unit
150: generation unit

Claims

A face image generating apparatus for generating a face image based on voice data, the apparatus comprising:
an input unit configured to receive voice data and an original face image of a speaker;
a voice feature information extraction unit configured to analyze the voice data and extract voice feature information;
a face image extraction unit configured to extract a face image through a face image storage unit based on the extracted voice feature information;
a learning unit configured to train a face generation model based on the voice data and the extracted face image; and
a generation unit configured to generate a face image for unidentified voice data through the trained face generation model.

The apparatus of claim 1,
wherein the voice feature information includes physical information about the speaker, including information on at least one of the speaker's gender, age, and race.

The apparatus of claim 1,
wherein the voice feature information includes emotion information about the speaker.

The apparatus of claim 1,
wherein the face image extraction unit generates the face image by synthesizing at least one piece of facial feature information, from the face image storage unit, that matches the extracted voice feature information.

The apparatus of claim 1,
wherein the learning unit reflects, in the face generation model, a result of a comparative analysis between the generated face image and the original face image.

The apparatus of claim 5,
wherein the learning unit further reflects, in the face generation model, a result of a comparative analysis between the generated face image and the voice feature information.

The apparatus of claim 1,
wherein the generation unit generates speaker information for the face image based on a result of a comparative analysis between voice feature information for the unidentified voice data and face information obtained from the generated face image.

The apparatus of claim 1,
wherein the voice feature information extraction unit converts the voice data into a frequency image, and analyzes the converted frequency image to extract the voice feature information.

A method for generating a face image based on voice data, performed by a face image generating apparatus, the method comprising:
receiving voice data and an original face image of a speaker;
extracting voice feature information by analyzing the voice data;
extracting a face image through a face image storage unit based on the extracted voice feature information;
training a face generation model based on the voice data and the extracted face image; and
generating a face image for unidentified voice data through the trained face generation model.

The method of claim 9,
wherein the voice feature information includes physical information about the speaker, including information on at least one of the speaker's gender, age, and race.

The method of claim 9,
wherein the voice feature information includes emotion information about the speaker.

The method of claim 9,
wherein extracting the face image comprises generating the face image by synthesizing at least one piece of facial feature information, from the face image storage unit, that matches the extracted voice feature information.

The method of claim 9, further comprising
reflecting, in the face generation model, a result of a comparative analysis between the generated face image and the original face image.

The method of claim 13, further comprising
reflecting, in the face generation model, a result of a comparative analysis between the generated face image and the voice feature information.

The method of claim 9, further comprising
generating speaker information for the face image based on a result of a comparative analysis between voice feature information for the unidentified voice data and face information obtained from the generated face image.

The method of claim 9, further comprising:
converting the voice data into a frequency image; and
extracting the voice feature information by analyzing the converted frequency image.

A computer program stored in a computer-readable recording medium and comprising a sequence of instructions for generating a face image based on voice data,
wherein the computer program, when executed by a computing device, causes the computing device to:
receive voice data and an original face image of a speaker;
extract voice feature information by analyzing the voice data;
extract a face image through a face image storage unit based on the extracted voice feature information;
train a face generation model based on the voice data and the extracted face image; and
generate a face image for unidentified voice data through the trained face generation model.