WO2023090419A1 - Content generation device, content generation method, and program - Google Patents
Content generation device, content generation method, and program
- Publication number
- WO2023090419A1 (application PCT/JP2022/042847)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- image
- content
- user
- unit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/205—Three-dimensional [3D] animation driven by audio data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F3/00—Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
- G06F3/16—Sound input; Sound output
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—Three-dimensional [3D] animation
- G06T13/40—Three-dimensional [3D] animation of characters, e.g. humans, animals or virtual beings
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/04—Details of speech synthesis systems, e.g. synthesiser structure or memory management
- G10L13/047—Architecture of speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/06—Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
- G10L21/10—Transforming into visible information
- G10L2021/105—Synthesis of the lips movements from speech, e.g. for talking heads
Definitions
- the present invention relates to a content generation device, a content generation method, and a program.
- This application claims priority based on Japanese Patent Application No. 2021-188791 filed in Japan on November 19, 2021, the content of which is incorporated herein.
- Patent Literature 1 discloses a technique for promoting document-related communication between users by using an independent and moving character as an avatar.
- With this technique, voice data recorded from the user's voice is prepared in advance, and the avatar's facial expression is defined to change while the voice data is played back, so that the avatar appears to be reading aloud. This allows, for example, an avatar to give a presentation on behalf of the user who is the speaker.
- An object of the present invention is to provide a content generation device, a content generation method, and a program that can reduce the sense of incongruity given to listeners when text is read aloud and can represent the user more like the real person.
- A content generation device includes: an acquisition unit that acquires text data indicating a first text to be read aloud; a speech generation unit that generates synthesized speech in which the first text indicated by the acquired text data is read aloud in the user's voice, using a speech generation model that has learned, based on speech of the user reading aloud a second text to be learned, how the user reads the second text; and a synthesizing unit that synthesizes the generated synthesized speech and an image of the user to generate synthetic content.
- A content generation method includes: an acquisition process in which an acquisition unit acquires text data indicating a first text to be read aloud; and a process in which synthesized speech of the first text indicated by the acquired text data is generated using a speech generation model that has learned, based on the user's voice when a second text is read aloud, how the user reads the second text.
- A program causes a computer to function as: an acquisition unit that acquires text data indicating a first text to be read aloud; a voice generating unit that generates synthesized speech in which the first text indicated by the acquired text data is read aloud in the user's voice, using a speech generation model that has learned, based on speech of the user reading aloud a second text to be learned, how the user reads the second text; and a synthesizing unit that synthesizes the generated synthetic voice and the user's personal image to generate synthesized content.
- The drawings include flowcharts showing examples of the flow of processing for generating the speech generation model, the image generation model, and the synthetic moving image according to the embodiment, a flowchart showing an example of the flow of processing for generating a synthetic moving image according to a modified example of the embodiment, and example diagrams of the synthesized output.
- FIG. 1 is a diagram showing an example of the configuration of a content generation system according to this embodiment.
- the content generation system 1 includes a user terminal 10 and a content generation device 20 .
- the user terminal 10 and the content generation device 20 are communicably connected via a network NW.
- the user terminal 10 and the content generation device 20 may be connected by either wired communication or wireless communication.
- the content generation system 1 is a system for generating content in which a user's digital clone explains materials on behalf of the user.
- the contents are, for example, image contents, web contents, 3D (three-dimensional) contents, 3D hologram contents, and the like.
- Image content is content that displays a digital clone using images such as still images and moving images (video).
- Web content is content that displays a digital clone in a 3D space displayed on a web browser.
- 3D content is content that lets a 3D digital clone do the talking.
- a 3D hologram is content that projects a digital clone using a 3D hologram.
- a digital clone is a digitized copy of a user.
- the digital clone is represented by the user's own image (hereinafter also referred to as "personal image”), and the user's own voice (hereinafter also referred to as "personal voice”) reads out the text of the material.
- the content generation system 1 generates content by synthesizing user's voice, user's image, material image, and the like.
- Content generated by synthesis is hereinafter also referred to as "synthetic content”.
- Synthetic content is generated by synthesizing according to the content used by the user, such as image content, web content, 3D content, and 3D hologram content.
- In the following, an example is described in which the content used by the user is image content, and the content generation system 1 synthesizes the user's voice, the user's image, the image of the material, and the like to generate, as synthetic content, a moving image in which the digital clone explains the material (hereinafter also referred to as a "synthetic moving image").
- the content generation system 1 generates content based on material data.
- the material data includes data indicating the first text to be read aloud (hereinafter also referred to as "text data") and data displayed corresponding to the contents of the text data (hereinafter also referred to as “display data”).
- In the following, a case in which the material is used for a presentation is described as an example.
- Material data used for a presentation is hereinafter also referred to as "presentation data".
- PowerPoint data is data that includes both text data and display data.
- the text data is the text entered in the note section.
- the display data are mainly slides, and may include moving images and animations attached to the slides.
- the content generation system 1 generates a voice in which the first text is read aloud by the user's own voice (hereinafter also referred to as "synthesized voice") based on the text data.
- the content generation system 1 generates synthesized speech using a learned model learned by machine learning.
- The trained model that generates synthesized speech is a model (hereinafter also referred to as a "speech generation model") that has learned how the user reads the second text, based on speech recorded when the user reads out the second text to be learned.
- the reading style of the user to be learned includes, for example, intonation, accent, reading speed, and the like unique to the user.
- the speech generation model can generate and output synthetic speech that reads out the first text indicated by the text data in the same manner as the user's own speech.
- By inputting the text data of the presentation data to the speech generation model, the content generation system 1 can acquire synthesized speech in which the first text indicated by the text data is read aloud in the same way as in the user's own speech.
- the content generation system 1 generates a personal image for digital clone (hereinafter also referred to as a “compositing personal image”) based on the user's personal image.
- the personal image for synthesis may be either a still image or a moving image (video).
- the content generation system 1 generates a personal image for synthesis using a learned model learned by machine learning.
- a trained model for generating a personal image for synthesis is a model (hereinafter also referred to as an “image generation model”) that has learned user actions based on the user's personal image.
- the user's actions to be learned are, for example, the user's facial movements and gestures.
- the movement of the user's face is, for example, a movement of the mouth and a change in facial expression according to the reading.
- Gestures are, for example, head movements and gestures in response to reading.
- the image generation model can generate and output a synthesized person image in which the user's actions change according to the voice.
- By inputting the synthetic voice generated from the text data of the presentation data to the image generation model, the content generation system 1 can obtain, as the personal image for synthesis, a personal image whose actions change according to the synthetic voice.
- the content generation system 1 generates data representing a digital clone of the user (hereinafter also referred to as “clone data”) by synthesizing the generated synthetic voice and the personal image for synthesis.
- In the digital clone represented by the clone data, the first text is read aloud in the user's own voice, and the personal image changes according to the content of the first text as if the user were performing the corresponding actions.
- For example, the user's mouth and facial expression change as the first text is read out (as the user's voice is output), or the image changes as if the user were moving his or her head or gesturing. By changing the user's own image in accordance with the user's own voice in this way, the gap between the voice and the image is reduced, which lessens the sense of discomfort given to the listener.
- the clone data is generated in a data format corresponding to content used by the user, such as image content, web content, 3D content, and 3D hologram content.
- When the content used by the user is image content, the clone data is a moving image representing a digital clone of the user (hereinafter also referred to as a "clone moving image").
- the content generation system 1 generates an image (hereinafter also referred to as “display image”) displayed in correspondence with the synthesized speech based on the display data.
- the content generation system 1 also generates text data (hereinafter also referred to as “subtitle text”) to be displayed as subtitles based on the text data. Then, the content generation system 1 synthesizes the clone moving image, the display image, and the subtitle text, thereby generating a moving image in which the user's digital clone explains the contents of the material as a synthesized moving image.
- a synthetic video is an example of content generated by the content generation system 1 .
- a digital clone of the user reads out the first text according to the content of the displayed material. This makes it appear as if a digital clone of the user is explaining the material on behalf of the user.
- a user terminal 10 is a terminal used by a user.
- the user terminal 10 includes an input device (mouse, keyboard, touch panel, etc.), an output device (display, speaker, etc.), a central processing unit, and the like.
- any terminal such as a PC (Personal Computer), a smart phone, a tablet, or the like may be used.
- the user operates the user terminal 10 to upload to the content generation device 20 information necessary for generating (learning) the speech generation model and the image generation model, and information necessary for generating a synthesized moving image.
- the information necessary for generating the speech generation model is the speech (hereinafter also referred to as "learning speech") read by the user from the second text to be learned.
- the training speech is generated, for example, by having the user actually read out about 200 second texts.
- the information necessary for generating the image generation model is the user's personal image for learning (hereinafter also referred to as “learning personal image”).
- the person image for learning may be either a still image or a moving image (video), but the image generation model can learn the change of the user's motion with higher accuracy with the moving image.
- Information necessary for generating a synthetic moving image is presentation data.
- Based on the uploaded presentation data, the content generation device 20 generates a synthetic moving image using the speech generation model and the image generation model.
- the user can operate the user terminal 10 to download and reproduce the composite moving image from the content generation device 20, thereby allowing the digital clone to give a presentation on behalf of the user.
- the content generation device 20 is a device that generates a synthetic moving image (an example of content).
- the content generation device 20 includes an input device (mouse, keyboard, touch panel, etc.), an output device (display, speaker, etc.), a central processing unit, and the like.
- the content generation device 20 is, for example, a server device realized by a PC (Personal Computer).
- The content generation device 20 generates a speech generation model, an image generation model, and a synthetic moving image based on various information uploaded from the user terminal 10. Specifically, the content generation device 20 generates the speech generation model based on the learning speech uploaded from the user terminal 10, and generates the image generation model based on the personal image for learning uploaded from the user terminal 10. The content generation device 20 also generates a display image based on the display data of the presentation data uploaded from the user terminal 10, and generates caption text based on the text data of that presentation data.
- The content generation device 20 inputs the text data of the presentation data uploaded from the user terminal 10 to the speech generation model to generate synthesized speech, inputs the generated synthesized speech to the image generation model to generate a personal image for synthesis, and synthesizes the generated synthetic voice and personal image for synthesis to generate a clone video. Then, the content generation device 20 synthesizes the generated display image, caption text, and clone video to generate a composite video.
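
The data flow in the preceding paragraph can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the two models are passed in as plain callables, byte strings stand in for audio and image data, and every name (`build_composite_video`, `CloneVideo`, `CompositeVideo`, the stand-in models) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CloneVideo:
    audio: bytes          # synthesized speech in the user's voice
    frames: List[bytes]   # personal images for synthesis, driven by the audio


@dataclass
class CompositeVideo:
    display_image: bytes  # image rendered from the display data (e.g. a slide)
    caption: str          # caption text derived from the text data
    clone: CloneVideo


def build_composite_video(
    text: str,
    display_image: bytes,
    speech_model: Callable[[str], bytes],
    image_model: Callable[[bytes], List[bytes]],
) -> CompositeVideo:
    """Text -> speech -> personal images -> clone video -> composite video."""
    audio = speech_model(text)       # speech generation model (assumed callable)
    frames = image_model(audio)      # image generation model (assumed callable)
    clone = CloneVideo(audio=audio, frames=frames)
    return CompositeVideo(display_image=display_image, caption=text, clone=clone)


# Stand-in models used only to exercise the flow; real models would be trained.
def fake_speech_model(text: str) -> bytes:
    return text.encode("utf-8")


def fake_image_model(audio: bytes) -> List[bytes]:
    return [bytes([b]) for b in audio]


video = build_composite_video("Hi", b"slide-1", fake_speech_model, fake_image_model)
```

Passing the models in as arguments keeps the sketch independent of any particular speech or image synthesis library.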
- The content generation device 20 includes a communication unit 210, an input unit 220, a storage unit 230, a control unit 240, and an output unit 250.
- the communication unit 210 has a function of transmitting and receiving various information.
- the communication unit 210 communicates with the user terminal 10 via the network NW.
- The communication unit 210 receives learning speech, which is information necessary for generating a speech generation model, in communication with the user terminal 10.
- the communication unit 210 receives a training person image, which is information necessary for generating an image generation model, in communication with the user terminal 10 .
- the communication unit 210 receives presentation data, which is information necessary for generating a synthetic video, in communication with the user terminal 10 .
- the communication unit 210 transmits a synthetic moving image in communication with the user terminal 10 .
- Input unit 220 has a function of receiving an input.
- the input unit 220 receives input of information input by an input device such as a mouse, keyboard, or touch panel provided as hardware in the content generation device 20 .
- the storage unit 230 has a function of storing various information.
- The storage unit 230 includes a storage medium provided as hardware in the content generation device 20, such as an HDD (Hard Disk Drive), SSD (Solid State Drive), flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), RAM (Random Access read/write Memory), ROM (Read Only Memory), or any combination of these storage media.
- the storage unit 230 stores a speech generation model 231 and an image generation model 232 .
- the storage unit 230 may store learning voices, learning person images, presentation data, and the like received by the communication unit 210 from the user terminal 10 .
- the storage unit 230 may also store display images, caption texts, synthetic voices, personal images for synthesis, clone videos, synthetic videos, and the like generated by the content generation device 20 .
- Control unit 240 has a function of controlling the overall operation of the content generation device 20 .
- the control unit 240 is implemented, for example, by causing a CPU (Central Processing Unit) provided as hardware in the content generation device 20 to execute a program.
- The control unit 240 includes an acquisition unit 241, a learning unit 242, a division unit 243, a reproduction time determination unit 244, a subtitle generation unit 245, a voice generation unit 246, an image generation unit 247, a synthesizing unit 248, and an output processing unit 249.
- the acquisition unit 241 has a function of acquiring various information. For example, the acquisition unit 241 acquires the learning voice, the learning person's own image, and the presentation data received by the communication unit 210 from the user terminal 10 . Acquisition unit 241 acquires text data to be read aloud and display data to be displayed corresponding to the contents of the text data from the presentation data.
- FIGS. 2 to 4 are diagrams showing examples of presentation data according to this embodiment.
- FIGS. 2 to 4 show presentation data 30 composed of n (n is a natural number) slides 31-1 to 31-n.
- slides 31-1 to 31-n are displayed.
- a slide selected from slides 31-1 to 31-n is displayed in the display area DA2 of the presentation data 30.
- The display area DA3 of the presentation data 30 displays the first text corresponding to the slide selected from the slides 31-1 to 31-n.
- FIG. 2 is a diagram showing the first slide. As shown in FIG. 2, the first slide 31-1 is displayed in the display area DA2, and the first text 32-1 corresponding to the slide 31-1 is displayed in the display area DA3.
- FIG. 3 is a diagram showing the second slide. As shown in FIG. 3, the second slide 31-2 is displayed in the display area DA2, and the first text 32-2 corresponding to the slide 31-2 is displayed in the display area DA3.
- FIG. 4 is a diagram showing the n-th slide. As shown in FIG. 4, the n-th slide 31-n is displayed in the display area DA2, and the first text 32-n corresponding to the slide 31-n is displayed in the display area DA3.
- the acquisition unit 241 acquires the slides 31-1 to 31-n as display data from the presentation data 30, and acquires the first texts 32-1 to 32-n as text data.
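
The acquisition step described above can be illustrated with a small sketch. It assumes a simplified in-memory model of presentation data in which each slide carries its visual content and the text of its note section; the `Slide` class and `acquire` function are hypothetical names, and no real presentation file format is parsed here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Slide:
    image: str  # stand-in for the slide's visual content (display data)
    notes: str  # text entered in the note section (first text)


def acquire(presentation: List[Slide]) -> Tuple[List[str], List[str]]:
    """Return (display data, text data), both in slide order."""
    display_data = [slide.image for slide in presentation]
    text_data = [slide.notes for slide in presentation]
    return display_data, text_data


presentation_30 = [
    Slide(image="slide 31-1", notes="first text 32-1"),
    Slide(image="slide 31-2", notes="first text 32-2"),
]
display_data, text_data = acquire(presentation_30)
```

Keeping both lists in slide order preserves the correspondence between each slide and its first text.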
- the learning unit 242 has a function of generating a trained model. For example, the learning unit 242 generates a trained model by machine learning using learning data acquired by the acquisition unit 241 .
- the learning unit 242 uses the learning voice acquired by the acquisition unit 241 as teacher data to machine-learn how to read the text using the user's voice.
- The learning unit 242 thereby generates, as a trained model, the speech generation model 231, which, when text data is input, generates and outputs a synthesized voice in which the first text indicated by the text data is read aloud in the user's own voice.
- The learning unit 242 performs transfer learning of the learning speech (teacher data) on an existing trained model that has learned texts and how to read them aloud in advance, thereby generating the user's original speech generation model 231. Note that the existing trained model for generating the speech generation model 231 is pre-stored in the storage unit 230.
- An existing trained model for generating the speech generation model 231 has a dictionary indicating general intonations and accents, and can reproduce general intonations and accents.
- By having one trained model learn only the learning speech of one user, the learning unit 242 can generate an original speech generation model 231 for each of a plurality of users.
- By performing transfer learning of the user's learning speech on the trained model of each language, the learning unit 242 can generate an original speech generation model 231 for each language.
- the learning unit 242 writes the generated speech generation model 231 to the storage unit 230 for storage.
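
The bookkeeping implied above (one original model per user and per language, each derived from a shared pre-trained model) can be sketched as follows. This is only an illustration: `fine_tune` is a hypothetical placeholder that copies the base model and records the learning speech, standing in for actual transfer learning, and models are represented as plain dictionaries.

```python
import copy
from typing import Dict, List, Tuple


def fine_tune(base_model: dict, learning_speech: List[str]) -> dict:
    """Placeholder for transfer learning: copy the base model, then adapt the copy."""
    model = copy.deepcopy(base_model)          # the shared base model stays untouched
    model["adapted_on"] = list(learning_speech)
    return model


def build_user_models(
    base_models: Dict[str, dict],              # language -> pre-trained model
    learning_speech: Dict[str, List[str]],     # user -> that user's learning speech
) -> Dict[Tuple[str, str], dict]:
    """Create one original model per (user, language) pair."""
    models = {}
    for language, base in base_models.items():
        for user, speech in learning_speech.items():
            models[(user, language)] = fine_tune(base, speech)
    return models


base_models = {"ja": {"language": "ja"}, "en": {"language": "en"}}
speech = {"user_a": ["a1.wav"], "user_b": ["b1.wav", "b2.wav"]}
user_models = build_user_models(base_models, speech)
```

Each resulting model has learned from exactly one user's speech, which mirrors the one-user-per-model constraint in the text.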
- the learning unit 242 machine-learns the motion of the user using the learning person image acquired by the acquisition unit 241 .
- The learning unit 242 thereby generates, as a trained model, the image generation model 232, which can generate and output a personal image for synthesis in which the user's behavior changes according to the input synthetic speech.
- The learning unit 242 performs transfer learning of the personal image for learning (teacher data) on an existing trained model that has learned changes in a person's motion in advance, thereby generating the user's original image generation model 232.
- An existing trained model for generating the image generation model 232 is pre-stored in the storage unit 230 .
- the existing trained model for generating the image generation model 232 is, for example, a model in which mouth movements synchronized with speech are learned in advance using a GAN (Generative Adversarial Network).
- By having one trained model learn only one user's personal image for learning, the learning unit 242 can generate an original image generation model 232 for each of a plurality of users.
- the learning unit 242 writes the generated image generation model 232 to the storage unit 230 for storage.
- the dividing unit 243 has a function of dividing the first text (text data). For example, the dividing unit 243 divides the first text into a plurality of pieces based on an input indicating the division location of the first text. An input indicating a division point is, for example, a line feed. The dividing unit 243 divides the first text into a plurality of sentences at each line feed. By dividing the first text by the dividing unit 243, the first text is read out in units of divided sentences. As a result, the voice is interrupted each time one divided sentence is read aloud, so that the first text can be read aloud with a pause.
- For example, the dividing unit 243 divides the first text 32-2 into two sentences: ""Digital Transformation" will be expanded" and "Our company will continue to measure...". In the case of the first text 32-n shown in FIG. 4, no line feed is entered, so the dividing unit 243 does not divide the first text 32-n.
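
The line-feed-based division described above can be sketched in a few lines. The function name `split_first_text` is hypothetical, and dropping empty lines and surrounding whitespace is an assumption of this sketch rather than something the source specifies.

```python
from typing import List


def split_first_text(first_text: str) -> List[str]:
    """Divide the first text into sentences at each line feed.

    str.splitlines() also splits on other line boundaries such as "\r\n".
    Blank lines are skipped so that they do not become empty sentences.
    """
    return [line.strip() for line in first_text.splitlines() if line.strip()]


two_sentences = split_first_text("First sentence.\nSecond sentence.")
undivided = split_first_text("No line feed is entered here.")
```

A text without any line feed comes back as a single element, which corresponds to the case where the text is not divided.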
- The playback time determination unit 244 has a function of determining the playback time of the synthetic moving image. For example, the playback time determination unit 244 determines the playback time based on the number of characters of the first text. Specifically, for each acquired text data, the playback time determination unit 244 converts the number of characters of the first text indicated by the text data into a time, and uses that time as the playback time of the display image corresponding to the text data. The playback time determination unit 244 may also determine the playback time of the synthetic moving image based on the reading speed of the synthetic voice.
- In that case, the playback time determination unit 244 determines the playback time of the display image corresponding to the text data by calculating, for each acquired text data, the time at which reading is completed based on the reading speed. When there are a plurality of display images, the playback time determination unit 244 calculates the playback time of the synthetic moving image by totaling the playback times of the respective display images. Note that the playback time determination unit 244 may determine the playback time of the synthetic moving image based on both the number of characters in the first text and the reading speed.
- The playback time determination unit 244 may determine the playback time of the synthetic moving image in consideration of the pauses in reading the first text. For example, the playback time determination unit 244 calculates the time required for the pauses between sentences according to the number of sentences divided by the dividing unit 243, and adds it to the playback time calculated based on the number of characters and the reading speed described above.
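
One way to read the calculation above is: reading time equals character count divided by reading speed, plus a fixed pause between consecutive sentences, summed over all display images. The sketch below follows that reading. The function names, the default reading speed of 5 characters per second, the 0.5 second pause, and the choice of one pause per sentence boundary are all assumptions made for illustration; the source gives no concrete values.

```python
from typing import List


def slide_playback_seconds(
    sentences: List[str],
    chars_per_second: float = 5.0,   # assumed reading speed
    pause_seconds: float = 0.5,      # assumed pause between sentences
) -> float:
    """Playback time of one display image from its divided sentences."""
    if chars_per_second <= 0:
        raise ValueError("chars_per_second must be positive")
    char_count = sum(len(sentence) for sentence in sentences)
    reading_time = char_count / chars_per_second
    pause_time = pause_seconds * max(len(sentences) - 1, 0)
    return reading_time + pause_time


def total_playback_seconds(
    slides: List[List[str]],
    chars_per_second: float = 5.0,
    pause_seconds: float = 0.5,
) -> float:
    """Playback time of the whole video: the sum over all display images."""
    return sum(
        slide_playback_seconds(sentences, chars_per_second, pause_seconds)
        for sentences in slides
    )
```

For example, a slide with the two sentences "abcde" and "fghij" takes 10 / 5 + 0.5 = 2.5 seconds under these assumed values.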
- Subtitle generation unit 245 has a function of generating subtitle text.
- the subtitle generation unit 245 generates subtitle text based on the text data acquired by the acquisition unit 241 .
- the caption generation unit 245 generates caption text for each sentence divided by the division unit 243 .
- The subtitle generation unit 245 may generate subtitle text by translating the text data according to the language supported by the speech generation model 231.
- the voice generator 246 has a function of generating synthetic voice.
- The voice generation unit 246 uses the user's original speech generation model 231 to generate synthesized voice in which the first text indicated by the text data acquired by the acquisition unit 241 is read out in the user's voice.
- the speech generation unit 246 generates synthetic speech for each text data (sentence) divided by the division unit 243 .
- the voice is interrupted every time one of the generated synthesized voices is read aloud, so that the first text indicated by the text data can be read aloud with a pause.
- The speech generation unit 246 may use a speech generation model 231 obtained by transfer learning based on the learning speech (teacher data) in the first language that was used to train the speech generation model 231 corresponding to the first language.
- In this way, the speech generation unit 246 can generate synthesized speech in which the first text, given as text data in the first language, is read aloud in the user's voice in a second language.
- the image generator 247 has a function of generating various images for generating a composite moving image. For example, the image generation unit 247 generates a display image to be displayed corresponding to the synthesized speech based on the display data acquired by the acquisition unit 241 . Specifically, the image generator 247 generates a display image by converting display data into an image.
- The image generation unit 247 uses the user's original image generation model 232 to generate a personal image for synthesis in which the user's actions change according to the synthesized speech generated using the speech generation model 231.
- the image generation unit 247 generates a personal image for synthesis in which the movement of the user's face changes according to the reading by the synthesized voice.
- the image generation unit 247 generates a personal image for synthesis in which the movement of the user's mouth and facial expression change according to the reading by the synthesized voice.
- the image generation unit 247 may generate a personal image for synthesis that changes as if the user is gesturing according to the reading by the synthetic voice.
- the image generation unit 247 generates a personal image for synthesis in which the user moves his or her head or makes gestures in response to reading by synthetic voice. In this manner, the image generating unit 247 generates a personal image for synthesis that reproduces the movement of the user himself/herself when the user reads out the text or gives a presentation. As a result, the clone of the clone moving image can read out text or give a presentation while moving more naturally like the user himself/herself.
- The synthesizing unit 248 has a function of performing various kinds of synthesis. For example, the synthesizing unit 248 synthesizes at least the synthesized speech generated by the speech generation unit 246, the display image generated by the image generation unit 247, and the user's personal image to generate a synthetic moving image. Specifically, the synthesizing unit 248 synthesizes the synthesized speech generated by the speech generation unit 246 and the personal image for synthesis generated by the image generation unit 247 to generate the clone moving image. Then, the synthesizing unit 248 synthesizes the display image and the generated clone moving image to generate a synthetic moving image.
- When there are multiple slides (display data) and multiple first texts (text data), as in the presentation data 30, a synthesized speech and a personal image for synthesis are generated for each pair of a slide and a first text. The synthesizing unit 248 therefore generates a clone moving image for each pair of a slide and a first text and, from it, a synthetic moving image. As a result, as many synthetic moving images as there are slides are generated from one piece of presentation data. The synthesizing unit 248 then joins the plurality of synthetic moving images generated from one piece of presentation data into one final synthetic moving image.
- The image generation unit 247 generates a personal image for synthesis for each of a plurality of synthesized speeches.
- the synthesizing unit 248 generates one clone moving image by synthesizing a plurality of synthetic voices and a plurality of personal images for synthesis in association with each other.
- the synthesizing unit 248 may synthesize the subtitle text generated by the subtitle generating unit 245 in addition to the display image and the clone video to generate a synthetic video.
- The synthesizing unit 248 may synthesize so that only one subtitle text is displayed at a time, or so that a plurality of subtitle texts are displayed at once.
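The per-slide assembly described above can be illustrated with a minimal sketch. It assumes each per-slide synthetic moving image is represented only by the durations of its synthesized sentences; the `Segment` class, its fields, and `build_timeline` are hypothetical names introduced for illustration and do not appear in the specification.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Segment:
    """One per-slide synthetic moving image (hypothetical structure)."""
    slide_id: str
    # Duration in seconds of each synthesized sentence for this slide.
    sentence_durations: List[float]

    @property
    def duration(self) -> float:
        # A segment lasts as long as all of its sentences played back to back.
        return sum(self.sentence_durations)


def build_timeline(segments: List[Segment]) -> List[Tuple[str, float, float]]:
    """Place the per-slide segments one after another.

    Returns (slide_id, start, end) tuples describing where each segment
    sits in the single final synthetic moving image.
    """
    timeline = []
    cursor = 0.0
    for seg in segments:
        start, end = cursor, cursor + seg.duration
        timeline.append((seg.slide_id, start, end))
        cursor = end
    return timeline
```

For example, two segments lasting 5 s and 4 s would occupy 0 to 5 s and 5 to 9 s of the final video. Actual media concatenation is left out; only the ordering and timing bookkeeping is sketched.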
- FIGS. 5 to 7 are diagrams showing an example of a synthetic moving image according to this embodiment.
- FIGS. 5 to 7 show parts of synthetic moving images generated based on the presentation data 30 shown in FIGS. 2 to 4, respectively.
- FIG. 5 shows a synthetic moving image 40-1 generated based on the slide 31-1 and the first text 32-1 shown in FIG. 2, as part of the synthetic moving image generated based on the presentation data 30.
- the composite moving image 40-1 is composed of a display image 41-1, a clone moving image 42-1, and caption text 43-1.
- When the synthetic moving image 40-1 portion of the synthetic moving image is played back, the user's clone shown in the clone moving image 42-1 reads out the subtitle text 43-1 in the user's own voice and in the user's own manner of reading. After the displayed subtitle text 43-1 has been read out, the next subtitle text 43-1 is displayed.
- FIG. 6 shows a synthetic moving image 40-2 generated based on the slide 31-2 and the first text 32-2 shown in FIG. 3, as part of the synthetic moving image generated based on the presentation data 30.
- the composite moving image 40-2 is composed of a display image 41-2, a clone moving image 42-2, and subtitle text 43-2.
- the composite moving image 40-n is composed of a display image 41-n, a clone moving image 42-n, and subtitle text 43-n.
- Synthetic animation 40-n is the last part of the synthetic animation.
- The synthetic moving image 40-n portion of the synthetic moving image is played back in the same manner as the synthetic moving images 40-1 and 40-2, and when playback of the synthetic moving image 40-n ends, playback of the entire synthetic moving image ends.
- Output processing unit 249 has a function of controlling various outputs. For example, the output processing unit 249 transmits the synthesized video generated by the synthesizing unit 248 to the user terminal 10 . In addition, the output processing unit 249 may reproduce the synthesized moving image generated by the synthesizing unit 248, transmit the reproduced video and audio to the user terminal 10, and cause the user terminal 10 to output them.
- The output unit 250 has a function of outputting various information.
- the output unit 250 is implemented by, for example, a display device such as a display or a touch panel provided as hardware in the content generation device 20, and an audio output device such as a speaker.
- the output unit 250 outputs, for example, screens and sounds according to the input from the output processing unit 249 .
- FIG. 8 is a flowchart showing an example of the flow of processing in generating the speech generation model 231 according to this embodiment.
- the content generation device 20 generates and prepares an audio generation model 231 in advance in order to generate a synthetic moving image.
- the acquisition unit 241 of the content generation device 20 acquires learning speech (step S101). Specifically, the acquisition unit 241 acquires the learning voice received from the user terminal 10 by the communication unit 210 of the content generation device 20 .
- The learning unit 242 of the content generation device 20 generates the speech generation model 231 (step S102). Specifically, the learning unit 242 generates the user's own speech generation model 231 by transfer learning, in which the learning speech (teacher data) acquired by the acquisition unit 241 is used to further train an existing trained model that has learned in advance the second text to be learned and how to read it. Note that the existing trained model used to generate the speech generation model 231 is stored in advance in the storage unit 230 of the content generation device 20.
- the learning unit 242 writes the generated speech generation model 231 to the storage unit 230 and stores it (step S103).
- FIG. 9 is a flowchart showing an example of the flow of processing in generating the image generation model 232 according to this embodiment.
- the content generation device 20 generates and prepares an image generation model 232 in advance in order to generate a synthetic moving image.
- the acquisition unit 241 acquires a learning person's image (step S201). Specifically, the acquisition unit 241 acquires the learning person image that the communication unit 210 received from the user terminal 10 .
- The learning unit 242 generates the image generation model 232 (step S202). Specifically, the learning unit 242 generates the user's own image generation model 232 by transfer learning, in which the learning personal image (teacher data) acquired by the acquisition unit 241 is used to further train an existing trained model that has learned changes in human motion in advance. The existing trained model used to generate the image generation model 232 is stored in advance in the storage unit 230 of the content generation device 20.
- the learning unit 242 writes and stores the generated image generation model 232 in the storage unit 230 (step S203).
- FIG. 10 is a flowchart showing an example of the flow of processing in generating a synthetic moving image according to this embodiment.
- the acquisition unit 241 first acquires presentation data (step S301). Specifically, the acquisition unit 241 acquires the presentation data received by the communication unit 210 from the user terminal 10 .
- the acquisition unit 241 acquires display data (step S302). Specifically, the acquiring unit 241 acquires slides included in the acquired presentation data as display data.
- the image generation unit 247 of the content generation device 20 generates a display image (step S303). Specifically, the image generation unit 247 generates a display image by converting the display data acquired by the acquisition unit 241 into an image.
- The acquisition unit 241 acquires text data (step S304). Specifically, the acquisition unit 241 acquires, as text data, the first text in the notes section included in the acquired presentation data.
- The division unit 243 of the content generation device 20 performs division processing (step S305). Specifically, the division unit 243 divides the first text acquired by the acquisition unit 241 into a plurality of sentences by splitting it at each line feed.
- the playback time determination unit 244 of the content generation device 20 determines the playback time of the synthetic video (step S306). Specifically, the playback time determination unit 244 determines the playback time of the synthetic moving image based on the number of characters in the first text, the reading speed, the interval between the divided first texts, and the like.
- the subtitle generation unit 245 of the content generation device 20 generates subtitle text (step S307). Specifically, the subtitle generation unit 245 generates the subtitle text in units of the divided first text.
- The speech generation unit 246 of the content generation device 20 generates synthesized speech (step S308). Specifically, the speech generation unit 246 inputs the plurality of sentences divided by the division unit 243, one by one, into the speech generation model 231 stored in the storage unit 230. As a result, synthesized speech is generated by the speech generation model 231. The speech generation unit 246 then acquires the synthesized speech output from the speech generation model 231.
- The image generation unit 247 generates a personal image for synthesis (step S309). Specifically, the image generation unit 247 inputs the synthesized speeches generated by the speech generation unit 246, one by one, into the image generation model 232 stored in the storage unit 230. As a result, the image generation model 232 generates the personal image for synthesis. The image generation unit 247 then acquires the personal image for synthesis output from the image generation model 232.
- The synthesizing unit 248 of the content generation device 20 generates a clone moving image (step S310). Specifically, for each synthesized speech generated by the speech generation unit 246, the synthesizing unit 248 synthesizes that speech with the corresponding personal image for synthesis generated by the image generation unit 247 to generate a clone moving image.
- the synthesizing unit 248 generates a synthesized moving image (step S311). Specifically, the synthesizing unit 248 synthesizes the display image generated by the image generating unit 247, the caption text generated by the caption generating unit 245, and the synthesized clone moving image to generate a synthesized moving image. After generating the synthetic video, the content generation device 20 ends the process. Note that the content generation device 20 may write and store the generated synthetic video in the storage unit 230 or transmit it to the user terminal 10 as necessary.
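Steps S305 to S307 can be illustrated with a minimal sketch. The specification states only that the playback time is determined from the number of characters, the reading speed, the interval between the divided texts, and the like; the exact formula below (characters divided by a constant reading speed, plus a fixed pause between sentences) is an assumption, as are the function names and the default values of `chars_per_second` and `pause_seconds`.

```python
from typing import List, Tuple


def split_sentences(first_text: str) -> List[str]:
    """Step S305 (sketch): divide the first text at each line feed.

    Blank lines are dropped; this filtering is an assumption.
    """
    return [line.strip() for line in first_text.splitlines() if line.strip()]


def playback_time(sentences: List[str],
                  chars_per_second: float = 5.0,
                  pause_seconds: float = 0.5) -> float:
    """Step S306 (sketch): estimate the playback time in seconds.

    Assumed formula: reading time from the character count, plus one
    pause between each pair of consecutive sentences.
    """
    reading = sum(len(s) for s in sentences) / chars_per_second
    pauses = pause_seconds * max(len(sentences) - 1, 0)
    return reading + pauses


def subtitle_cues(sentences: List[str],
                  chars_per_second: float = 5.0,
                  pause_seconds: float = 0.5) -> List[Tuple[float, float, str]]:
    """Step S307 (sketch): one subtitle cue (start, end, text) per sentence."""
    cues = []
    t = 0.0
    for s in sentences:
        end = t + len(s) / chars_per_second
        cues.append((t, end, s))
        t = end + pause_seconds  # the next subtitle starts after the pause
    return cues
```

In an actual implementation the cue times would more likely come from the measured length of each synthesized speech than from a character count; the estimate here only shows how the three steps fit together.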
- the content generation device 20 includes the acquisition unit 241 , the audio generation unit 246 , the image generation unit 247 and the synthesis unit 248 .
- Acquisition unit 241 acquires text data indicating the first text to be read aloud and display data displayed corresponding to the content of the text data.
- The speech generation unit 246 uses the speech generation model 231, which has learned how the user reads the second text in the user's voice based on speech recorded when the user read aloud the second text to be learned, to generate synthesized speech in which the first text indicated by the text data is read aloud in the user's voice.
- the image generator 247 generates a display image displayed in correspondence with the synthesized speech based on the acquired display data.
- the synthesizing unit 248 synthesizes the generated synthetic voice and display image with the user's own image to generate synthetic content.
- The content generation device 20 can thereby reduce the sense of incongruity that the reading gives the listener, and can also convey what is characteristic of the user himself/herself.
- the speech generation unit 246 may express the user's emotions with synthetic speech by adjusting parameters.
- the speech generation unit 246 sets parameters according to the contents of the text data, for example, 80% joy and 20% surprise.
- various emotions such as sadness and anger may be combined.
- The speech generation unit 246 can thereby generate synthesized speech that, in addition to reading in the user's own style, expresses various emotions according to the contents of the text data.
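The emotion parameters mentioned above (for example, 80% joy and 20% surprise) can be illustrated with a minimal sketch. The specification does not say how the parameters are represented; the dictionary of weights and the `normalize_emotions` helper below are assumptions made for illustration.

```python
from typing import Dict


def normalize_emotions(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative emotion weights so that they sum to 1.

    The mapping from emotion name to weight is a hypothetical
    representation of the parameters set by the speech generation unit.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("emotion weights must be non-negative")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one emotion weight must be positive")
    return {name: w / total for name, w in weights.items()}
```

Calling `normalize_emotions({"joy": 80, "surprise": 20})` yields a mix of 0.8 joy and 0.2 surprise, which could then be passed to a speech generation model that accepts such conditioning.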
- the content generation device 20 may have a function of editing the composite content generated by the composition unit 248 .
- the user inputs editing content to the user terminal 10 .
- the content generation device 20 edits the synthesized content according to the user's input to the user terminal 10 .
- With this function, for example, it is possible to edit slides (display images), subtitle text, the clone's voice, and personal images. Editing slides includes editing the content of a slide, changing the display order of a plurality of slides, adding a new slide, and deleting an existing slide.
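The slide editing operations listed above (reordering, adding, deleting) can be sketched on a plain list of slides. The function names are hypothetical, and returning a new list rather than modifying the input is a design choice of this sketch, not something the specification requires.

```python
from typing import List, TypeVar

T = TypeVar("T")


def move_slide(slides: List[T], src: int, dst: int) -> List[T]:
    """Return a copy with the slide at index src moved to index dst."""
    result = list(slides)
    item = result.pop(src)
    result.insert(dst, item)
    return result


def add_slide(slides: List[T], index: int, slide: T) -> List[T]:
    """Return a copy with a new slide inserted at the given index."""
    result = list(slides)
    result.insert(index, slide)
    return result


def delete_slide(slides: List[T], index: int) -> List[T]:
    """Return a copy with the slide at the given index removed."""
    result = list(slides)
    del result[index]
    return result
```

Keeping the original list unchanged makes it straightforward to regenerate only the segments affected by an edit, or to undo it.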
- In the above embodiment, a clone moving image obtained by synthesizing a plurality of synthesized speeches and a plurality of personal images for synthesis is synthesized with one display image, but the present invention is not limited to such an example.
- a plurality of display images may be synthesized with respect to a clone moving image obtained by synthesizing one synthetic voice and one personal image for synthesis. In this case, a plurality of display images are switched and displayed until reading by one synthesized voice is completed.
- the present invention is not limited to such an example.
- The material data may be any data that includes text data and display data, such as data created using Microsoft Word, data created using Microsoft Excel, or PDF (Portable Document Format) data.
- the material data may be a combination of data containing only text data and data containing only display data.
- the content generation system 1 may consist of only the content generation device 20 that can be directly operated by the user. That is, the content generation device 20 may also serve as the user terminal 10 . In this case, the user can generate and use the synthesized content without connecting the terminal to the network NW.
- The functions of the content generation device 20 may be implemented by a plurality of devices. For example, the functions for generating the speech generation model 231 and the image generation model 232 may be implemented by other devices. In this case, the content generation device 20 stores the speech generation model 231 and the image generation model 232 generated by the other device in the storage unit 230, and can thereby generate synthesized content in the same manner as in the above embodiment.
- the acquisition unit 241 acquires slides included in the presentation data as display data, and the image generation unit 247 converts the display data acquired by the acquisition unit 241 into an image to generate a display image.
- In this modified example, by contrast, the acquisition unit 241 acquires the text data to be read aloud by the digital clone but does not acquire display data, and the image generation unit 247 generates a personal image for synthesis (digital clone) but does not generate a display image displayed in correspondence with the synthesized speech.
- The displayed synthesized video 50 is a video (clone video 52) obtained by synthesizing the synthesized speech and the digital clone 51.
- The synthesized video 50 does not include a display image, such as a slide, generated by the image generation unit 247.
- subtitle text may be displayed within the clone video 52 .
- the caption generation unit 245 among the components of the content generation device 20 in the above embodiment may be omitted.
- In step S304, the acquisition unit 241 acquires at least the text data to be read by the digital clone. Both step S305 (division processing) and step S306 (playback time determination) are executed. If there is no need to determine the playback time, step S306 and the playback time determination unit 244 that executes it may be omitted.
- In step S309, the image generation unit 247 generates a personal image for synthesis (digital clone) in which the user's actions change according to the generated synthesized speech, but does not generate a display image, such as a slide, displayed in correspondence with the synthesized speech.
- the synthesized moving image to be generated includes the synthesized voice and the personal image for synthesis, but does not include the display image. That is, in this modified example, step S311 in the above embodiment is not essential. However, if subtitle text is to be included in the composite moving image, the clone moving image and the subtitle text may be combined to form a composite moving image in step S311.
- the content generation device 20 may generate 3D content in which display images such as slides displayed in 3D space and digital clones (3D avatars) are arranged.
- A synthetic moving image 60 generated by the content generation device 20 of this modification includes two display images 62 and 63 displayed in 3D and a 3D avatar 64, all arranged in a 3D space 61.
- Each of the two display images 62 and 63 is rendered obliquely so that one end in the horizontal direction of the synthetic moving image 60 (the end near the periphery of the synthetic moving image 60) is positioned closer to the viewer of the synthetic moving image 60 than the other end.
- the display of the 3D space is not limited to the example shown in FIG. 13, and the number, size, arrangement position, inclination direction, etc. of the display images may be changed as appropriate.
- the two display images 62 and 63 display slides, still images, moving images, caption text, and the like.
- the two display images 62 and 63 may be configured to display in real time comments or the like uploaded by viewers of the composite video 60 to an SNS (Social Networking Service).
- the 3D avatar 64 is represented so as to be positioned closer to the viewer than the two display images 62 and 63.
- the present invention is not limited to this.
- the 3D avatar 64 may be moved around in the 3D space 61, the expression of the 3D avatar 64 may be changed, and parts such as the mouth, head, hands, feet, and body may be moved.
- a clone video is created from the synthesized voice and the user's own image, and the clone video is used to create a synthetic video.
- a 3D avatar corresponding to the user may be created using an image, and a synthesized moving image may be produced by extracting, for example, a situation in which the 3D avatar moves its mouth in accordance with synthesized speech in a 3D space. That is, the generation of the clone moving image and the generation of the synthetic moving image may be performed at the same time, and this also applies to the above-described embodiments.
- the composite moving image of this modified example may be a moving image in which display images such as slides, still images, and moving images, clone moving images, caption text, and the like are continuously combined, and the display mode changes over time.
- the moving image may be such that at least one of a slide, a still image, a moving image, a clone moving image, subtitle text, and the like is displayed at one point in the synthesized moving image being reproduced.
- In the above embodiment, the image generation unit 247 generates a personal image for synthesis (personal image) in which the user's actions change according to the synthesized speech, and the synthesizing unit 248 synthesizes the synthesized speech and the personal image for synthesis.
- the synthesizing unit 248 may generate a synthesized moving image using a user's personal image (hereinafter sometimes referred to as an independent personal image) independent of the synthesized voice.
- An independent personal image is an image of the user in which the user's behavior does not change according to the synthesized speech.
- The independent personal image may be an image acquired from the user terminal 10 via the communication unit 210, or may be an image generated by the image generation unit 247 from the learning personal image.
- the image generation unit 247 in the content generation device 20 may be omitted.
- the modified example of the embodiment of the present invention has been described above. It should be noted that part or all of the content generation device 20 in the above-described embodiment may be realized by a computer.
- the computer may include at least one processor and one memory.
- a program for realizing this function may be recorded in a computer-readable recording medium, and the program recorded in this recording medium may be read into a computer system and executed.
- The "computer system" referred to here includes an OS and hardware such as peripheral devices.
- The term "computer-readable recording medium" refers to portable media such as flexible discs, magneto-optical discs, ROMs, and CD-ROMs, and to storage devices such as hard discs incorporated in computer systems.
- The "computer-readable recording medium" may further include something that dynamically holds the program for a short period of time, such as a communication line used when the program is transmitted via a network such as the Internet or via a communication line such as a telephone line, and something that holds the program for a certain period of time, such as volatile memory inside a computer system that serves as a server or client in that case. Further, the program may realize only a part of the functions described above, or may realize the functions described above in combination with a program already recorded in the computer system. The functions may also be implemented using a programmable logic device such as an FPGA (Field Programmable Gate Array).
- One aspect of the present invention is a recording medium that non-transitorily stores a program, the program comprising: an acquisition unit that acquires text data indicating a first text to be read aloud; a speech generation unit that, using a speech generation model that has learned how the user reads a second text in the user's voice based on speech recorded when the user read aloud the second text to be learned, generates synthesized speech in which the first text indicated by the acquired text data is read aloud in the user's voice; and a synthesizing unit that synthesizes the synthesized speech and an image of the user to generate synthetic content.
- 1... Content generation system, 10... User terminal, 20... Content generation device, 30... Presentation data, 31-1 to 31-n... Slide, 32-1 to 32-n... First text, 40-1 to 40-n... Synthetic moving image, 41-1 to 41-n... Display image, 42-1 to 42-n... Clone moving image, 43-1 to 43-n... Subtitle text, 210... Communication unit, 220... Input unit, 230... Storage unit, 231... Speech generation model, 232... Image generation model, 240... Control unit, 241... Acquisition unit, 242... Learning unit, 243... Division unit, 244... Playback time determination unit, 245... Subtitle generation unit, 246... Speech generation unit, 247... Image generation unit, 248... Synthesizing unit, 249... Output processing unit, 250... Output unit
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023562416A JPWO2023090419A1 | 2021-11-19 | 2022-11-18 | |
| US18/667,096 US12608864B2 (en) | 2021-11-19 | 2024-05-17 | Content generation device, content generation method, and program |
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2021-188791 | 2021-11-19 | ||
| JP2021188791 | 2021-11-19 |
Related Child Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18/667,096 Continuation US12608864B2 (en) | 2021-11-19 | 2024-05-17 | Content generation device, content generation method, and program |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023090419A1 (ja) | 2023-05-25 |
Family
ID=86396966
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/JP2022/042847 Ceased WO2023090419A1 (ja) | 2021-11-19 | 2022-11-18 | Content generation device, content generation method, and program |
Country Status (3)
| Country | Link |
|---|---|
| US (1) | US12608864B2 |
| JP (1) | JPWO2023090419A1 |
| WO (1) | WO2023090419A1 |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP7794515B1 (ja) * | 2025-07-14 | 2026-01-06 | 株式会社バリューアップデート | 動画生成装置及び動画生成方法並びにプログラム |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US12561876B2 (en) * | 2022-11-28 | 2026-02-24 | Constructor Technology Ag | System and method for an audio-visual avatar creation |
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2001014307A (ja) * | 1999-07-02 | 2001-01-19 | Sony Corp | 文書処理装置、文書処理方法、及び記録媒体 |
| JP2003108502A (ja) * | 2001-09-28 | 2003-04-11 | Interrobot Inc | 身体性メディア通信システム |
| US20100082345A1 (en) * | 2008-09-26 | 2010-04-01 | Microsoft Corporation | Speech and text driven hmm-based body animation synthesis |
| JP2020006482A (ja) * | 2018-07-09 | 2020-01-16 | 株式会社国際電気通信基礎技術研究所 | アンドロイドのジェスチャ生成装置及びコンピュータプログラム |
| WO2020204000A1 (ja) * | 2019-04-01 | 2020-10-08 | 住友電気工業株式会社 | コミュニケーション支援システム、コミュニケーション支援方法、コミュニケーション支援プログラム、および画像制御プログラム |
| US20210034976A1 (en) * | 2019-08-02 | 2021-02-04 | Google Llc | Framework for Learning to Transfer Learn |
| JP2021177647A (ja) * | 2020-12-22 | 2021-11-11 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | ビデオシーケンス編成方法、装置、電子設備、記憶媒体、及びプログラム |
Family Cites Families (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JPH11312160A (ja) | 1998-02-13 | 1999-11-09 | Fuji Xerox Co Ltd | 自律的パ―ソナルアバタ―による文書注釈方法及び装置 |
| JP4449723B2 (ja) * | 2004-12-08 | 2010-04-14 | ソニー株式会社 | 画像処理装置、画像処理方法、およびプログラム |
| WO2015092936A1 (ja) * | 2013-12-20 | 2015-06-25 | 株式会社東芝 | 音声合成装置、音声合成方法およびプログラム |
| KR102407132B1 (ko) * | 2021-02-05 | 2022-06-10 | 장건 | 고인을 모사하는 가상 인물과 대화를 수행하는 서비스를 제공하는 방법 및 시스템 |
| US12417762B2 (en) * | 2022-04-13 | 2025-09-16 | International Business Machines Corporation | Speech-to-text voice visualization |
| US12039653B1 (en) * | 2023-05-30 | 2024-07-16 | Roku, Inc. | Video-content system with narrative-based video content generation feature |
| CN119126980A (zh) * | 2024-09-04 | 2024-12-13 | 中国矿业大学 | 一种交互式虚拟专家形象生成方法和系统 |
| KR102832018B1 (ko) * | 2024-09-25 | 2025-07-10 | 주식회사 에이아이트릭스 | 얼굴 이미지에 기초한 tts 모델 기반 음성 합성 시스템 및 그것의 합성 방법 |
-
2022
- 2022-11-18 WO PCT/JP2022/042847 patent/WO2023090419A1/ja not_active Ceased
- 2022-11-18 JP JP2023562416A patent/JPWO2023090419A1/ja active Pending
-
2024
- 2024-05-17 US US18/667,096 patent/US12608864B2/en active Active
Non-Patent Citations (1)
| Title |
|---|
| SAITO, NORIAKI.: " Proposal of Personalized Online Course.), non-official translation (Symposium of Information Processing Society of Japan. Groupware and Network Services Workshop 2018); *", INFORMATION PROCESSING SOCIETY OF JAPAN SYMPOSIUM GROUPWARE AND NETWORK SERVICE WORKSHOP 2018, 8 November 2018 (2018-11-08), JP, pages 1 - 6, XP009545702 * |
Also Published As
| Publication number | Publication date |
|---|---|
| US12608864B2 (en) | 2026-04-21 |
| US20240303892A1 (en) | 2024-09-12 |
| JPWO2023090419A1 | 2023-05-25 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US10088976B2 (en) | Systems and methods for multiple voice document narration | |
| CN106653052B (zh) | 虚拟人脸动画的生成方法及装置 | |
| US8498866B2 (en) | Systems and methods for multiple language document narration | |
| KR102116309B1 (ko) | 가상 캐릭터와 텍스트의 동기화 애니메이션 출력 시스템 | |
| US20100318362A1 (en) | Systems and Methods for Multiple Voice Document Narration | |
| US10372790B2 (en) | System, method and apparatus for generating hand gesture animation determined on dialogue length and emotion | |
| US12608864B2 (en) | Content generation device, content generation method, and program | |
| KR100856786B1 (ko) | 3d 버추얼 에이전트를 사용한 멀티미디어 나레이션 저작시스템 및 그 제공 방법 | |
| McDonald | Considerations on generating facial nonmanual signals on signing avatars | |
| Wolfe et al. | State of the art and future challenges of the portrayal of facial nonmanual signals by signing avatar | |
| CN117131210A (zh) | 一种基于富媒体的数字人汇报视频生成方法和系统 | |
| JPWO2023090419A5 | ||
| Kolivand et al. | Realistic lip syncing for virtual character using common viseme set | |
| JP2022164367A (ja) | 翻訳装置およびプログラム | |
| Priya et al. | Enabling global communication through automated real-time video dubbing | |
| JP2024088118A (ja) | コンテンツ生成システム、コンテンツ生成装置、ユーザ端末、コンテンツ生成方法、及びプログラム | |
| JP7807169B2 (ja) | 手話翻訳装置及びプログラム | |
| TWM652806U (zh) | 互動虛擬人像系統 | |
| Zabala et al. | Attainable digital embodied storytelling using state of the art tools, and a little touch | |
| US20230245644A1 (en) | End-to-end modular speech synthesis systems and methods | |
| Martin et al. | 3D audiovisual rendering and real-time interactive control of expressivity in a talking head | |
| JP2020204683A (ja) | 電子出版物視聴覚システム、視聴覚用電子出版物作成プログラム、及び利用者端末用プログラム | |
| Kener et al. | 3D Realistic Animation of Greek Sign Language’s Fingerspelled Signs | |
| CN118053416B (zh) | 声音定制方法、装置、设备及存储介质 | |
| González-Docasal et al. | EAM: emotional avatar generation for the metaverse |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22895700; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2023562416; Country of ref document: JP; Kind code of ref document: A |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 22895700; Country of ref document: EP; Kind code of ref document: A1 |