WO2025009127A1 - Data construction device, data construction method, and data construction program - Google Patents

Data construction device, data construction method, and data construction program

Info

Publication number
WO2025009127A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
image data
speech
speaker
unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/JP2023/025022
Other languages
English (en)
French (fr)
Japanese (ja)
Inventor
沙希 水野
亮 増村
哲 小橋川
伸克 北条
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NTT Inc
Original Assignee
Nippon Telegraph and Telephone Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corp filed Critical Nippon Telegraph and Telephone Corp
Priority to JP2025530909A priority Critical patent/JPWO2025009127A1/ja
Priority to PCT/JP2023/025022 priority patent/WO2025009127A1/ja
Publication of WO2025009127A1 publication Critical patent/WO2025009127A1/ja
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present invention relates to a data construction device, a data construction method, and a data construction program.
  • a technology is known that takes facial image data of a neutral (average) expression, speech audio data, and the type and intensity of an emotion as input, and generates speech video data that reflects the individuality of the facial image data, mouth movements corresponding to the speech audio data, and facial expressions corresponding to the type and intensity of the emotion.
  • One such technology is, for example, a converter that uses speech video data by multiple actors to train a learning model for estimating speech video data with transformed faces (see Non-Patent Document 1).
  • however, when a few-speaker speech video database with impression labels (a database of impression-labeled speech video data by a few speakers) is constructed and a learning model is trained on it using the conventional technology, the training data may contain no data matching the individuality of the speaker's face, and speech video data with unnatural facial expressions may be output.
  • the present invention has a data acquisition unit that acquires speech image data related to an utterance by a first speaker and facial image data related to the face of a second speaker, a data expansion unit that generates extended speech image data related to the utterance by the second speaker based on the speech image data and the facial image data acquired by the data acquisition unit, and a data construction unit that constructs a pair of the extended speech image data generated by the data expansion unit and an impression label related to the impression of the utterance as learning data for training a learning model.
  • the present invention makes it possible to convert speech image data into natural speech image data at low cost.
  • FIG. 1 is a diagram illustrating an example of an overview of a data construction system according to the first embodiment.
  • FIG. 2 is a diagram illustrating an example of the configuration of a data construction system according to the first embodiment.
  • FIG. 3 is a diagram illustrating an example of data expansion.
  • FIG. 4 is a diagram illustrating an example of data expansion.
  • FIG. 5 is a diagram for explaining an example of learning.
  • FIG. 6 is a flowchart showing an example of the flow of processing executed by the data construction system according to the first embodiment.
  • FIG. 7 is a diagram illustrating an example of the configuration of a data construction system according to the second embodiment.
  • FIG. 8 is a diagram for explaining an example of the calculation.
  • FIG. 9 is a diagram for explaining an example of the calculation.
  • FIG. 10 is a diagram for explaining an example of the calculation.
  • FIG. 11 is a diagram for explaining an example of learning.
  • FIG. 12 is a flowchart showing an example of the flow of processing executed by the data construction system according to the second embodiment.
  • FIG. 13 is a diagram illustrating an example of the configuration of a computer that executes a data construction program.
  • FIG. 14 is a diagram for explaining an example of an outline of a data construction system according to the reference technology.
  • FIG. 15 is a diagram for explaining an example of an outline of a data construction system according to the reference technology.
  • Figs. 14 and 15 are diagrams for explaining an example of the outline of a data construction system according to the reference technology.
  • the data construction system 1# according to the reference technology synthesizes speech video data and face image data, and trains a learning model that outputs speech image data related to an utterance with a facial expression according to the individuality of the face image data.
  • the speech image data refers to image data related to an utterance by a first speaker, and may include still image data.
  • the face image data refers to image data related to a second speaker.
  • the data construction system 1# generates output speech video data for the same speaker and speech content, in which the speaker's facial expression is converted according to the emotion label, based on the emotion label related to the speaker's emotion and the input speech video data.
  • the data construction system 1# uses, for example, the input speech video data and the emotion label as input, and trains a learning model by supervised learning using the feature amount of the speaker's facial expression in the output speech video data as the output ground truth data.
  • the data construction system 1# uses an emotion-labeled multi-speaker speech video database, which is a database of speech video data by multiple speakers with emotion labels. From input speech video data with an emotion label in this database, it generates output speech video data in which the speaker and speech content are the same but the speaker's facial expression is transformed according to a different emotion label.
  • “multiple” is not particularly limited as long as it is a predetermined number or more, but refers to, for example, hundreds or thousands of people or more.
  • “few” is not particularly limited as long as it is less than a predetermined number, but refers to, for example, two or three people.
  • data construction system 1# uses input speech video data and emotion labels as input, and a facial expression feature sequence related to the time series of features of the speaker's facial expression in the output speech video data as output ground truth data, and trains a learning model through supervised learning.
  • a converter in data construction system 1# (which converts the speaker's facial expression in the speech video data) first extracts video features from the video data of the input speech video data, and extracts audio features from the audio data of the input speech video data. Next, the converter trains a learning model so that it can estimate the speaker's facial expression feature sequence in the output speech video data, for example, based on the video features, audio features, and emotion labels.
  • using the learning model, the data construction system 1# estimates the features of the speaker's facial expression in the output speech video data based on the emotion label and the input speech video data, and then estimates the output speech video data that reflects those features.
  • the converter extracts video features from the video data of the input speech video data and extracts audio features from the audio data, as in the case of learning the learning model.
  • the converter estimates a facial expression feature sequence based on the video features, audio features, and emotion labels.
  • the converter estimates the output speech video data by rendering based on the input speaker image data and the facial expression feature sequence related to the speaker input to the learning model.
  • the converter estimates output speech video data in which the speaker's facial expression in the input speaker image data is converted according to the facial expression feature sequence to become an expression corresponding to the speaker's expression in the output speech video data.
  • the converter uses, for example, the rendering technology of the following reference document. Reference: Aliaksandr Siarohin, Stephane Lathuiliere, Sergey Tulyakov, Elisa Ricci and Nicu Sebe, “First Order Motion Model for Image Animation”, NeurIPS 2019.
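  • The estimation flow of the converter described above (feature extraction, expression estimation, rendering) can be sketched as follows. This is only an illustrative outline: the class name `Converter`, its four callables, and the list-based data types are hypothetical placeholders, not components named in the source, and any real feature extractor, learning model, or renderer would be supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

Frame = List[float]      # placeholder type for one video frame or still image
Landmarks = List[float]  # placeholder type for one frame's expression features


@dataclass
class Converter:
    # All four callables are hypothetical stand-ins supplied by the caller.
    extract_video_features: Callable[[Sequence[Frame]], List[float]]
    extract_audio_features: Callable[[Sequence[float]], List[float]]
    estimate_expression_sequence: Callable[[List[float], List[float], str], List[Landmarks]]
    render: Callable[[Frame, List[Landmarks]], List[Frame]]

    def convert(self, frames, audio, emotion_label, speaker_image):
        # 1. Extract video and audio features from the input speech video data.
        video_feats = self.extract_video_features(frames)
        audio_feats = self.extract_audio_features(audio)
        # 2. Estimate the facial expression feature sequence from the
        #    features and the emotion label (the learning model's role).
        expr_seq = self.estimate_expression_sequence(video_feats, audio_feats, emotion_label)
        # 3. Render output frames from the input speaker image and the sequence.
        return self.render(speaker_image, expr_seq)
```

  • Passing the components in as callables keeps the sketch independent of any particular model or rendering library.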
  • Fig. 1 is a diagram for explaining an example of the outline of a data construction system according to the first embodiment.
  • the data construction system 1 acquires input speech image data (speech image data) and face image data, generates extended speech image data based on the input speech image data and the face image data, and constructs a pair of the extended speech image data and an impression label as learning data.
  • Speech image data refers to image data related to a speech by a first speaker, and may include still image data.
  • Facial image data refers to image data related to a second speaker.
  • Extended speech image data refers to broader speech image data related to speech in which the facial expressions of the second speaker have been converted, and may include still image data.
  • Training data refers to data used for training a learning model, etc., which takes impression labels, speech image data, and facial image data as inputs, and outputs the facial features of the second speaker in the extended speech image data.
  • the data construction system 1 takes speech video data with impression labels, which includes the input speech image data, and generates extended speech video data containing extended speech image data. In the extended data, the individual facial features are transformed so that the face of the first speaker becomes the face of the second speaker in the facial image data, while the facial expression of the first speaker is preserved. The data construction system 1 then constructs a pair of the extended speech video data and the impression label as learning data.
  • FIG. 2 is a diagram showing an example of the configuration of the data construction system according to the first embodiment.
  • the data construction system 1 includes a data construction device 10.
  • the data construction device 10 acquires speech image data and face image data, generates extended speech image data based on the speech image data and the face image data, and constructs a pair of the extended speech image data and the impression label as learning data.
  • the data construction device 10 has a data acquisition unit 11, a control unit 12, an output unit 13, and a storage unit 14.
  • the data acquisition unit 11 acquires speech image data related to the speech by the first speaker and face image data related to the face of the second speaker.
  • the data acquisition unit 11 may acquire speech video data and face image data including speech image data.
  • the data acquisition unit 11 may acquire speech image data from a few-speaker (minority speaker) speech image database, which relates to speech by each of a small number of speakers (fewer than a predetermined number), and acquire face image data from a multiple face image database, which relates to the face images of many speakers (equal to or greater than a predetermined number).
  • when the few-speaker speech image database includes speech video data, it is also called a few-speaker speech video database.
  • for example, the data acquisition unit 11 acquires speech video data (i) from the few-speaker speech video database and acquires face image data (k) from the multiple face image database. Since the few-speaker speech image database and the multiple face image database can be constructed at low cost, the data acquisition unit 11 can reduce costs by acquiring image data from these databases.
  • the data acquisition unit 11 may further acquire facial features of the second speaker in the extended speech image data.
  • for example, the data acquisition unit 11 acquires the facial landmarks of the speaker in the extended speech video data (i, k) by extracting them, as facial features, from the extended speech video data (i, k).
  • i is a number corresponding to the speaker number of the speech video data.
  • k is a number corresponding to the speaker number of the face image data.
  • the control unit 12 controls the entire data construction device 10.
  • the control unit 12 is configured with one or more processors having programs in which various processing procedures are defined and internal memory in which control information is stored, and the processor executes each process using the programs and internal memory.
  • the control unit 12 is realized by, for example, electronic circuits such as a central processing unit (CPU), a micro processing unit (MPU), or a graphics processing unit (GPU), or integrated circuits such as an application specific integrated circuit (ASIC) or a field programmable gate array (FPGA).
  • the control unit 12 has a data expansion unit 121, a data construction unit 122, a learning unit 123, and an estimation unit 124.
  • the data expansion unit 121 generates extended speech image data related to the speech by the second speaker based on the speech image data and face image data acquired by the data acquisition unit 11.
  • the data expansion unit 121 may generate extended speech video data including extended speech image data based on the speech video data and face image data acquired by the data acquisition unit 11.
  • the data expansion unit 121 is realized by, for example, a converter that converts the features of the speaker's facial expression. An example of data expansion by the data expansion unit 121 will be described below with reference to Figures 3 and 4. Figures 3 and 4 are diagrams for explaining an example of data expansion.
  • the data expansion unit 121 generates extended speech video data (i, k) based on impression-labeled speech video data (i) and facial image data (k) in the impression-labeled speech video database.
  • the data expansion unit 121 generates extended speech video data (i, k) related to speech by a speaker in facial image data (k) in which facial expression features have been converted so that the facial expression of the speaker in the speech video data (i) is converted to the facial expression of the speaker in the facial image data (k).
  • the data expansion unit 121 performs data expansion of the speech video data using a few-speaker speech video database including speech video data (i) by a few speakers and a multiple face image database including a large number of face image data (k). For example, the data expansion unit 121 generates expanded speech video data (i, k) having the facial expression of the speaker in the speech video data (i) and the individual characteristics of the speaker in the face image data (k) based on the speech video data (i) in the few-speaker speech video database and the face image data (k) in the multiple face image database.
  • the data expansion unit 121 generates each frame image of the expanded speech video data (i, k) corresponding to all combinations of each frame image of the speech video data (i) in the speech video database of the few speakers with impression labels and each face image data (k) in the multiple face image database, and generates expanded speech video data (i, k) having the facial expression of the speaker in the speech video data (i) and the individual characteristics of the speaker in the face image data (k).
  • the data expansion unit 121 generates the extended speech video data (i, k) using a facial expression conversion technique such as that described in the following reference 2.
  • the data expansion unit 121 may also perform the above-mentioned processing for each i and k.
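  • The all-combinations expansion over speech videos (i) and face images (k) described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `expand_speech_videos` and the `convert_frame` callable are hypothetical, and `convert_frame` stands in for whatever facial expression conversion technique is actually used.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple, TypeVar

Frame = TypeVar("Frame")
Face = TypeVar("Face")


def expand_speech_videos(
    speech_videos: Dict[int, List[Frame]],
    face_images: Dict[int, Face],
    convert_frame: Callable[[Frame, Face], Frame],
) -> Dict[Tuple[int, int], List[Frame]]:
    """Build extended speech video data (i, k) for every pair of i and k.

    convert_frame is a hypothetical placeholder: it should return a frame
    that keeps the expression of the input frame but shows the face k.
    """
    expanded: Dict[Tuple[int, int], List[Frame]] = {}
    for i, k in product(speech_videos, face_images):
        # Convert every frame of video i using face image k.
        expanded[(i, k)] = [convert_frame(f, face_images[k]) for f in speech_videos[i]]
    return expanded
```

  • With a few speech videos and many face images, the number of extended videos is the product of the two counts, which is how a small labeled database can be grown into a large one.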
  • the data construction unit 122 constructs a pair of the extended speech image data generated by the data expansion unit 121 and an impression label related to the impression of the speech as learning data for training a learning model.
  • the data construction unit 122 can construct speech video data of natural facial expressions of a large number of speakers as learning data at low cost, and therefore can convert the data into natural speech image data at low cost.
  • the data construction unit 122 may construct a pair of the extended speech video data generated by the data expansion unit 121 and an impression label as learning data.
  • the data construction unit 122 constructs an extended speech video database with impression labels, which includes extended speech video data (i, k) with impression labels, as a database for training data.
  • the data construction unit 122 adds a pair of extended speech video data (i, k) and an impression label related to the impression of the speech (extended speech video data with impression labels) to the extended speech video database with impression labels.
  • the database includes speech video data of natural facial expressions of many speakers.
  • the data construction unit 122 can convert the facial expressions of the speakers in the extended speech video data to natural facial expressions that match the individuality of the input speaker's face. Since a few-speaker speech video database and multiple face image data can be constructed at low cost, the data construction unit 122 can convert to natural speech image data at low cost.
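  • The pairing of extended speech video data (i, k) with impression labels can be sketched as follows. The function name `build_labeled_database` and the record layout are hypothetical, and the sketch assumes that each extended video (i, k) takes the impression label attached to its source speech video (i), which the source implies but does not state in these terms.

```python
from typing import Any, Dict, List, Tuple


def build_labeled_database(
    expanded: Dict[Tuple[int, int], Any],
    impression_labels: Dict[int, str],
) -> List[Dict[str, Any]]:
    """Pair each extended speech video (i, k) with an impression label.

    Assumption: the label of source speech video i carries over to every
    extended video (i, k) generated from it.
    """
    database = []
    for i, k in sorted(expanded):
        database.append(
            {
                "source_video": i,
                "face_image": k,
                "video": expanded[(i, k)],
                "impression_label": impression_labels[i],
            }
        )
    return database
```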
  • the learning unit 123 uses the learning data constructed by the data construction unit 122, the speech image data, and the face image data to train a learning model that takes the impression label, speech image data, and face image data as input and outputs the facial feature amount of the second speaker in the extended speech image data.
  • the learning unit 123 trains the learning model in the same manner as the reference technology, except that the learning model is trained using the extended speech video data with impression labels as learning data.
  • the learning unit 123 realizes conversion of the speaker's facial expression in the extended speech image data so that the facial expression matches the individuality of the speaker (input speaker) of the face image data input to the learning model, and can convert to natural speech image data at low cost.
  • FIG. 5 is a diagram showing an example of learning.
  • the learning unit 123 may train a learning model that uses the extended speech video data with impression labels, speech video data, and face image data constructed by the data construction unit 122 as inputs, and outputs the facial features of the second speaker in the extended speech video data.
  • the learning unit 123 uses the extended speech video data (i, k) with impression labels, the speech video data (i), and the facial image data (k) to train a learning model that takes the speech video data (i) with impression labels and the facial image data (k) as input and outputs the landmarks of the speaker's face in the extended speech video data (i, k).
  • This allows the learning unit 123 to train the learning model so that speech video data can be converted, at low cost, into natural speech video data in which the speaker's facial expression (facial expression feature) is converted into a natural expression (facial expression feature for impression conversion) that matches the individuality of the input speaker's face.
  • the estimation unit 124 estimates the facial feature of the second speaker in the extended speech image data based on the learning model learned by the learning unit 123. For example, as shown in FIG. 5, the estimation unit 124 estimates the coordinates (facial expression feature after conversion) of the speaker's facial landmark at each time of the impression-labeled extended speech video data (i, k) based on the learning model, and estimates the speaker's facial landmark at each time (facial expression feature for impression conversion). In addition, for example, the estimation unit 124 estimates output speech video data corresponding to the impression-labeled extended speech video data (i, k) in which the facial expression feature for impression conversion is reflected by rendering. Thus, in the example shown in FIG.
  • the estimation unit 124 estimates the facial feature of the second speaker in the extended speech video data and the output speech video data corresponding to the extended speech video data as the estimation target in the same manner as the reference technology.
  • the estimation unit 124 is realized, for example, by a converter that converts the facial expression feature of the speaker.
  • the output unit 13 outputs various data. For example, the output unit 13 outputs output speech video data corresponding to the extended speech video data (i, k) with impression labels estimated by the estimation unit 124.
  • the output unit 13 is realized by, for example, a display unit such as a display.
  • the storage unit 14 stores various data.
  • the storage unit 14 stores a few speaker speech video database with impression labels, a multiple face image database, a learning model, and an OS (Operating System) and various programs executed by the data construction device 10.
  • the storage unit 14 is realized by, for example, a storage device such as a hard disk drive (HDD), a solid state drive (SSD), or an optical disk, or a semiconductor memory capable of rewriting information such as a random access memory (RAM), a flash memory, or a non-volatile static random access memory (NVSRAM).
  • FIG. 6 is a flowchart showing an example of the flow of processing executed by the data construction system according to the first embodiment.
  • in step S11, the data acquisition unit 11 acquires speech image data relating to an utterance by a first speaker and face image data relating to the face of a second speaker. For example, when training a learning model, the data acquisition unit 11 acquires speech video data (i) from a few-speaker speech video database and acquires face image data (k) from a multiple face image database.
  • in step S12, the data expansion unit 121 generates extended speech image data relating to speech by the second speaker based on the speech image data and face image data acquired by the data acquisition unit 11.
  • the data expansion unit 121 generates extended speech video data (i, k) relating to speech by the speaker of the face image data, in which the facial expression of the speaker in the speech video data (i) is converted to the facial expression of the speaker in the face image data (k).
  • the data construction unit 122 constructs a pair of the extended speech image data generated by the data expansion unit 121 and an impression label related to the impression of the speech as learning data for training the learning model.
  • the data construction unit 122 constructs an extended speech video database with impression labels, which includes extended speech video data (i, k) with impression labels, as a database for learning data.
  • the learning unit 123 uses the learning data constructed by the data construction unit 122, the speech image data, and the face image data to train a learning model that takes the impression label, speech image data, and face image data as input and outputs the facial features of the second speaker in the extended speech image data.
  • for example, the learning unit 123 uses the extended speech video data (i, k) with impression labels, the speech video data (i), and the face image data (k) to train a learning model that takes the speech video data (i) with impression labels and the face image data (k) as input and outputs the speaker's facial landmarks in the extended speech video data (i, k).
  • the estimation unit 124 estimates the facial features of the second speaker in the extended speech image data based on the learning model learned by the learning unit 123. For example, the estimation unit 124 estimates the coordinates of the speaker's facial landmark at each time in the extended speech video data (i, k) with impression labels based on the learning model, and estimates output speech video data corresponding to the extended speech video data (i, k) with impression labels reflecting the speaker's facial landmark at each time.
  • the data construction system 1 estimates the coordinates of the landmark of the speaker's face at each time based on a learning model, and estimates output speech video data corresponding to the impression-labeled extended speech video data (i, k) reflecting the landmark of the speaker's face at each time. Since the position of this landmark includes two types of information, namely, the individuality of the face (average landmark position) and the change in facial expression (difference from the average landmark position), in the above example, the data construction system 1 estimates two different types of information simultaneously.
  • the data construction system 1X trains a learning model using the learning data, impression label, speech image data, and face image data, with the impression label, speech image data, and face image data as inputs, and with the feature amount of the facial expression change of the second speaker in the extended speech image data as output.
  • the data construction system 1X deletes facial individuality information by normalizing the average value of the landmark of the speaker's face for each speaker to 0 and the variance to 1, and creates a vector that expresses only the facial expression change.
  • the data construction system 1X estimates the feature amount of the facial expression change of the second speaker in the extended speech image data, such as a vector that expresses only the facial expression change.
  • the data construction system 1X simplifies the input/output relationship with the learning model, makes it easier to train the learning model, improves conversion accuracy, and enables the synthesis of speech video data and facial image data so as to obtain a facial expression that matches the individual with less training data.
  • FIG. 7 is a diagram showing an example of the configuration of a data construction system according to the second embodiment.
  • the data construction system 1X includes a data construction device 10X.
  • the data construction device 10X uses the learning data, the impression label, the speech image data, and the face image data to train a learning model that inputs the impression label, the speech image data, and the face image data, and outputs the feature amount of the change in the facial expression of the second speaker in the extended speech image data.
  • the data construction device 10X has a control unit 12X instead of the control unit 12 in the first embodiment. Except for this point, the data construction device 10X is similar to the data construction device 10 according to the first embodiment.
  • the control unit 12X further includes a calculation unit 125, and includes a learning unit 123X and an estimation unit 124X instead of the learning unit 123 and the estimation unit 124 in the first embodiment. Except for this point, the control unit 12X is similar to the control unit 12 in the first embodiment.
  • the calculation unit 125 calculates the feature amount of the change in the facial expression of the second speaker in the extended speech image data based on the learning data constructed by the data construction unit 122.
  • An example of the calculation by the calculation unit 125 will be described below with reference to Fig. 8 to Fig. 10.
  • Fig. 8 to Fig. 10 are diagrams for explaining an example of the calculation.
  • the calculation unit 125 extracts a facial landmark of each speaker from the extended speech video data of each speaker included in the extended speech video database with impression labels.
  • This extended speech video database with impression labels includes a speech video database with impression labels of each speaker, and the calculation unit 125 extracts, for example, a facial landmark of each speaker from the extended speech video data with impression labels of each speaker in the extended speech video database with impression labels of each speaker.
  • the calculation unit 125 calculates the landmark statistics of each speaker's face, namely the average value and variance of the coordinates of that speaker's facial landmarks, from all data corresponding to the speaker. These statistics mainly contain information about the individuality of each speaker. For example, because the statistics are calculated from data that includes various facial expressions of the speaker, the facial expression information contained in the speaker's facial landmark sequence (the time series of the speaker's facial features in the extended speech video data) is averaged out, leaving landmark statistics that mainly contain information about the individuality of the speaker.
  • t is the frame number corresponding to each time in the extended speech video data (i, k) with impression labels.
  • T is the last frame number of the extended speech video data (i, k) with impression labels.
  • the calculation unit 125 calculates the average value and variance of the coordinates of the landmarks of the kth speaker's face from the landmark features at the frame numbers corresponding to each time, over all data corresponding to the kth speaker.
  • the calculation unit 125 uses this average value and variance as the landmark statistics of the kth speaker's face.
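  • One way to write these statistics is shown below. The notation is an assumption made for illustration and does not reproduce the patent's own formulas, which are not present in this text: x_t^{(i,k)} denotes the landmark coordinate vector at frame t of the extended speech video data (i, k), frames are assumed to be numbered from 1 to T_{i,k}, N_k is the total number of frames for the kth speaker, the square is taken element-wise, and the population variance is used.

```latex
N_k = \sum_{i} T_{i,k}, \qquad
\mu_k = \frac{1}{N_k} \sum_{i} \sum_{t=1}^{T_{i,k}} x_t^{(i,k)}, \qquad
\sigma_k^2 = \frac{1}{N_k} \sum_{i} \sum_{t=1}^{T_{i,k}} \left( x_t^{(i,k)} - \mu_k \right)^2
```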
  • the calculation unit 125 calculates the landmark statistics of the face of each speaker, for example, in the same manner as in the above example.
  • the calculation unit 125 normalizes the landmarks of each speaker's face based on the average value and variance of the landmarks of that speaker's face. For example, the calculation unit 125 standardizes the landmarks of each speaker's face so that their average value is 0 and their variance is 1, and uses the normalized values as the facial expression change landmarks of each speaker.
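  • The per-speaker standardization described above can be sketched in plain Python as follows. The function names, the list-of-lists landmark layout, and the small constant `eps` (added to avoid division by zero for coordinates that never move) are assumptions made for illustration; `frames` is meant to hold all landmark frames of one speaker pooled across that speaker's videos.

```python
from math import sqrt
from typing import List, Tuple


def landmark_statistics(frames: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Per-coordinate mean and population variance over all frames of one speaker."""
    n = len(frames)
    dims = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dims)]
    return mean, var


def expression_change_landmarks(frames: List[List[float]], eps: float = 1e-8) -> List[List[float]]:
    """Standardize landmarks so each coordinate has mean 0 and variance about 1.

    Subtracting the speaker's mean removes the facial individuality component;
    what remains is the expression change relative to that speaker's average face.
    """
    mean, var = landmark_statistics(frames)
    return [
        [(f[d] - mean[d]) / sqrt(var[d] + eps) for d in range(len(mean))]
        for f in frames
    ]
```

  • Because the statistics are computed per speaker, two speakers with differently shaped faces but the same expression changes map to similar normalized sequences.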
  • the learning unit 123X may train a learning model that outputs the difference between the facial feature amount of the second speaker and the average facial feature amount of the second speaker as the feature amount of the facial change of the second speaker.
  • the learning unit 123X trains the learning model using the facial landmarks of each speaker, standardized by the calculation unit 125 with their average value and variance, as the facial expression change landmark sequence of each speaker. In this way, the learning unit 123X can train the learning model to estimate, as the feature amount of the facial expression change of the second speaker in the extended speech image data, a facial feature amount from which the individual information of the face has been removed and which expresses only the change in facial expression.
  • the learning unit 123X uses the facial expression change landmark sequence calculated by the calculation unit 125, the impression label, the input speech video data, the input facial image data, and the output speech video data to train a learning model that takes the impression label, input speech video data, and input facial image data as input and outputs the speaker's facial expression change landmark sequence in the output speech video data.
  • the learning unit 123X trains the learning model so that it can estimate, from the information contained in each speaker's facial landmark sequence, that speaker's facial expression change landmark sequence, which mainly contains facial expression information.
  • estimation unit 124X estimates a feature amount of a change in a facial expression of the second speaker in the extended speech image data based on the learning model learned by the learning unit 123X.
  • the estimation unit 124X performs estimation in the same manner as the data construction system 1 according to the first embodiment, except that the estimation targets are facial expression change landmarks and facial expression change landmark sequences.
  • the estimation unit 124X uses a learning model to estimate a facial expression change landmark and a facial expression change landmark sequence based on impression labels, video features extracted from video data of the input speech video data, and audio features extracted from audio data of the input speech video data.
  • the estimation unit 124X extracts landmarks of the input speaker's face in the video data of the input speech video data and calculates the landmark statistics of the input speaker's face to obtain the landmark statistics of the input speaker's face.
  • the estimation unit 124X performs decoding (reflecting individuality) based on, for example, the facial expression change landmark sequence and the landmark statistics of the input speaker's face in the input speaker image data to obtain a landmark sequence in which the facial expression change landmark sequence is added to the landmark statistics of the input speaker's face.
  • the estimation unit 124X performs rendering based on, for example, the landmark sequence and the input speaker image to obtain output speech video data that reflects the individuality (average landmark position) and facial expression changes (difference from the average landmark position) of the speaker's face in the input speaker image.
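The decoding step (reflecting individuality) can be sketched as the inverse of standardization. This is a sketch under stated assumptions: the text only says that the facial expression change landmark sequence is added to the landmark statistics of the input speaker's face, so scaling the change by the standard deviation before adding the mean is an assumption made here to mirror the standardization step. The function name `decode_landmarks`, the `eps` term, and all numeric values are hypothetical, and the rendering step is not covered.

```python
from math import sqrt

def decode_landmarks(change_seq, mean, var, eps=1e-8):
    """Re-attach speaker individuality to an expression-change sequence.

    Each output coordinate is mean + change * std, i.e. the inverse of
    standardization (the scaling by std is an assumption, see lead-in).
    """
    return [[mean[d] + frame[d] * sqrt(var[d] + eps) for d in range(len(mean))]
            for frame in change_seq]

# Hypothetical landmark statistics of the input speaker's face (one landmark: x, y).
input_mean = [100.0, 50.0]
input_var = [4.0, 1.0]

# Hypothetical estimated facial expression change landmark sequence (2 frames).
estimated_change_seq = [[-1.0, 0.0], [0.5, 2.0]]

landmark_seq = decode_landmarks(estimated_change_seq, input_mean, input_var)
```

The resulting `landmark_seq` is centered on the input speaker's average landmark positions, with the estimated expression change expressed as a displacement from that average.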
  • FIG. 12 is a flowchart showing an example of the flow of processing executed by the data construction system according to the second embodiment. Steps S21 to S23 are similar to steps S11 to S13 in the first embodiment.
  • the learning unit 123X uses the learning data, impression label, speech image data, and face image data constructed by the data construction unit 122 to train a learning model that takes the impression label, speech image data, and face image data as input, and outputs the feature amount of the facial expression change of the second speaker in the extended speech image data.
  • the learning unit 123X uses the facial expression change landmark sequence, impression label, input speech video data, input face image data, and output speech video data calculated by the calculation unit 125 to train a learning model that takes the impression label, input speech video data, and input face image data as input, and outputs the speaker's facial expression change landmark sequence in the output speech video data.
  • the estimation unit 124X estimates features of the facial expression changes of the second speaker in the extended speech image data based on the learning model learned by the learning unit 123X. For example, the estimation unit 124X uses the learning model to estimate a facial expression change landmark of the input speaker based on the impression label, the video features extracted from the video data in the input speech video data, and the audio features extracted from the audio data in the input speech video data.
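One way the three inputs to the learning model could be assembled is sketched below. The specification does not state how the impression label, video features, and audio features are combined, so the one-hot encoding, the per-frame concatenation, the label set `IMPRESSION_LABELS`, and the function names are all assumptions made for illustration only.

```python
# Hypothetical label set; the specification does not enumerate impression labels.
IMPRESSION_LABELS = ["bright", "calm", "confident"]

def one_hot(label, labels=IMPRESSION_LABELS):
    """Encode an impression label as a one-hot vector (assumed encoding)."""
    if label not in labels:
        raise ValueError(f"unknown impression label: {label}")
    return [1.0 if label == name else 0.0 for name in labels]

def build_model_inputs(impression_label, video_feats, audio_feats):
    """Concatenate label, video, and audio features for every frame.

    video_feats and audio_feats are lists with one feature vector per frame.
    """
    if len(video_feats) != len(audio_feats):
        raise ValueError("video and audio features must cover the same frames")
    label_vec = one_hot(impression_label)
    return [label_vec + v + a for v, a in zip(video_feats, audio_feats)]

# Toy example: 2 frames, 2 video features and 1 audio feature per frame.
inputs = build_model_inputs(
    "calm",
    video_feats=[[0.1, 0.2], [0.3, 0.4]],
    audio_feats=[[1.0], [2.0]],
)
```

A model trained on such per-frame vectors would then output one facial expression change landmark vector per frame; the model itself is left out because its architecture is not described in this section.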
  • the data construction device 10 and the data construction device 10X can be implemented by installing a data construction program in a computer as package software or online software.
  • the computer can function as the data construction device 10 and the data construction device 10X by executing the data construction program.
  • FIG. 13 is a diagram showing an example of the configuration of a computer that executes a data construction program.
  • the computer 1000 has a memory 1010, a CPU 1020, a hard disk drive interface 1030, a disk drive interface 1040, a serial port interface 1050, a video adapter 1060, and a network interface 1070. These components are connected by a bus 1080.
  • the memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012.
  • the ROM 1011 stores a boot program such as a BIOS (Basic Input Output System).
  • the hard disk drive interface 1030 is connected to a hard disk drive 1090.
  • the disk drive interface 1040 is connected to a disk drive 1100.
  • a removable storage medium such as a magnetic disk or optical disk is inserted into the disk drive 1100.
  • the serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120.
  • the video adapter 1060 is connected to, for example, a display 1130.
  • the hard disk drive 1090 stores, for example, an OS 1091, an application program 1092, a program module 1093, and program data 1094. That is, the programs that define the processes executed by the data construction device 10 and the data construction device 10X are implemented as program modules 1093 in which computer-executable code is written.
  • the program modules 1093 are stored, for example, in the hard disk drive 1090.
  • the program modules 1093 for executing processes similar to the functional configurations in the data construction device 10 and the data construction device 10X are stored in the hard disk drive 1090.
  • the hard disk drive 1090 may be replaced by an SSD.
  • the data used in the processing of the above-described embodiment is stored as program data 1094, for example, in memory 1010 or hard disk drive 1090. Then, the CPU 1020 reads the program module 1093 or program data 1094 stored in memory 1010 or hard disk drive 1090 into RAM 1012 as necessary and executes it.
  • the program module 1093 and program data 1094 are not limited to being stored in the hard disk drive 1090, but may be stored in a removable storage medium, for example, and read by the CPU 1020 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected via a network such as a LAN (Local Area Network) or a WAN (Wide Area Network). The program module 1093 and program data 1094 may then be read by the CPU 1020 from the other computer via the network interface 1070.
  • each component of each part shown in the figure is a functional concept, and does not necessarily have to be physically configured as shown in the figure.
  • the specific form of distribution and integration of the device is not limited to that shown in the figure, and all or a part of it can be functionally or physically distributed and integrated in any unit according to various loads, usage conditions, etc.
  • each processing function performed by the device can be realized in whole or in part by a CPU and a program executed by the CPU, or can be realized as hardware by wired logic.
  • the learning unit 123 and the estimation unit 124 shown in FIG. 2 can be realized as a learning device and an estimation device independent of the data construction device 10, respectively.
  • REFERENCE SIGNS LIST
  • 1 Data construction system
  • 10 Data construction device
  • 11 Data acquisition unit
  • 12 Control unit
  • 13 Output unit
  • 14 Storage unit
  • 121 Data expansion unit
  • 122 Data construction unit

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
PCT/JP2023/025022 2023-07-05 2023-07-05 Data construction device, data construction method, and data construction program Ceased WO2025009127A1 (ja)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2025530909A JPWO2025009127A1 2023-07-05 2023-07-05
PCT/JP2023/025022 WO2025009127A1 (ja) 2023-07-05 2023-07-05 Data construction device, data construction method, and data construction program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2023/025022 WO2025009127A1 (ja) 2023-07-05 2023-07-05 Data construction device, data construction method, and data construction program

Publications (1)

Publication Number Publication Date
WO2025009127A1 (ja) 2025-01-09

Family

ID=94171777

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/025022 Ceased WO2025009127A1 (ja) Data construction device, data construction method, and data construction program 2023-07-05 2023-07-05

Country Status (2)

Country Link
JP (1) JPWO2025009127A1
WO (1) WO2025009127A1

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
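WO2019146315A1 (ja) * 2018-01-29 2019-08-01 株式会社日立製作所 Anomaly detection system, anomaly detection method, and program
CN113807265A (zh) * 2021-09-18 2021-12-17 山东财经大学 Diversified face image synthesis method and system
JP2022175923A (ja) * 2021-05-14 2022-11-25 Aiインフルエンサー株式会社 Content reproduction method and content reproduction system
JP7207539B2 (ja) * 2019-06-20 2023-01-18 日本電信電話株式会社 Learning data augmentation device, learning data augmentation method, and program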

Also Published As

Publication number Publication date
JPWO2025009127A1 2025-01-09

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 23944375; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2025530909; Country of ref document: JP; Kind code of ref document: A)
WWE WIPO information: entry into national phase (Ref document number: 2025530909; Country of ref document: JP)
NENP Non-entry into the national phase (Ref country code: DE)