WO2024023902A1 - Information processing device, motion transfer method, and program - Google Patents

Information processing device, motion transfer method, and program Download PDF

Info

Publication number
WO2024023902A1
WO2024023902A1 (PCT/JP2022/028671)
Authority
WO
WIPO (PCT)
Prior art keywords
feature
feature data
video
data
gesture
Prior art date
Application number
PCT/JP2022/028671
Other languages
French (fr)
Japanese (ja)
Inventor
雄貴 蔵内
俊一 瀬古
隆二 山本
Original Assignee
日本電信電話株式会社
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 日本電信電話株式会社 (Nippon Telegraph and Telephone Corporation)
Priority to PCT/JP2022/028671 priority Critical patent/WO2024023902A1/en
Publication of WO2024023902A1 publication Critical patent/WO2024023902A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00Animation
    • G06T13/80 2D [Two Dimensional] animation, e.g. using sprites

Definitions

  • the present invention relates to an information processing device, a gesture transcription method, and a program.
  • Non-Patent Document 1 discloses a technique for extracting data indicating a specific gesture from video data of a person and transferring it to video data of another person in real time.
  • the disclosed technology aims to reduce the amount of video data required to transcribe gestures.
  • the disclosed technology is an information processing device including: a feature extraction unit configured to extract a plurality of feature data, each indicating a specific gesture, from gesture video data representing a video that includes gestures; a feature synthesis unit configured to synthesize the plurality of feature data; a control unit configured to receive a transfer request and select feature data corresponding to the transfer request from the synthesized feature data; and a feature transfer unit configured to transfer the selected feature data to input video data to generate output video data.
  • the amount of video data required to transcribe gestures can be kept small.
  • FIG. 1 is a diagram illustrating an example of the functional configuration of an information processing device according to Example 1 of an embodiment of the present invention.
  • FIG. 2 is a flowchart illustrating an example of the flow of feature transfer processing according to Example 1 of the embodiment of the present invention.
  • FIG. 3 is a diagram for explaining an overview of feature transfer processing according to Example 1 of the embodiment of the present invention.
  • FIG. 4 is a diagram for explaining a method for synthesizing feature data according to Example 1 of the embodiment of the present invention.
  • FIG. 5 is a diagram illustrating an example of the functional configuration of an information processing device according to Example 2 of the embodiment of the present invention.
  • FIG. 6 is a flowchart illustrating an example of the flow of feature transfer processing according to Example 2 of the embodiment of the present invention.
  • FIG. 7 is a diagram showing an example of the hardware configuration of a computer.
  • Example 1 and Example 2 will be described as specific examples of this embodiment.
  • Example 1 In this example, a plurality of feature data extracted from data representing a gesture video are combined, and the video is processed based on the combined feature data so that various gestures are reflected on the person appearing in the video.
  • FIG. 1 is a diagram illustrating an example of the functional configuration of an information processing apparatus according to Example 1 of the embodiment of the present invention.
  • the information processing device 10 includes a gesture video storage section 11, a feature extraction section 12, a feature synthesis section 13, a control section 14, an input video storage section 15, a feature transfer section 16, and an output video storage section 17.
  • the gesture video storage unit 11 stores data indicating a gesture video.
  • a gesture video is a pre-recorded video of a person's gestures.
  • Gestures are actions that convey emotions, intentions, and the like, such as facial expressions, blinking, nodding, posture, back-channel responses, and gaze.
  • the feature extraction unit 12 extracts a plurality of feature data from the data indicating the gesture video according to the content of the specific gesture.
  • the data to be extracted is extracted for each gesture content, such as feature data of "smile", feature data of "nod", and so on.
  • the feature synthesis unit 13 synthesizes the extracted plurality of feature data. For example, the feature synthesis unit 13 synthesizes the feature data of "smile" and the feature data of "nod" to generate feature data of "smile and nod", which is a combination of "smile" and "nod".
  • the feature data may be, for example, vector data indicating features. Therefore, the feature synthesis unit 13 may synthesize a plurality of feature data by vector synthesis.
  • the control unit 14 receives the transfer request and selects feature data corresponding to the transfer request from the synthesized feature data.
  • the transfer request is a request for transfer in which a specific gesture is designated by a user's operation or the like.
  • the control unit 14 may select feature data from either the combined feature data or the non-combined feature data. For example, the control unit 14 may select the feature data from any of the feature data of "smile", the feature data of "nod", and the feature data of "smile and nod".
  • the input video storage unit 15 stores data indicating input video.
  • the input video is a video of the user photographed by a photographing device such as a web camera.
  • the feature transfer unit 16 transfers the feature data output by the control unit 14 to the input video.
  • the feature transfer unit 16 transfers feature data of "smiling and nodding" to an input video of an expressionless user, thereby converting it into video data representing a smiling and nodding user, and outputs the video data.
  • the output video storage unit 17 stores the video data output by the feature transfer unit 16.
  • the information processing device 10 executes feature transfer processing in response to a user's operation or the like.
  • FIG. 2 is a flowchart showing an example of the flow of feature transfer processing according to Example 1 of the embodiment of the present invention.
  • the feature extraction unit 12 extracts a plurality of feature data from the gesture video (step S11).
  • the feature synthesis unit 13 synthesizes the plurality of extracted feature data (step S12).
  • the control unit 14 receives a transfer request through a user's operation or the like, it selects feature data corresponding to the transfer request from the synthesized feature data (step S13).
  • the feature transfer unit 16 transfers the feature data to the input video to generate an output video (step S14).
  • the generated output video is stored in the output video storage section 17.
  • the information processing device 10 outputs the generated output video (step S15).
  • FIG. 3 is a diagram for explaining an overview of feature transfer processing according to Example 1 of the embodiment of the present invention.
  • the feature data 101 is an example of feature data of "nod".
  • the feature data 101 is, for example, a feature vector characterized by conversion from a normal video 101a to a "nodding" video 101b.
  • the feature data 102 is an example of "smile" feature data.
  • the feature data 102 is, for example, a feature vector characterized by conversion from a normal video 102a to a "smile" video 102b.
  • the feature data 103 is an example of "smile and nod” feature data that is a combination of "smile” feature data and "nod” feature data.
  • the feature data 103 is, for example, a feature vector characterized by conversion from a normal video 103a to a "smile and nod" video 103b.
  • the normal video 101a, the normal video 102a, and the normal video 103a may be the same video or different videos.
  • the video 104 is an example of an input video.
  • Video 105 is an example of an output video.
  • a video 105 is generated that includes an image in which the person in the video 104 is smiling and nodding.
  • the person appearing in the input video and the person appearing in the gesture video may be the same person or different people. What appears in the input video or the gesture video may or may not be a person, and may be, for example, an animal other than a person, such as a dog or a cat.
  • FIG. 4 is a diagram for explaining a method for synthesizing feature data according to Example 1 of the embodiment of the present invention.
  • the input and output video data and the transferred feature data are each expressed as vector data (a video vector and a feature vector) through processing such as edge extraction applied to the video.
  • an input image 202a in which a person A is captured is characterized by an image vector 301a starting from the origin 201.
  • the input image 202b in which the person B is captured is characterized by an image vector 301b starting from the origin 201.
  • the feature vector 302a and the feature vector 302b may be the same vector.
  • the feature vector 303a and the feature vector 303b may be the same vector.
  • in step S12 of the feature transfer process described above, the feature synthesis unit 13 synthesizes, for example, the feature vector 302a and the feature vector 303a. The feature transfer unit 16 then transfers the combined feature vector to the input video 202a to generate the video 204a, and transfers it to the input video 202b to generate the video 204b.
  • by combining a plurality of feature data extracted from data representing a gesture video and processing the video based on the combined feature data, various gestures can be reflected on the person or other subject appearing in the video. Therefore, when multiple elements such as smiling and nodding are to be combined, videos corresponding to every combination of elements are not required, so the amount of video data required to transcribe gestures can be kept small.
  • Example 2 will be described below with reference to the drawings.
  • the second example differs from the first example in that an emotion is estimated based on the input video. Therefore, the following explanation of the second example focuses on the differences from the first example; parts having the same functional configuration as in the first example are given the same reference numerals used in the explanation of the first example, and their explanation is omitted.
  • This example addresses the following problem: when feature data extracted from a gesture video are transferred to an input video, the facial expression in the source gesture video (for example, the normal video 101a or the normal video 102a shown in FIG. 3) must match the facial expression in the input video. For example, if the source of the gesture video has a neutral expression and the destination has a smiling expression, it is sufficient for the input video to be expressionless; however, if the source of the gesture video has an angry expression and the destination has a smiling expression, the conversion may fail when the input video is expressionless.
  • emotions are estimated based on the input video, and feature data corresponding to the estimated emotions are synthesized.
  • FIG. 5 is a diagram illustrating an example of a functional configuration of an information processing device according to Example 2 of the embodiment of the present invention.
  • the information processing device 10 according to the present embodiment has a configuration in which an emotion estimation unit 18 is added to the information processing device 10 according to the first embodiment.
  • the emotion estimation unit 18 estimates emotions based on the input video. For example, the emotion estimation unit 18 may estimate what kind of emotion the person is feeling based on the facial expression of the person in the input video. For example, the emotion of joy is estimated based on an image of a person with a smiling face.
  • the feature synthesis unit 13 synthesizes feature data corresponding to the estimated emotion from among the plurality of extracted feature data.
  • FIG. 6 is a flowchart showing an example of the flow of feature transfer processing according to Example 2 of the embodiment of the present invention.
  • the feature extraction unit 12 extracts a plurality of feature data from the gesture video (step S21).
  • the emotion estimation unit 18 estimates emotions based on the input video (step S22).
  • the feature synthesis unit 13 synthesizes feature data corresponding to the estimated emotion from among the plurality of extracted feature data (step S23).
  • the control unit 14 receives a transfer request through a user's operation or the like, it selects feature data corresponding to the transfer request from the synthesized feature data (step S24).
  • the feature transfer unit 16 transfers the selected feature data to the input video to generate an output video (step S25).
  • the generated output video is stored in the output video storage section 17.
  • the information processing device 10 outputs the generated output video (step S26).
  • an emotion is estimated based on the input video, and feature data corresponding to the estimated emotion are synthesized.
  • This makes it possible to synthesize and use feature data suitable for the input video.
  • facial expressions can be appropriately converted using feature data based on a gesture video with the same facial expression as the facial expression of a person in the input video.
  • the information processing device 10 according to this embodiment is realized, for example, by the hardware configuration of a computer 500 shown in FIG. 7.
  • FIG. 7 is a diagram showing an example of the hardware configuration of the computer.
  • the computer in FIG. 7 includes a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are interconnected via a bus B.
  • a program that realizes processing on the computer is provided, for example, on a recording medium 1001 such as a CD-ROM or a memory card.
  • a recording medium 1001 such as a CD-ROM or a memory card.
  • the program is installed from the recording medium 1001 to the auxiliary storage device 1002 via the drive device 1000.
  • the program does not necessarily need to be installed from the recording medium 1001, and may be downloaded from another computer via a network.
  • the auxiliary storage device 1002 stores installed programs as well as necessary files, data, and the like.
  • the memory device 1003 reads and stores the program from the auxiliary storage device 1002 when there is an instruction to start the program.
  • the CPU 1004 implements functions related to the device according to programs stored in the memory device 1003.
  • the interface device 1005 is used as an interface for connecting to a network.
  • a display device 1006 displays a GUI (Graphical User Interface) and the like based on a program.
  • the input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used to input various operation instructions.
  • An output device 1008 outputs the calculation result.
  • the computer may include a GPU (Graphics Processing Unit) or a TPU (Tensor Processing Unit) instead of the CPU 1004, or may include a GPU or a TPU in addition to the CPU 1004. In that case, the processing may be divided and executed such that the GPU or TPU executes processing that requires special calculations, and the CPU 1004 executes other processing.
  • the information processing device 10 is realized by reading a program for causing the computer 500 to execute each of the above-described processes, and executing the processes specified in the program.
  • the program may be recorded on the recording medium 503a or the like, or may be provided through a network.
  • This specification describes at least the information processing device, gesture transcription method, and program described in the following sections.
  • (Section 1) a feature extraction unit configured to extract a plurality of feature data indicating a specific gesture from gesture video data indicating an image including the gesture; a feature synthesis unit configured to synthesize the plurality of feature data; a control unit configured to receive a transfer request and select feature data corresponding to the transfer request from the synthesized feature data; a feature transfer unit configured to transfer the selected feature data to input video data to generate output video data; Information processing device.
  • the feature synthesis unit is configured to synthesize a plurality of vector data indicating the plurality of feature data by vector synthesis.
  • the information processing device according to item 1.
  • (Section 3) further comprising an emotion estimation unit configured to estimate an emotion based on the input video data,
  • the feature synthesis unit is configured to synthesize feature data corresponding to the estimated emotion from among the plurality of extracted feature data.
  • the information processing device according to item 1 or 2.
  • (Section 4) A gesture transcription method performed by a computer, the method comprising: extracting a plurality of feature data indicating a specific gesture from gesture video data indicating an image including the gesture; a step of synthesizing the plurality of feature data; receiving a transcription request and selecting feature data corresponding to the transcription request from the synthesized feature data; transcribing the selected feature data to input video data to generate output video data; Gesture transcription method.
  • (Section 5) A program for causing a computer to function as each part of the information processing apparatus according to any one of items 1 to 3.
  • 10 Information processing device 11 Gesture video storage unit 12 Feature extraction unit 13 Feature synthesis unit 14 Control unit 15 Input video storage unit 16 Feature transfer unit 17 Output video storage unit 18 Emotion estimation unit 1000 Drive device 1001 Recording medium 1002 Auxiliary storage device 1003 Memory device 1004 CPU 1005 Interface device 1006 Display device 1007 Input device 1008 Output device

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

An information processing device comprising: a feature extraction unit configured to extract, from motion video data representing a video that includes motions, a plurality of feature data each indicating a specific motion; a feature synthesis unit configured to synthesize the plurality of feature data; a control unit configured to select, upon receiving a transfer request, feature data corresponding to the transfer request from the synthesized feature data; and a feature transfer unit configured to transfer the selected feature data to input video data to generate output video data.

Description

Information processing device, gesture transcription method, and program
The present invention relates to an information processing device, a gesture transcription method, and a program.
There is a known technology that converts video data of a person into video data in which the person makes a specific gesture such as nodding or smiling. For example, Non-Patent Document 1 discloses a technique for extracting data indicating a specific gesture from video data of one person and transferring it in real time to video data of another person.
In a video conference or the like, adding gestures such as facial expressions, blinking, nodding, posture, back-channel responses, and gaze to the video of a participant can help build smooth interpersonal relationships and facilitate the progress of the meeting. With the conventional technology, however, video data showing the gestures of the source person is simply transferred to the target video data in chronological order; when multiple elements such as smiling and nodding are to be combined, videos corresponding to the number of combinations of elements are required, so the amount of required video data becomes large.
The disclosed technology aims to keep small the amount of video data required to transcribe gestures.
The disclosed technology is an information processing device including: a feature extraction unit configured to extract a plurality of feature data, each indicating a specific gesture, from gesture video data representing a video that includes gestures; a feature synthesis unit configured to synthesize the plurality of feature data; a control unit configured to receive a transfer request and select feature data corresponding to the transfer request from the synthesized feature data; and a feature transfer unit configured to transfer the selected feature data to input video data to generate output video data.
The amount of video data required to transcribe gestures can be kept small.
FIG. 1 is a diagram illustrating an example of the functional configuration of an information processing device according to Example 1 of an embodiment of the present invention.
FIG. 2 is a flowchart illustrating an example of the flow of feature transfer processing according to Example 1 of the embodiment of the present invention.
FIG. 3 is a diagram for explaining an overview of feature transfer processing according to Example 1 of the embodiment of the present invention.
FIG. 4 is a diagram for explaining a method for synthesizing feature data according to Example 1 of the embodiment of the present invention.
FIG. 5 is a diagram illustrating an example of the functional configuration of an information processing device according to Example 2 of the embodiment of the present invention.
FIG. 6 is a flowchart illustrating an example of the flow of feature transfer processing according to Example 2 of the embodiment of the present invention.
FIG. 7 is a diagram showing an example of the hardware configuration of a computer.
Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiment described below is merely an example, and embodiments to which the present invention is applied are not limited to the following embodiment.
Hereinafter, Example 1 and Example 2 will be described as specific examples of the present embodiment.
(Example 1)
In this example, a plurality of feature data extracted from data representing a gesture video are combined, and the video is processed based on the combined feature data so that various gestures are reflected on the person or other subject appearing in the video.
FIG. 1 is a diagram illustrating an example of the functional configuration of an information processing device according to Example 1 of the embodiment of the present invention. The information processing device 10 according to this example includes a gesture video storage section 11, a feature extraction section 12, a feature synthesis section 13, a control section 14, an input video storage section 15, a feature transfer section 16, and an output video storage section 17.
The gesture video storage unit 11 stores data representing a gesture video. A gesture video is a pre-recorded video of a person's gestures. Gestures are actions that convey emotions, intentions, and the like, such as facial expressions, blinking, nodding, posture, back-channel responses, and gaze.
The feature extraction unit 12 extracts a plurality of feature data from the data representing the gesture video according to the content of specific gestures. The data are extracted for each gesture content, such as feature data of "smile" and feature data of "nod".
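As a rough illustration only: the publication does not specify how feature data are computed, but FIG. 3 characterizes each feature as the conversion from a normal video to a gesture video, which suggests a difference between embeddings. The `encode()` helper below is a hypothetical stand-in, not part of the disclosure.

```python
import numpy as np

def encode(clip: np.ndarray) -> np.ndarray:
    """Hypothetical embedding of a video clip (T, H, W, C) into a fixed-length vector.
    A real system would use a learned encoder; this stand-in just averages frames."""
    return clip.reshape(clip.shape[0], -1).mean(axis=0)

def extract_feature(neutral_clip: np.ndarray, gesture_clip: np.ndarray) -> np.ndarray:
    """Feature data for one gesture: the vector that converts the normal (neutral)
    video into the gesture video (cf. feature data 101 and 102 in FIG. 3)."""
    return encode(gesture_clip) - encode(neutral_clip)
```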
The feature synthesis unit 13 synthesizes the plurality of extracted feature data. For example, the feature synthesis unit 13 synthesizes the feature data of "smile" and the feature data of "nod" to generate feature data of "smile and nod", which is a combination of "smile" and "nod". The feature data may be, for example, vector data indicating features, in which case the feature synthesis unit 13 may synthesize the plurality of feature data by vector synthesis.
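Under the vector-data reading mentioned above, the synthesis step can be pictured as plain vector addition. This is a sketch of one possible interpretation, not the only form of vector synthesis.

```python
import numpy as np

def synthesize(*feature_vectors: np.ndarray) -> np.ndarray:
    """Combine per-gesture feature vectors, e.g. "smile" + "nod" -> "smile and nod".
    Simple summation; weighted combinations would also qualify as vector synthesis."""
    return np.sum(feature_vectors, axis=0)

# e.g. smile_and_nod = synthesize(smile_vec, nod_vec)
```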
Upon receiving a transfer request, the control unit 14 selects feature data corresponding to the transfer request from the synthesized feature data. A transfer request is a request for transfer in which a specific gesture is designated by a user's operation or the like. Note that the control unit 14 may select feature data from either the combined feature data or the non-combined feature data. For example, the control unit 14 may select the feature data from any of the feature data of "smile", the feature data of "nod", and the feature data of "smile and nod".
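One way to picture the selection step, assuming the feature data (combined and non-combined) are kept in a dictionary keyed by gesture label; the labels and storage layout are illustrative, not taken from the publication.

```python
import numpy as np

# Illustrative store; in practice the vectors come from the extraction and synthesis steps.
feature_store = {
    "smile": np.zeros(128),          # non-combined feature data
    "nod": np.zeros(128),            # non-combined feature data
    "smile and nod": np.zeros(128),  # combined feature data
}

def select_feature(transfer_request: str) -> np.ndarray:
    """Control unit 14: pick the feature data matching the requested gesture."""
    return feature_store[transfer_request]
```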
The input video storage unit 15 stores data representing the input video. The input video is a video of the user captured by an image-capturing device such as a web camera.
The feature transfer unit 16 transfers the feature data output by the control unit 14 to the input video. For example, the feature transfer unit 16 transfers the feature data of "smile and nod" to an input video showing an expressionless user, thereby converting it into video data representing a user who smiles and nods, and outputs the video data.
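A sketch of the transfer step under the same vector reading as FIG. 4: the selected feature vector is added to the encoded input video, and a decoder turns the result back into frames. Both `encode` and `decode` are hypothetical hooks; the publication does not commit to a particular generative model.

```python
import numpy as np

def transfer(input_clip: np.ndarray, feature_vec: np.ndarray, encode, decode) -> np.ndarray:
    """Feature transfer unit 16: reflect the selected gesture onto the input video.
    `encode` maps frames to a video vector, `decode` maps a vector back to frames."""
    video_vec = encode(input_clip)           # video vector of the input (cf. 301a in FIG. 4)
    return decode(video_vec + feature_vec)   # e.g. expressionless user -> smiling, nodding user
```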
The output video storage unit 17 stores the video data output by the feature transfer unit 16.
Next, the operation of the information processing device 10 according to this example will be described. The information processing device 10 executes the feature transfer processing in response to a user's operation or the like.
FIG. 2 is a flowchart showing an example of the flow of the feature transfer processing according to Example 1 of the embodiment of the present invention. When the feature transfer processing starts, the feature extraction unit 12 extracts a plurality of feature data from the gesture video (step S11).
Subsequently, the feature synthesis unit 13 synthesizes the plurality of extracted feature data (step S12). When the control unit 14 receives a transfer request through a user's operation or the like, it selects the feature data corresponding to the transfer request from the synthesized feature data (step S13).
Next, the feature transfer unit 16 transfers the feature data to the input video to generate an output video (step S14). The generated output video is stored in the output video storage section 17. The information processing device 10 then outputs the generated output video (step S15).
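Read together, steps S11 to S15 amount to the following orchestration. This is a sketch built on the hypothetical helpers above; the storage sections 11, 15, and 17 are modelled simply as in-memory variables, and the gesture labels are examples.

```python
def feature_transfer_process(gesture_clips, neutral_clip, input_clip,
                             transfer_request, encode, decode):
    # S11: extract feature data per gesture content
    features = {name: encode(clip) - encode(neutral_clip)
                for name, clip in gesture_clips.items()}
    # S12: synthesize, e.g. "smile" + "nod" -> "smile and nod"
    features["smile and nod"] = features["smile"] + features["nod"]
    # S13: select the feature data matching the transfer request
    selected = features[transfer_request]
    # S14: transfer the selected feature data onto the input video
    output_clip = decode(encode(input_clip) + selected)
    # S15: output (here simply returned; the device would store it in section 17)
    return output_clip
```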
FIG. 3 is a diagram for explaining an overview of the feature transfer processing according to Example 1 of the embodiment of the present invention. The feature data 101 is an example of feature data of "nod". The feature data 101 is, for example, a feature vector that characterizes the conversion from a normal video 101a to a "nodding" video 101b.
The feature data 102 is an example of feature data of "smile". The feature data 102 is, for example, a feature vector that characterizes the conversion from a normal video 102a to a "smile" video 102b.
The feature data 103 is an example of feature data of "smile and nod", which is a combination of the feature data of "smile" and the feature data of "nod". The feature data 103 is, for example, a feature vector that characterizes the conversion from a normal video 103a to a "smile and nod" video 103b.
Here, the normal video 101a, the normal video 102a, and the normal video 103a may be the same video or different videos.
The video 104 is an example of the input video, and the video 105 is an example of the output video. When the feature data 103, which has the characteristics of "smile and nod", is transferred to the video 104, a video 105 is generated that shows the person in the video 104 smiling and nodding.
Here, the person appearing in the input video and the person appearing in the gesture video may be the same person or different people. The subject of the input video or the gesture video does not have to be a person; it may be, for example, an animal other than a person, such as a dog or a cat.
FIG. 4 is a diagram for explaining a method for synthesizing feature data according to Example 1 of the embodiment of the present invention. The input and output video data and the transferred feature data are each expressed as vector data (a video vector and a feature vector) through processing such as edge extraction applied to the video. For example, an input video 202a showing a person A is characterized by a video vector 301a starting from the origin 201.
When the feature vector 302a, which has the characteristics of "smile", is reflected in the input video 202a, a video 203a of the person A smiling is generated. When the feature vector 303a, which has the characteristics of "nod", is then reflected in the video 203a, a video 204a of the person A smiling and nodding is generated.
Similarly, an input video 202b showing a person B is characterized by a video vector 301b starting from the origin 201.
When the feature vector 302b, which has the characteristics of "smile", is reflected in the input video 202b, a video 203b of the person B smiling is generated. When the feature vector 303b, which has the characteristics of "nod", is then reflected in the video 203b, a video 204b of the person B smiling and nodding is generated.
Here, the feature vector 302a and the feature vector 302b may be the same vector. Similarly, the feature vector 303a and the feature vector 303b may be the same vector.
In step S12 of the feature transfer processing described above, the feature synthesis unit 13 synthesizes, for example, the feature vector 302a and the feature vector 303a. The feature transfer unit 16 then transfers the combined feature vector to the input video 202a to generate the video 204a, and transfers it to the input video 202b to generate the video 204b.
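The point of FIG. 4, read in vector terms, is that applying the combined vector is equivalent to applying the individual vectors in sequence, and that the same combined vector can be reused for different people. A small numerical check with made-up vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
img_a, img_b = rng.normal(size=128), rng.normal(size=128)  # video vectors 301a, 301b
smile, nod = rng.normal(size=128), rng.normal(size=128)    # feature vectors 302*, 303*

combined = smile + nod  # step S12
# Sequential transfer equals transfer of the combined vector (videos 204a, 204b)
assert np.allclose((img_a + smile) + nod, img_a + combined)
assert np.allclose((img_b + smile) + nod, img_b + combined)
```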
According to this example, by combining a plurality of feature data extracted from data representing a gesture video and processing the video based on the combined feature data, various gestures can be reflected on the person or other subject appearing in the video. Therefore, when multiple elements such as smiling and nodding are to be combined, videos corresponding to every combination of elements are not required, so the amount of video data required to transcribe gestures can be kept small.
(Example 2)
Example 2 will be described below with reference to the drawings. Example 2 differs from Example 1 in that an emotion is estimated based on the input video. The following description of Example 2 therefore focuses on the differences from Example 1; components having the same functional configuration as in Example 1 are given the same reference numerals as in the description of Example 1, and their description is omitted.
This example addresses the following problem. When feature data extracted based on a gesture video are transferred to an input video, the facial expression in the source gesture video (for example, the normal video 101a or the normal video 102a shown in FIG. 3) must match the facial expression in the input video. For example, if the source of the gesture video has a neutral expression and the destination has a smiling expression, it is sufficient for the input video to be expressionless; however, if the source of the gesture video has an angry expression and the destination has a smiling expression, the conversion may fail when the input video is expressionless.
Therefore, in this example, an emotion is estimated based on the input video, and feature data corresponding to the estimated emotion are synthesized.
FIG. 5 is a diagram illustrating an example of the functional configuration of an information processing device according to Example 2 of the embodiment of the present invention. The information processing device 10 according to this example has a configuration in which an emotion estimation unit 18 is added to the information processing device 10 according to Example 1.
The emotion estimation unit 18 estimates an emotion based on the input video. For example, the emotion estimation unit 18 may estimate what emotion a person is feeling from the facial expression of the person in the input video. For example, the emotion of joy is estimated from a video showing a person with a smiling expression.
The feature synthesis unit 13 according to this example synthesizes, from among the plurality of extracted feature data, the feature data corresponding to the estimated emotion.
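A sketch of how Example 2 might condition synthesis on the estimated emotion. The `estimate_emotion()` helper is a stand-in (the publication only says that an emotion is estimated from, for example, the facial expression in the input video), and keying the feature data by the source expression of the gesture video is an illustrative reading.

```python
def estimate_emotion(input_clip) -> str:
    """Hypothetical emotion estimator, e.g. an expression classifier returning
    labels such as "neutral", "joy", or "anger"."""
    return "joy"  # placeholder result

def synthesize_for_emotion(features_by_emotion: dict, input_clip, gestures: list):
    """Steps S22-S23: combine only feature data whose source gesture video matches
    the emotion estimated from the input video."""
    emotion = estimate_emotion(input_clip)      # step S22
    matching = features_by_emotion[emotion]     # e.g. {"smile": vec, "nod": vec, ...}
    return sum(matching[g] for g in gestures)   # step S23, e.g. gestures=["smile", "nod"]
```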
FIG. 6 is a flowchart showing an example of the flow of the feature transfer processing according to Example 2 of the embodiment of the present invention. When the feature transfer processing starts, the feature extraction unit 12 extracts a plurality of feature data from the gesture video (step S21).
Next, the emotion estimation unit 18 estimates an emotion based on the input video (step S22). Subsequently, the feature synthesis unit 13 synthesizes, from among the plurality of extracted feature data, the feature data corresponding to the estimated emotion (step S23). When the control unit 14 receives a transfer request through a user's operation or the like, it selects the feature data corresponding to the transfer request from the synthesized feature data (step S24).
The feature transfer unit 16 then transfers the selected feature data to the input video to generate an output video (step S25). The generated output video is stored in the output video storage section 17. The information processing device 10 then outputs the generated output video (step S26).
According to this example, an emotion is estimated based on the input video, and feature data corresponding to the estimated emotion are synthesized. This makes it possible to synthesize and use feature data suited to the input video. For example, facial expressions can be converted appropriately by using feature data based on a gesture video whose facial expression matches that of the person in the input video.
<Hardware configuration>
Finally, the hardware configuration of the information processing device 10 according to the present embodiment will be described. The information processing device 10 according to the present embodiment is realized, for example, by the hardware configuration of a computer 500 shown in FIG. 7.
FIG. 7 is a diagram showing an example of the hardware configuration of the computer. The computer in FIG. 7 includes a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, and the like, which are interconnected via a bus B.
A program that realizes the processing on the computer is provided, for example, on a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program does not necessarily need to be installed from the recording medium 1001, and may be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program as well as necessary files, data, and the like.
The memory device 1003 reads the program from the auxiliary storage device 1002 and stores it when an instruction to start the program is given. The CPU 1004 implements the functions of the device according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network. The display device 1006 displays a GUI (Graphical User Interface) and the like based on the program. The input device 1007 is composed of a keyboard and mouse, buttons, a touch panel, or the like, and is used to input various operation instructions. The output device 1008 outputs calculation results. Note that the computer may include a GPU (Graphics Processing Unit) or a TPU (Tensor Processing Unit) instead of the CPU 1004, or may include a GPU or a TPU in addition to the CPU 1004. In that case, the processing may be divided so that the GPU or TPU executes processing that requires special calculations and the CPU 1004 executes the other processing.
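As a rough illustration of the CPU/GPU division mentioned above, assuming PyTorch as the framework (the publication does not name one): the numerically heavy transfer step can be placed on the accelerator when one is available, while selection, I/O, and other control flow remain on the CPU.

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def transfer_on_accelerator(video_vec: torch.Tensor, feature_vec: torch.Tensor) -> torch.Tensor:
    """Run the vector arithmetic of the transfer on the GPU if present,
    then move the result back to the CPU for the rest of the pipeline."""
    return (video_vec.to(device) + feature_vec.to(device)).cpu()
```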
The information processing device 10 according to the present embodiment is realized by reading a program for causing the computer 500 to execute each of the processes described above and executing the processing specified in the program. The program may be recorded on the recording medium 503a or the like, or may be provided through a network.
(Summary of embodiments)
This specification describes at least the information processing device, gesture transcription method, and program described in the following items.
(Item 1)
An information processing device comprising:
a feature extraction unit configured to extract a plurality of feature data, each indicating a specific gesture, from gesture video data representing a video that includes gestures;
a feature synthesis unit configured to synthesize the plurality of feature data;
a control unit configured to receive a transfer request and select feature data corresponding to the transfer request from the synthesized feature data; and
a feature transfer unit configured to transfer the selected feature data to input video data to generate output video data.
(Item 2)
The information processing device according to Item 1, wherein the feature synthesis unit is configured to synthesize a plurality of vector data representing the plurality of feature data by vector synthesis.
(Item 3)
The information processing device according to Item 1 or 2, further comprising an emotion estimation unit configured to estimate an emotion based on the input video data, wherein the feature synthesis unit is configured to synthesize, from among the plurality of extracted feature data, feature data corresponding to the estimated emotion.
(Item 4)
A gesture transcription method performed by a computer, the method comprising:
extracting a plurality of feature data, each indicating a specific gesture, from gesture video data representing a video that includes gestures;
synthesizing the plurality of feature data;
receiving a transfer request and selecting feature data corresponding to the transfer request from the synthesized feature data; and
transferring the selected feature data to input video data to generate output video data.
(Item 5)
A program for causing a computer to function as each unit of the information processing device according to any one of Items 1 to 3.
Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention as set forth in the claims.
10 Information processing device
11 Gesture video storage unit
12 Feature extraction unit
13 Feature synthesis unit
14 Control unit
15 Input video storage unit
16 Feature transfer unit
17 Output video storage unit
18 Emotion estimation unit
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device
1008 Output device

Claims (5)

  1.  An information processing device comprising:
      a feature extraction unit configured to extract a plurality of feature data, each indicating a specific gesture, from gesture video data representing a video that includes gestures;
      a feature synthesis unit configured to synthesize the plurality of feature data;
      a control unit configured to receive a transfer request and select feature data corresponding to the transfer request from the synthesized feature data; and
      a feature transfer unit configured to transfer the selected feature data to input video data to generate output video data.
  2.  The information processing device according to claim 1, wherein the feature synthesis unit is configured to synthesize a plurality of vector data representing the plurality of feature data by vector synthesis.
  3.  The information processing device according to claim 1, further comprising an emotion estimation unit configured to estimate an emotion based on the input video data,
      wherein the feature synthesis unit is configured to synthesize, from among the plurality of extracted feature data, feature data corresponding to the estimated emotion.
  4.  A gesture transcription method performed by a computer, the method comprising:
      extracting a plurality of feature data, each indicating a specific gesture, from gesture video data representing a video that includes gestures;
      synthesizing the plurality of feature data;
      receiving a transfer request and selecting feature data corresponding to the transfer request from the synthesized feature data; and
      transferring the selected feature data to input video data to generate output video data.
  5.  A program for causing a computer to function as each unit of the information processing device according to any one of claims 1 to 3.
PCT/JP2022/028671 2022-07-25 2022-07-25 Information processing device, motion transfer method, and program WO2024023902A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/028671 WO2024023902A1 (en) 2022-07-25 2022-07-25 Information processing device, motion transfer method, and program

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2022/028671 WO2024023902A1 (en) 2022-07-25 2022-07-25 Information processing device, motion transfer method, and program

Publications (1)

Publication Number Publication Date
WO2024023902A1 true WO2024023902A1 (en) 2024-02-01

Family

ID=89705763

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/028671 WO2024023902A1 (en) 2022-07-25 2022-07-25 Information processing device, motion transfer method, and program

Country Status (1)

Country Link
WO (1) WO2024023902A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
CN110647780A (en) * 2018-06-07 2020-01-03 东方联合动画有限公司 Data processing method and system
CN112116684A (en) * 2020-08-05 2020-12-22 中国科学院信息工程研究所 Image processing method, device, equipment and computer readable storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190295302A1 (en) * 2018-03-22 2019-09-26 Northeastern University Segmentation Guided Image Generation With Adversarial Networks
CN110647780A (en) * 2018-06-07 2020-01-03 东方联合动画有限公司 Data processing method and system
CN112116684A (en) * 2020-08-05 2020-12-22 中国科学院信息工程研究所 Image processing method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GAFNI ORAN, ASHUAL ORON, WOLF LIOR: "Single-Shot Freestyle Dance Reenactment", CVPR 2021, 18 June 2021 (2021-06-18), pages 1 - 15, XP093133488 *

Similar Documents

Publication Publication Date Title
US9501663B1 (en) Systems and methods for videophone identity cloaking
US20220150285A1 (en) Communication assistance system, communication assistance method, communication assistance program, and image control program
JP2021144679A (en) System, computer implemented method, program for predicting vision-based joint action and posture motion
US9852358B2 (en) Information processing device, information processing method, and information processing system
Chiu et al. Gesture generation with low-dimensional embeddings
CN111383307A (en) Video generation method and device based on portrait and storage medium
KR102448382B1 (en) Electronic device for providing image related with text and operation method thereof
JP2011217098A (en) Information processing system, conference management device, information processing method, method for controlling conference management device, and program
CN111401101A (en) Video generation system based on portrait
US20220375224A1 (en) Device and method for generating speech video along with landmark
US20240013462A1 (en) Audio-driven facial animation with emotion support using machine learning
JP7370525B2 (en) Video distribution system, video distribution method, and video distribution program
Qi et al. Diverse 3d hand gesture prediction from body dynamics by bilateral hand disentanglement
CN111443854A (en) Action processing method, device and equipment based on digital person and storage medium
WO2024023902A1 (en) Information processing device, motion transfer method, and program
CN114567693A (en) Video generation method and device and electronic equipment
CN114902258A (en) Communication support system and communication support program
EP3923149A1 (en) Information processing device and information processing method
US20230215296A1 (en) Method, computing device, and non-transitory computer-readable recording medium to translate audio of video into sign language through avatar
WO2022244146A1 (en) Information processing device, motion transfer method, and program
JP2000089660A (en) Sign language study supporting device and recording medium with sign language study supporting program recorded therein
KR20220003389A (en) Method and apparatus for learning key point of based neural network
Tan et al. EDTalk: Efficient Disentanglement for Emotional Talking Head Synthesis
KR102584484B1 (en) Apparatus and method for generating speech synsthesis image
KR102601159B1 (en) Virtual human interaction generating device and method therof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22953000

Country of ref document: EP

Kind code of ref document: A1