WO2023167212A1 - Computer program, information processing method, and information processing device - Google Patents

Computer program, information processing method, and information processing device

Info

Publication number
WO2023167212A1
Authority
WO
WIPO (PCT)
Prior art keywords
avatar
information processing
information
text
computer program
Prior art date
Application number
PCT/JP2023/007458
Other languages
French (fr)
Japanese (ja)
Inventor
公之 茶谷
雅丈 豊田
タン マウンマウン
康貴 朝倉
直樹 千葉
Original Assignee
株式会社KPMG Ignition Tokyo
Priority date
Filing date
Publication date
Application filed by 株式会社KPMG Ignition Tokyo
Publication of WO2023167212A1 publication Critical patent/WO2023167212A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output

Definitions

  • the present invention relates to a computer program, an information processing method, and an information processing apparatus for generating content data for presentation.
  • Patent Document 1 proposes a presentation system related to large-scale repair work for collective housing.
  • This presentation system uses illustrated material data consisting of a combination of text and still images, survey situation video data recording the actual preliminary survey of the construction target property, and simulated experience video data recording the actual construction situation of a pseudo-construction property similar to the construction target property.
  • an illustrated material signal is generated based on the illustrated material data and transmitted to the display device, and a survey situation video signal and a simulated experience video signal are generated based on the survey situation video data and the simulated experience video data and transmitted to the display device.
  • when a presenter makes a presentation to a plurality of audiences, such as a presentation to a customer, the presenter conventionally prepares in advance presentation materials composed of images of multiple pages, sequentially displays the created presentation materials on a display or projector, and explains the information on the displayed page and the like. In recent years, presenters have devised various techniques, such as outputting moving images and sound in addition to displaying still images at the time of presentation. Such a presentation, however, requires the presenter to prepare various data such as still images, moving images, and voices in advance, which is not something that anyone can easily do.
  • the present invention has been made in view of such circumstances, and its object is to provide a computer program, an information processing method, and an information processing apparatus that can be expected to support the generation of content data for presentation.
  • a computer program causes a computer to execute a process of acquiring presentation materials, acquiring text information related to presentation audio, receiving settings related to a presenter's avatar, and generating content data in which the avatar is displayed together with the presentation materials and utters a voice corresponding to the text information.
  • it can be expected to support generation of content data for presentation.
  • FIG. 1 is a schematic diagram for explaining an overview of an information processing system according to an embodiment;
  • FIG. 2 is a schematic diagram for explaining an overview of an information processing system according to an embodiment;
  • FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to an embodiment;
  • FIG. 4 is a flow chart showing the procedure of content data generation processing performed by the information processing apparatus according to the present embodiment;
  • FIG. 5 is a schematic diagram showing an example of an utterance editing screen;
  • FIG. 6 is a schematic diagram for explaining a pronunciation correcting operation;
  • FIG. 7 is a schematic diagram showing an example of a pronunciation correction dialog box;
  • FIG. 8 is a schematic diagram showing an example of a content editing screen;
  • FIG. 9 is a schematic diagram showing an example of a content editing screen provided with a caption setting area;
  • FIG. 10 is a schematic diagram showing another example of the utterance editing screen;
  • FIG. 11 is a schematic diagram for explaining an example of a camerawork setting method;
  • FIG. 12 is a schematic diagram for explaining an example of a camerawork setting method;
  • FIG. 13 is a schematic diagram for explaining an example of a camerawork setting method;
  • FIG. 14 is a schematic diagram for explaining an overview of sound image localization technology;
  • FIG. 15 is a schematic diagram for explaining an example of a content switching setting method;
  • These are materials related to the information processing system according to the present embodiment.
  • <System overview> FIGS. 1 and 2 are schematic diagrams for explaining an outline of the information processing system according to this embodiment.
  • a presenter who gives a presentation or the like prepares presentation materials in advance using, for example, so-called presentation software, as in the conventional art.
  • This presentation material includes images of a plurality of pages, etc., and the presentation is made by displaying these in order (so-called slide show).
  • an online presentation is given by a presenter using presentation materials in an online conference in which a plurality of participants participate via a network.
  • in the information processing system, for example, text information for the avatar's speech is extracted from the text written in the presentation material (see FIG. 1).
  • audio information may be acquired by recording the audio of the online presentation given by the presenter (see FIG. 2). It should be noted that not only audio but also video may be recorded together. Also, the presentation by the presenter does not have to be an online presentation; for example, the voice of an offline presentation may be recorded with a recording device.
  • voice information obtained by recording the voice of the presenter's presentation can be used.
  • the information processing system converts the voice information into text information by voice recognition processing.
  • based on the presentation materials and the text information, the information processing system generates content data including a moving image in which the presenter's avatar gives a presentation using the presentation materials. By displaying the generated content data on a display, projecting it with a projector, or distributing it on a moving image distribution site or the like, the presenter does not have to repeatedly present the same content.
  • the voice information recorded by the presenter is converted into text information, but it is not limited to this.
  • the presenter may create text information of lines to be spoken at the time of presentation.
  • the information processing system acquires presentation materials and text information prepared in advance by the presenter, and generates content data based on these. That is, it does not matter whether text information used by the information processing system to generate content data is converted from voice information by voice recognition.
  • the presenter can generate content data in which his or her avatar gives the presentation, without having to present in person.
  • the information processing system acquires the above presentation materials and text information, and generates voice information by reading out the text information using synthesized voice.
  • the information processing system allows the avatar of the presenter, which is selected from data of a plurality of avatars prepared in advance, to perform mouth movements and gestures corresponding to the generated voice information, so that the avatar can express lines related to the presentation.
  • the information processing system uses images such as multiple pages of slides included in the acquired presentation material as background image data, superimposes the avatar video on this background image, and adds audio information to generate content data.
  • the content data is output as, for example, a moving image file, and can be used for display on an appropriate display device, projector, or the like, or for distribution on a moving image distribution site or the like.
  • for this content data, settings such as the appearance of the avatar, the gestures performed by the avatar, the characteristics of the voice output as the avatar's utterances, or the pronunciation of words output as voice are received from the presenter, and content data reflecting these settings is generated.
  • the information processing system can be expected to support generation of content data suitable for the presenter's preference and purpose.
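To make the overall flow concrete, the following is a minimal Python sketch of the pipeline just described: acquire slides and text, synthesize voice, animate the avatar, and compose the result. All function and field names are illustrative placeholders standing in for the processing described in the embodiment; they are not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class AvatarSettings:
    # Settings the presenter can choose (appearance, voice, pronunciation).
    appearance: str = "default"
    voice_profile: str = "neutral"
    pronunciation_overrides: dict[str, str] = field(default_factory=dict)

# The three helpers below are trivial stand-ins for the text-to-speech,
# avatar-animation and compositing steps described in this embodiment; real
# implementations would call a speech synthesizer, a 3D renderer and a video
# encoder.
def synthesize_voice(texts: list[str], settings: AvatarSettings) -> list[str]:
    return [f"<audio:{settings.voice_profile}:{t}>" for t in texts]

def animate_avatar(audio_clips: list[str], settings: AvatarSettings) -> list[str]:
    return [f"<avatar:{settings.appearance} lip-synced to {a}>" for a in audio_clips]

def compose_video(slides: list[str], avatar_clips: list[str], audio_clips: list[str]) -> str:
    scenes = [f"{s} + {v} + {a}" for s, v, a in zip(slides, avatar_clips, audio_clips)]
    return "presentation.mp4 <- " + " | ".join(scenes)

settings = AvatarSettings(appearance="Dr. Value", voice_profile="calm")
audio = synthesize_voice(["Hello. My name is Dr. Value."], settings)
video = compose_video(["slide1.png"], animate_avatar(audio, settings), audio)
print(video)
```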
  • FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to this embodiment.
  • the information processing apparatus 1 according to the present embodiment includes a processing unit 11, a storage unit (storage) 12, a communication unit (transceiver) 13, a display unit (display) 14, an operation unit 15, and the like.
  • the information processing device 1 according to the present embodiment can be configured using a general-purpose information processing device such as a personal computer or a tablet terminal device. In this embodiment, one information processing apparatus 1 performs the processing, but a plurality of information processing apparatuses may perform the processing in a distributed manner.
  • the user who uses the information processing device 1 is assumed to be the presenter, but the user is not limited to this and may be someone other than the presenter.
  • the processing unit 11 is configured using an arithmetic processing unit such as a CPU (Central Processing Unit), MPU (Micro-Processing Unit), GPU (Graphics Processing Unit) or quantum processor, together with a ROM (Read Only Memory), RAM (Random Access Memory), and the like. By reading and executing the program 12a stored in the storage unit 12, the processing unit 11 performs various processes such as a process of acquiring presentation materials, text information, and the like, a process of accepting various settings from the user, and a process of generating content data based on the acquired information and the received settings.
  • the storage unit 12 is configured using a large-capacity storage device such as a hard disk.
  • the storage unit 12 stores various programs executed by the processing unit 11 and various data required for processing by the processing unit 11 .
  • the storage unit 12 stores a program 12a executed by the processing unit 11.
  • The storage unit 12 is also provided with a content data storage unit 12b that stores content data generated by the information processing apparatus 1.
  • the program (computer program, program product) 12a is provided in a form recorded in a recording medium 99 such as a memory card or an optical disk, and the information processing apparatus 1 reads out the program 12a from the recording medium 99 and stores it in the storage unit 12.
  • the program 12a may be written in the storage unit 12 at the manufacturing stage of the information processing device 1, for example.
  • the program 12a may be distributed by a remote server device or the like and acquired by the information processing device 1 through communication.
  • the program 12a may be one that is recorded in the recording medium 99 by a writing device and that the information processing device 1 reads out and writes in the storage unit 12.
  • the program 12a may be provided in the form of distribution via a network, or may be provided in the form of being recorded on the recording medium 99.
  • the content data storage unit 12b stores content data generated by the information processing device 1 based on information such as presentation materials and text information.
  • the content data is stored in the content data storage unit 12b as a moving image file in MPEG-4 format, for example.
  • the content data storage unit 12b may store various information such as the title of the presentation, the name of the presenter, the date and time of the presentation, or an overview of the contents of the presentation, together with the moving image file.
  • the communication unit 13 communicates with various devices via a network N including, for example, the Internet, a LAN (Local Area Network), or a mobile phone communication network.
  • the information processing device 1 can perform processing such as acquisition (downloading) of the program 12a, implementation of an online presentation, and distribution of generated content data by communicating with other devices through the communication unit 13.
  • the communication unit 13 transmits the data given from the processing unit 11 to other devices, and gives the data received from the other devices to the processing unit 11 .
  • the display unit 14 is configured using a liquid crystal display or the like, and displays various images, characters, etc. based on the processing of the processing unit 11.
  • the operation unit 15 receives a user's operation and notifies the processing unit 11 of the received operation.
  • the operation unit 15 receives a user's operation using an input device such as mechanical buttons or a touch panel provided on the surface of the display unit 14 .
  • the operation unit 15 may be an input device such as a mouse and a keyboard, and these input devices may be detachable from the information processing apparatus 1 .
  • the program 12a stored in the storage unit 12 is read out and executed by the processing unit 11, whereby the information acquisition unit 11a, the avatar data generation unit 11b, the voice data generation unit 11c, the background data generation unit 11d, the content data generation unit 11e, the display processing unit 11f, and the like are implemented in the processing unit 11 as software functional units.
  • among the functional units of the processing unit 11, functional units related to content data generation are illustrated, and functional units related to other processes are omitted.
  • the information acquisition unit 11a performs processing for acquiring information such as presentation materials and text information necessary for generating content data.
  • the information acquisition unit 11a acquires information on presentation materials prepared in advance by the user.
  • the user prepares in advance a multi-page image (slides) including sentences summarizing the content of the presentation, graphs, illustrations, etc., as a presentation material using, for example, existing presentation software.
  • the presentation material may be created by the information processing device 1 or by another device.
  • the data of the presentation material prepared in advance by the user is stored in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a acquires the presentation material by reading it out from the storage unit 12.
  • the information acquisition unit 11a acquires text information corresponding to the lines spoken by the presenter's avatar in the generated content data.
  • sentences, characters, words, or the like described in the presentation material can be used as the avatar's lines.
  • the creator sets in advance which of the sentences included in the presentation material is to be used as the speech of the avatar using the comment function or the like of the presentation software for creating this presentation material.
  • the information acquisition unit 11a can recognize comments and the like attached to the presentation material and extract sentences and the like included in the presentation material as text information for making the avatar's lines.
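As a concrete illustration of this step, the sketch below reads per-slide narration text out of a presentation file. It assumes the material is a .pptx file and uses the speaker-notes field via the third-party python-pptx package as a stand-in for the comment function mentioned above; the embodiment itself does not specify a file format or library.

```python
# Pull per-slide narration text out of a .pptx presentation (assumed format).
from pptx import Presentation

def extract_narration(path: str) -> list[str]:
    texts = []
    for slide in Presentation(path).slides:
        if slide.has_notes_slide:
            note = slide.notes_slide.notes_text_frame.text.strip()
            if note:
                texts.append(note)  # one narration block per slide
    return texts

if __name__ == "__main__":
    for number, line in enumerate(extract_narration("presentation.pptx"), start=1):
        print(number, line)
```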
  • the information acquisition unit 11a can use the lines spoken by the presenter when the presenter actually made a presentation based on the presentation material as the lines spoken by the avatar in the content data.
  • the presentation is video-recorded or recorded, and voice information including the lines spoken by the presenter is prepared in advance.
  • the user stores this voice information in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a acquires the voice information by reading out the voice information stored in the storage unit 12.
  • the information acquisition unit 11a that has acquired the voice information acquires text information by, for example, performing so-called voice recognition processing on this voice information and converting the voice information into text information.
  • the information processing device 1 may perform the speech recognition processing itself, or may transmit the speech information to another device that performs the speech recognition processing, and acquire the text information converted by the speech recognition processing by the other device.
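As one hedged example of this conversion, the sketch below transcribes a recorded WAV file with the third-party SpeechRecognition package and its free web recognizer; the embodiment leaves the choice of speech recognition engine, local or remote, open.

```python
# Convert recorded presentation audio into text information (one possible approach).
import speech_recognition as sr

def transcribe(path: str, language: str = "ja-JP") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:      # expects WAV/AIFF/FLAC input
        audio = recognizer.record(source)   # read the whole file
    return recognizer.recognize_google(audio, language=language)

if __name__ == "__main__":
    print(transcribe("presentation_audio.wav"))
```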
  • as a method of acquiring text information, it is also possible to adopt a method in which the user directly creates text information corresponding to the avatar's lines.
  • the user creates sentences corresponding to the avatar's dialogue using, for example, a text editor or sentence creation software, and stores the sentences in the storage unit 12 as text information.
  • the text information may be created by the information processing device 1 or by another device.
  • the user stores the created text information in the storage unit 12, and the information acquisition unit 11a can acquire the text information by reading the text information stored in the storage unit 12.
  • the avatar data generation unit 11b performs processing for generating data related to the presenter's avatar appearing in the content data.
  • the avatar data generation unit 11b displays, for example, a list of information about a plurality of avatars stored in the database on the display unit 14, and receives selection of an avatar from the user.
  • the avatar data generation unit 11b acquires data of the selected avatar from the database, and displays a preview screen showing the appearance of the avatar on the display unit 14 based on the acquired data.
  • the avatar data generation unit 11b accepts an editing operation such as the color or shape of the avatar from the user on this preview screen, and uses the edited avatar as an avatar to appear in the content data.
  • the avatar data generation unit 11b accepts various settings from the user, such as the position at which the avatar is displayed, the direction of the avatar, and the movements (gestures) performed by the avatar in the content data to be generated, and reflects these settings in the avatar data.
  • the avatar data generation unit 11 b generates avatar data including data such as the shape of the avatar and settings such as the display position of the avatar, and stores the data in the storage unit 12 .
  • a plurality of avatars may appear in the content data, and in this case, the avatar data generating section 11b generates avatar data for the plurality of avatars.
  • the audio data generation unit 11c performs processing for generating audio data spoken by the avatar in the content data.
  • the voice data generation unit 11c performs so-called text-to-speech processing based on the text information acquired by the information acquisition unit 11a, thereby converting the text information into voice data. Since the text-to-speech processing is an existing technology, detailed description is omitted.
  • the information processing apparatus 1 may perform the text-to-speech process by itself, or may transmit text information to another apparatus that performs the text-to-speech process, and acquire voice data converted by the text-to-speech process by the other apparatus.
  • the voice data generation unit 11c sequentially displays one or more texts included in the text information acquired by the information acquisition unit 11a on the display unit 14, and accepts selection of texts to be converted into voice data.
  • the voice data generation unit 11c receives from the user settings related to, for example, the pitch, speed, depth (thickness of voice), voice quality, or volume of the voice data to be generated, and generates audio data that reflects the accepted settings.
  • the audio data generator 11c outputs the generated audio data from an audio output device such as a speaker or an earphone.
  • the voice data generation unit 11c accepts settings for association between avatars and texts, and also accepts settings such as speed or voice quality for each avatar.
  • the voice data generation unit 11c receives settings related to pronunciation for, for example, words or short sentences included in the text information, and corrects the pronunciation of the target words included in the voice data.
  • suppose, for example, that the text information includes the word "Yukawa" and that the voice data generated by the voice data generation unit 11c pronounces this word as "Yugawa".
  • in this case, the user selects "Yukawa" from the displayed sentences and sets "Yukawa" as the correct pronunciation of this word.
  • the voice data generation unit 11c that receives this setting generates voice data in which the pronunciation of all "Yukawa" included in the text information is changed from "Yugawa" to "Yukawa".
  • this example deals with words written in ideograms (Chinese characters) in the text information, and is an example in which the pronunciation is set in phonetic characters (katakana or hiragana).
  • the setting is not limited to this.
  • the voice data generation unit 11c may accept settings for pronunciation using, for example, phoneme characters (romaji) or phonetic symbols.
  • the voice data generation unit 11c may receive settings such as the position of an accent for the pronunciation of words.
  • the background data generation unit 11d performs processing for generating image data that serves as the background of the avatar in the content data.
  • a plurality of images (slides) included in the presentation material acquired by the information acquisition unit 11a are used as the background image of the avatar, and content data is generated in which the avatar makes a presentation using the presentation material.
  • the background data generation unit 11d receives settings such as display order and display switching timing for a plurality of images included in the presentation material acquired by the information acquisition unit 11a.
  • the background data generation unit 11 d generates background data including a plurality of background images and settings such as timings for displaying the background images, and stores the background data in the storage unit 12 .
  • the background data generation unit 11d also performs a process of adding a caption character string such as a title or subtitles to the background image based on the presentation material.
  • the background data generation unit 11d accepts input of a character string to be displayed as a caption character string from the user, and also accepts settings such as the position and direction in which the caption character string is displayed, the size and font of the character string, and the timing at which the caption character string is displayed.
  • the background data generation unit 11d stores the caption character strings and settings related to them in the background data.
  • based on the avatar data generated by the avatar data generation unit 11b, the audio data generated by the audio data generation unit 11c, and the background data generated by the background data generation unit 11d, the content data generation unit 11e generates, as content data, data of a moving image in which, for example, the avatar gives a presentation using the presentation material.
  • the content data generation unit 11e arranges the avatar included in the avatar data at a position, in an orientation, and the like according to the settings with respect to the background image included in the background data, and outputs the voice included in the voice data at an appropriate timing, thereby generating the content data.
  • the content data generation unit 11 e stores the generated content data in the content data storage unit 12 b of the storage unit 12 .
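A rough sketch of this composition step is shown below, assuming moviepy 1.x and pre-rendered per-scene assets (a slide image, an avatar clip with an alpha mask, and a narration file); the file names and layout are illustrative assumptions, not the disclosed implementation.

```python
# Compose each slide image with an overlaid avatar clip and narration audio.
from moviepy.editor import (AudioFileClip, CompositeVideoClip, ImageClip,
                            VideoFileClip, concatenate_videoclips)

def build_scene(slide_png: str, avatar_mov: str, narration_wav: str) -> CompositeVideoClip:
    narration = AudioFileClip(narration_wav)
    background = ImageClip(slide_png).set_duration(narration.duration)
    avatar = (VideoFileClip(avatar_mov, has_mask=True)
              .set_duration(narration.duration)
              .set_position(("right", "bottom")))   # avatar in front of the slide
    return CompositeVideoClip([background, avatar]).set_audio(narration)

scenes = [build_scene(f"slide_{i}.png", f"avatar_{i}.mov", f"voice_{i}.wav")
          for i in range(1, 4)]
concatenate_videoclips(scenes).write_videofile("content.mp4", fps=24)
```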
  • the display processing unit 11f performs processing for displaying various information such as images and characters on the display unit 14.
  • the display processing unit 11f performs processing for displaying, for example, a screen for accepting settings related to avatars, a screen for accepting settings related to voice, a screen for accepting settings related to the background, and the generated content data.
  • the display processing unit 11f not only displays these items on the display unit 14 provided in the information processing device 1, but may also transmit the display data to other devices through the communication unit 13 so that it is displayed on the display units or the like of those devices.
  • FIG. 4 is a flow chart showing the procedure of content data generation processing performed by the information processing apparatus 1 according to the present embodiment.
  • the information acquisition unit 11a of the processing unit 11 of the information processing apparatus 1 according to the present embodiment acquires a presentation material file or the like created in advance by the user by reading it from the storage unit 12 (step S1).
  • the information acquisition unit 11a also acquires text information for the avatar's speech (step S2). At this time, the information acquisition unit 11a acquires sentences and the like included in the presentation material as text information to be uttered by the avatar, based on, for example, comments preset in the presentation material acquired in step S1. The information acquisition unit 11a may also acquire a file or the like of voice information recorded when the presenter gave a presentation, and convert the voice information into text information by voice recognition processing, thereby acquiring the text information. The information acquisition unit 11a may also acquire text information created in advance by the user writing the avatar's lines.
  • the display processing unit 11f of the processing unit 11 displays, on the display unit 14, an utterance editing screen for making settings when uttering the text information based on the text information acquired in step S2 (step S3).
  • the voice data generation unit 11c of the processing unit 11 accepts editing of the spoken voice by accepting the user's operation on the operation unit 15 while the spoken voice editing screen is displayed (step S4).
  • the voice data generating unit 11c reflects the edited content received in step S4 and generates voice data based on the text information (step S5).
  • FIG. 5 is a schematic diagram showing an example of the speech editing screen.
  • the information processing apparatus 1 displays the illustrated speech editing screen based on the text information acquired in step S2.
  • the utterance editing screen, for example, shows a title string of "spoken voice editing" at the top; below it, a button labeled "output all audio" and a button labeled "add text" are arranged on the left and right; and below these, a setting table 101 is provided in which a plurality of texts included in the text information and a plurality of setting items related to each text are arranged in a matrix.
  • the setting table 101 is a table in which a plurality of texts are arranged in a list in the vertical direction and a plurality of setting items are arranged in the horizontal direction.
  • the setting table 101 has, for example, items of "number”, “text”, “interval (seconds)", “speaker” and “expression” in order from the left, and an icon area is provided at the right end.
  • “Number” is numerical information indicating the order in which text is spoken by avatars in the finally generated content data.
  • "Text" is the text (sentences, lines, etc.) uttered by the avatar, and is character string information of one or more characters. In this example, the first text is "Hello. My name is Dr. Value."
  • the information processing apparatus 1 appropriately divides the sentences included in the acquired text information into a plurality of texts based on punctuation marks and the like, and assigns numbers in order, thereby obtaining the information to be displayed in the "number" and "text" columns of the setting table 101 shown in the figure.
  • the division of the sentences included in the text information into a plurality of texts may be performed, for example, in the speech recognition processing, may be performed in advance by the user, or may be performed when the information processing apparatus 1 acquires the text information.
  • when dividing text in the speech recognition processing, for example, if there is an interval exceeding a predetermined time between utterances, the preceding and following utterances can be divided into two texts.
  • when the user divides the text, for example, the user checks the text information with a text editor or the like and inserts a line feed or tab at an appropriate location, thereby dividing the text.
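For illustration, a minimal sketch of such punctuation-based division is given below; the default 0.5-second interval mirrors the setting table described next, and the simple splitting rule is an assumption rather than the disclosed algorithm.

```python
# Split acquired text information into numbered utterance texts on punctuation.
import re

def split_into_texts(raw: str) -> list[dict]:
    # Japanese and Western sentence-ending marks are both treated as split points.
    sentences = [s.strip() for s in re.split(r"(?<=[。．！？.!?])\s*", raw) if s.strip()]
    # Assign sequential numbers and a default 0.5-second interval, mirroring
    # the "number" and "interval" columns of the setting table.
    return [{"number": i, "text": s, "interval_sec": 0.5}
            for i, s in enumerate(sentences, start=1)]

print(split_into_texts("こんにちは。バリュー博士と申します。Nice to meet you!"))
```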
  • "Interval (seconds)" in the setting table 101 is an item for setting, in units of seconds, the interval provided between the utterance of this text and that of the previous text. In this example, 0.5 seconds is set by the information processing apparatus 1 as a default value.
  • “Speaker” is an item for setting which avatar speaks this text. In this example, it is set that "Dr. Value” speaks the first text and "College” speaks the second text.
  • the information processing apparatus 1 can accept the setting of the "speaker” by accepting the selection of one avatar from the pre-registered avatars, for example, using a pull-down menu or the like.
  • “Facial expression” is an item for setting the facial expression of the avatar when the avatar set with this text speaks.
  • facial expressions such as "natural” and “smiling” are set.
  • the information processing apparatus 1 can accept the setting of the "facial expression” by accepting the selection of one facial expression from pre-registered facial expressions, for example, using a pull-down menu or the like.
  • the information processing device 1 displays, for example, an icon resembling a speaker and an icon resembling a trash can in the rightmost icon area of the setting table 101 in association with each text.
  • the icon imitating a speaker is for accepting an operation for outputting the corresponding text by voice.
  • when an operation on this icon is accepted, the information processing apparatus 1 outputs only the corresponding text as voice.
  • the trash can icon is for accepting an operation to delete this text.
  • when an operation on this icon is accepted, the information processing apparatus 1 deletes the corresponding text and its settings.
  • the "output all audio" button provided at the top of the utterance editing screen is a button for outputting all of the text as voice.
  • when the information processing apparatus 1 accepts an operation on the "output all audio" button, it sequentially outputs as voice all the texts included in the setting table 101 from the beginning to the end.
  • the "add text” button is a button for adding arbitrary text.
  • when an operation on the "add text" button is accepted, a dialog box for adding text (not shown) is displayed, and input of the text to be added and settings such as the order in which the text is uttered, the interval, the speaker, and the facial expression are accepted.
  • the information processing apparatus 1 adds the text by inserting the text and setting received in this dialog box into the setting table 101 at an appropriate position.
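The rows of the setting table 101 can be thought of as records like the following sketch, where the field names mirror the columns described above; the data structure itself is an illustrative assumption.

```python
# One row of the setting table 101: utterance order, text, interval, speaker, expression.
from dataclasses import dataclass

@dataclass
class UtteranceRow:
    number: int
    text: str
    interval_sec: float = 0.5       # pause before this utterance
    speaker: str = "Dr. Value"      # which avatar speaks the text
    expression: str = "natural"     # avatar facial expression while speaking

rows = [
    UtteranceRow(1, "Hello. My name is Dr. Value.", 0.5, "Dr. Value", "smiling"),
    UtteranceRow(2, "Nice to meet you.", 0.5, "College", "natural"),
]
for row in rows:
    print(row)
```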
  • FIG. 6 is a schematic diagram for explaining the pronunciation correction operation.
  • FIG. 6 shows an utterance editing screen in which the setting table 101 is set with the text "Hello. My name is Yukawa.” The user uses an input device such as a mouse to select the word "Yukawa" included in this text.
  • when the word is selected, the information processing apparatus 1 displays a button labeled, for example, "pronunciation correction". When an operation on this button is accepted, the information processing apparatus 1 displays, for example, a pronunciation correction dialog box and accepts settings regarding pronunciation from the user.
  • FIG. 7 is a schematic diagram showing an example of a pronunciation correction dialog box.
  • the upper part of FIG. 7 shows the state before pronunciation correction, and the lower part shows the state after pronunciation correction.
  • in the pronunciation correction dialog box of this example, for example, the title string "Pronunciation correction" is displayed at the top, a text box labeled "Target text" and a text box labeled "Pronunciation" are arranged vertically below it, and a button labeled "Voice output" and a button labeled "Complete" are arranged horizontally below the text boxes.
  • the information processing device 1 displays the word selected by the user on the speech editing screen in the "target text” text box.
  • "Yukawa” selected on the utterance editing screen shown in FIG. 6 is displayed in the "target text” text box.
  • the information processing apparatus 1 also displays the pronunciation when the target text is uttered in the "pronunciation” text box in phonetic notation such as katakana or hiragana.
  • the "Voice output" button in the pronunciation correction dialog box is a button for outputting only the word indicated in "Target text" as voice.
  • when an operation on this button is accepted, the information processing apparatus 1 performs voice output by reading out only the word indicated in "Target text" with the pronunciation set in "Pronunciation". In the example in the upper part of FIG. 7, voice output is performed with the pronunciation of "Yugawa".
  • the information processing device 1 accepts the correction of the pronunciation of the target text by accepting the user's correction of the phonetic notation displayed in the "pronunciation" text box.
  • the user corrects "yugawa” displayed in the text box as the current pronunciation of "yukawa” to "yukawa” using an input device such as a keyboard.
  • when the "Voice output" button is operated after this correction, the information processing apparatus 1 performs voice output with the pronunciation of "Yukawa".
  • the "Complete" button in the pronunciation correction dialog box is a button for reflecting the pronunciation correction and closing this dialog box.
  • when an operation on the "Complete" button is accepted, the information processing apparatus 1 stores the word in "Target text" of the pronunciation correction dialog box in association with the pronunciation set in the "Pronunciation" text box, and generates audio data in which the set pronunciation is applied to all of the same words included in the text information.
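A minimal sketch of applying such a registered pronunciation to every occurrence of the word before text-to-speech might look as follows; the simple string substitution is an assumption standing in for whatever reading-control mechanism the speech synthesizer actually offers.

```python
def apply_pronunciations(texts: list[str], overrides: dict[str, str]) -> list[str]:
    # Replace each registered word with its corrected reading so that the
    # downstream text-to-speech step pronounces it as set by the user.
    corrected = []
    for text in texts:
        for word, reading in overrides.items():
            text = text.replace(word, reading)
        corrected.append(text)
    return corrected

# e.g. 湯川 would otherwise be read "Yugawa"; the user registers the reading "ユカワ" (Yukawa).
print(apply_pronunciations(["こんにちは。湯川と申します。"], {"湯川": "ユカワ"}))
```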
  • the display processing unit 11f displays a content editing screen on the display unit 14 (step S6), and the avatar data generation unit 11b and the background data generation unit 11d of the processing unit 11 accept editing of the avatar and the background by accepting the user's operations on the operation unit 15 while the content editing screen is displayed (step S7).
  • the avatar data generation unit 11b generates avatar video data reflecting the editing contents accepted in step S7 (step S8).
  • the background data generation unit 11d also generates background image data reflecting the editing contents accepted in step S7 (step S9).
  • FIG. 8 is a schematic diagram showing an example of the content editing screen.
  • the information processing apparatus 1 displays the illustrated content editing screen based on the presentation material acquired in step S1 and the text information acquired in step S2.
  • the content editing screen shows, for example, a title string of "content editing" at the top, and a background image selection area 111, a content editing area 112, and an avatar setting area 113 are arranged in the horizontal direction below the title string.
  • the background image selection area 111 of the content editing screen is an area for accepting selection of a background image by the user.
  • the information processing apparatus 1 displays a list of a plurality of slides included in the presentation material in the background image selection area 111 as background images.
  • a plurality of background images listed in the background image selection area 111 are displayed in the order in which they are arranged in this area.
  • the information processing apparatus 1 accepts selection of one background image from among the plurality of background images displayed in the background image selection area 111 and displays the selected background image in the content editing area 112 .
  • the information processing apparatus 1 also accepts operations such as addition, deletion, and change of the display order of background images, and performs addition, deletion, order change, and the like for the plurality of background images displayed in a list according to the accepted operations.
  • the avatar setting area 113 of the content editing screen is an area for accepting settings related to one or more avatars appearing in the content data.
  • an avatar selection area, a text display area, a setting reception area, and the like are arranged vertically.
  • the information processing apparatus 1 displays a list of images, names, etc. of one or more avatars created in advance in the avatar selection area of the avatar setting area 113 .
  • two avatars "Dr. Value” and “College” are displayed in the avatar selection area, and the avatar "Dr. Value” is selected.
  • the information processing device 1 displays the avatar selected in the avatar selection area in the content editing area 112 .
  • avatar creation is performed on an avatar creation screen or the like.
  • the avatar need not be created by the user; the user may, for example, acquire and use an avatar provided for a fee or free of charge. Since the method of creating an avatar is an existing technology, detailed description is omitted.
  • the text display area of the avatar setting area 113 is an area where the text spoken by this avatar is displayed. Based on the text information acquired in step S2, or on the text information obtained by editing it on the above-described utterance editing screen, the information processing apparatus 1 displays one or more texts included in the text information in the text display area. The information processing apparatus 1 selects the plurality of texts included in the text information in output order and displays them in the text display area, and the user can change the text displayed in the text display area as appropriate.
  • the setting acceptance area of the avatar setting area 113 is an area that accepts settings for a plurality of setting items related to the avatar selected in the avatar selection area.
  • “gesture”, “size”, “orientation”, “position” and the like are shown as setting items related to the avatar.
  • the user can input or select a setting value for each setting item by various methods such as direct numerical input or selection from a pull-down menu.
  • the illustrated setting items are merely an example, and the information processing apparatus 1 may receive settings related to avatars by providing various setting items other than the illustrated setting items.
  • the information processing apparatus 1 may provide, in the setting reception area, setting items related to the voice with which the avatar utters the text, for example, setting items such as the pitch, speed, depth, or volume of the voice uttered by the avatar, and accept these settings from the user.
  • the content editing area 112 of the content editing screen displays the avatar selected in the avatar selection area of the avatar setting area 113 superimposed on the background image selected in the background image selection area 111 .
  • an image that reproduces one scene of the finally generated content data is displayed.
  • the user can change the position, orientation, and the like of the avatar displayed in the content editing area 112 by, for example, a mouse operation or a touch operation, and the information processing apparatus 1 accepts such changes to settings such as the position and orientation.
  • the information processing apparatus 1 changes the setting value of the corresponding setting item provided in the setting accepting area of the avatar setting area 113 .
  • information processing apparatus 1 changes the display mode of the avatar displayed in content editing area 112 according to the accepted setting.
  • the user can add a caption to the background image by performing a predetermined operation in the content editing area 112 .
  • as the predetermined operation, for example, when the user designates one point in the content editing area 112 using a function such as a right-click menu of a mouse and performs an operation to add a caption, the information processing apparatus 1 accepts input of a caption character string at that point in the content editing area 112.
  • a caption setting area is displayed instead of the avatar setting area 113 of the content editing screen.
  • FIG. 9 is a schematic diagram showing an example of a content editing screen on which a caption setting area 114 is provided.
  • the caption setting area 114 of the content editing screen is an area for receiving settings related to the caption character string input in the content editing area 112 .
  • the caption setting area 114 is provided with setting items such as "font type", "size", and "position".
  • a character string "Introduction to Business in the Digital Age” is entered as a caption in a text box indicated by a dashed rectangular frame in the content editing area 112 .
  • the information processing apparatus 1 receives settings for the caption in each setting item of the caption setting area 114, and displays the caption in a display mode according to the received settings.
  • the information processing apparatus 1 stores information such as the input caption character string and caption settings together with, for example, the background image data.
  • having performed the processing related to content editing in steps S6 to S9 of the flowchart shown in FIG. 4, the content data generation unit 11e of the processing unit 11 of the information processing apparatus 1 generates content data (step S10), stores the generated content data in the content data storage section 12b of the storage section 12, and ends the process.
  • the content data generation unit 11e integrates the audio data generated in step S5, the avatar video data generated in step S8, and the background image data generated in step S9, and generates content data in which the avatar superimposed on the background image speaks.
  • the information processing apparatus 1 acquires, for example, presentation materials, text information, and the like, as well as image files and the like of various parts to be included in the content.
  • the information processing apparatus 1 arranges these various parts together with the avatar in an appropriate position and order by accepting a user's operation on the content editing screen, for example.
  • the user can perform editing such as placing a lectern in front of the avatar or placing decorative parts between the avatar and the background, and can thereby express depth on the screen and enhance the realism of the content.
  • the user can also hide the avatar on the screen, for example, by placing the avatar behind the background.
  • the lines of the hidden avatar can be output as narration, allowing the user to create more effective content.
  • the text information acquired together with the presentation material may contain a large amount of text that becomes the dialogue of the avatar.
  • In the present embodiment, each of the many lines included in the text information is associated, for example on the content editing screen, with the avatar that speaks it.
  • the information processing apparatus 1 may assign all lines to one pre-selected avatar, for example, when acquiring presentation materials and text information.
  • when the information processing apparatus 1 assigns lines collectively in this manner, the user can generate content data without having to perform an operation for assigning lines to avatars on the subsequent content editing screen or the like.
  • in this case, the information processing apparatus 1 may accept an editing operation from the user on the content editing screen or the like, such as reallocating lines allocated to a first avatar to a second avatar.
  • the information processing device 1 that has generated the content data can reproduce the content data using, for example, a video reproduction application program and display it on the display unit 14 .
  • the information processing device 1 may also upload content data to, for example, a video distribution site.
  • FIG. 10 is a schematic diagram showing another example of the speech editing screen.
  • the information processing apparatus 1 according to the present embodiment may display, for example, an utterance editing screen shown in FIG. 10 instead of the utterance editing screen shown in FIG.
  • the speech editing screen shown in FIG. 10 is suitable for creating content data in a format in which two avatars interact.
  • on the utterance editing screen shown in FIG. 10, a title string of "spoken voice editing screen" is displayed at the top of the screen, and the names of the two avatars are displayed side by side below it.
  • the left side of the screen is used as an area for displaying information such as the utterance content of "Dr. Value", and the right side of the screen is used as an area for displaying information such as the utterance content of "College".
  • the text information spoken by each avatar is placed in a rectangular frame, and multiple pieces of text information are displayed in chronological order from top to bottom of the screen.
  • the text information about "Dr. Value” is displayed on the left side of the screen
  • the text information about "College” is displayed on the right side of the screen.
  • the user can scroll a plurality of pieces of text information in chronological order by, for example, performing a slide operation in the vertical direction, and can confirm a plurality of pieces of text information that cannot fit on one screen.
  • the user can arbitrarily edit the text information contained within the rectangular frame.
  • one or more icons are provided, for example, in the lower right corner of the rectangular frame in which the text information of each avatar is placed. Note that in this figure, these icons are shown in a simplified form as square figures. These icons are for accepting various operations from the user, such as accepting settings for the corresponding text information, accepting an operation to output the corresponding text information as voice, or accepting an operation to delete the corresponding text information.
  • a rectangular frame that is long in the horizontal direction is displayed between the text information of two utterances that are continuous in time series, indicating the time setting of the interval to be provided between utterances. .
  • in the illustrated example, a rectangular frame containing the character string "interval: 0.5 seconds" is displayed between the text information "Hello. Nice to meet you." of "Dr. Value" and the following text information of "College". This indicates that there is an interval of 0.5 seconds between the utterance of "Dr. Value" and the utterance of "College", that is, a period during which neither avatar speaks.
  • the user can arbitrarily set the interval time by correcting the numerical values in the rectangular frame.
  • the information processing apparatus 1 displays the utterance editing screen in which a plurality of pieces of text information spoken by the avatars are arranged in time series in the vertical direction of the screen and the text information spoken by the two avatars is displayed separately on the left and right sides of the screen. As a result, the user can expect to easily generate content data such as a moving image in which, for example, two avatars give a presentation while talking with each other.
  • alternatively, the information processing device 1 may arrange the plurality of pieces of text information uttered by the avatars in chronological order in the horizontal direction of the screen, and display the text information uttered by the two avatars separately in the upper and lower parts of the screen.
  • Information processing apparatus 1 may use a three-dimensional model, that is, a three-dimensional character object reproduced in a three-dimensional virtual space as an avatar displayed in content data.
  • the information processing apparatus 1 reads data of a three-dimensional model of an avatar created in advance or newly created by a user, and reproduces this avatar in a three-dimensional virtual space.
  • the information processing device 1 can generate content data by acquiring a two-dimensional image by photographing an avatar with a virtual camera appropriately arranged in a three-dimensional virtual space.
  • the information processing device 1 accepts from the user settings related to the position of the virtual camera in the three-dimensional virtual space, that is, camerawork, in order to shoot an image (moving image) of the avatar to be included in the content data.
  • the information processing device 1 can receive from the user settings such as the position of the virtual camera in the three-dimensional virtual space in the front-back, left-right, and up-down directions (x-coordinate, y-coordinate, and z-coordinate), the direction of the virtual camera from this position, and temporal changes of these.
  • the information processing apparatus 1 arranges a virtual camera in the three-dimensional virtual space according to the received settings, moves the virtual camera to photograph the avatar, and acquires a two-dimensional image to be included in the content data.
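As a simple illustration of the camera placement described above, the sketch below keeps a user-specified camera position together with a viewing direction computed toward the avatar; the vector math is generic, and the rendering of the 2D image itself would be done by whichever 3D engine hosts the avatar.

```python
# Compute a virtual-camera pose (position + viewing direction) aimed at the avatar.
import numpy as np

def camera_pose(camera_pos, avatar_pos):
    position = np.asarray(camera_pos, dtype=float)
    target = np.asarray(avatar_pos, dtype=float)
    forward = target - position
    forward /= np.linalg.norm(forward)   # unit vector pointing at the avatar
    return {"position": position, "forward": forward}

# Camera two metres in front of and slightly above the avatar's head height.
print(camera_pose(camera_pos=(0.0, 1.6, 2.0), avatar_pos=(0.0, 1.4, 0.0)))
```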
  • FIG. 11 is a schematic diagram for explaining an example of a camerawork setting method.
  • for example, when a predetermined operation, such as a right-click operation of a mouse, is performed on the avatar displayed in the content editing area 112 of the content editing screen, the information processing apparatus 1 displays a selection menu dialog box.
  • in the selection menu dialog box, for example, selection items such as "head shot", "upper body shot" and "whole body shot" are displayed vertically, and the user can select any one of these selection items.
  • when "head shot" is selected, the information processing apparatus 1 places the virtual camera close to the avatar in the three-dimensional virtual space so that the head of the avatar and its periphery are displayed, as shown in the lower left part of FIG. 11.
  • when "upper body shot" or "whole body shot" is selected, the information processing apparatus 1 places the virtual camera at a position in the virtual space suitable for displaying the upper body or the whole body of the avatar, as shown in the lower center or lower right part of FIG. 11.
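The shot presets could, for example, be mapped to camera framings as in the sketch below; the aim heights and distances are purely illustrative assumptions, not values taken from the disclosure.

```python
# Map the shot-type menu items to an aim height on the avatar and a camera distance.
SHOT_PRESETS = {
    "head shot":       {"aim_height": 1.6, "distance": 0.6},
    "upper body shot": {"aim_height": 1.2, "distance": 1.5},
    "whole body shot": {"aim_height": 0.9, "distance": 3.0},
}

def place_camera_for(shot: str, avatar_base=(0.0, 0.0, 0.0)):
    preset = SHOT_PRESETS[shot]
    x, y, z = avatar_base
    camera_pos = (x, y + preset["aim_height"], z + preset["distance"])
    look_at = (x, y + preset["aim_height"], z)
    return camera_pos, look_at

print(place_camera_for("head shot"))
```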
  • FIG. 12 is a schematic diagram for explaining an example of a camerawork setting method. For example, when a predetermined operation, such as a right-click operation of a mouse, is performed on the avatar displayed in the content editing area 112 on the content editing screen shown in FIG. 8, the information processing apparatus 1 displays a direction selection menu dialog box. In the direction selection menu, for example, selection items such as "left", "front" and "right" are displayed vertically, and the user can select any one of these selection items.
  • a predetermined operation such as a right-click operation of a mouse
  • the information processing device 1 places the virtual camera on the left side of the avatar as shown in the lower left part of FIG. 12, and shoots the avatar with the virtual camera.
  • Similarly, when "front" or "right" is selected, the information processing apparatus 1 places the virtual camera in front of or to the right of the avatar, as shown in the lower center or lower right part of FIG. 12.
  • Note that the left and right directions used for this setting are, for example, the left and right directions as seen from the virtual camera toward the avatar, but they are not limited to this; the left and right directions as seen from the avatar may be used instead.
  • FIG. 13 is a schematic diagram for explaining an example of a camerawork setting method. For example, when a predetermined operation is performed on the avatar displayed in the content editing area 112 of the content editing screen, the information processing apparatus 1 displays a slide bar (slider, slider bar, scroll bar, etc.) for setting the zoom near the avatar. In this example, a horizontally elongated slide bar is displayed below the avatar, and the user can slide the knob of the slide bar horizontally.
  • When the knob of the slide bar is slid in one direction, the information processing apparatus 1 moves the virtual camera away from the avatar (zooms out), as shown on the left side of FIG. 13.
  • When the knob is slid in the opposite direction, the information processing apparatus 1 moves the virtual camera closer to the avatar (zooms in), as shown on the right side of FIG. 13.
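The zoom slide bar can be mapped to the camera-to-avatar distance, for example by interpolating between a nearest and a farthest distance. The function below is a sketch under that assumption; the range values are not taken from the embodiment.

```python
def zoom_to_distance(slider_value: float,
                     min_distance: float = 0.5,
                     max_distance: float = 5.0) -> float:
    """Map a slider position in [0.0, 1.0] to a camera distance.
    0.0 = fully zoomed out (far), 1.0 = fully zoomed in (near)."""
    slider_value = min(max(slider_value, 0.0), 1.0)
    return max_distance - slider_value * (max_distance - min_distance)
```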
  • As described above, the information processing apparatus 1 according to the present embodiment receives settings such as the position and orientation of the virtual camera that photographs the avatar placed in the three-dimensional virtual space, photographs the avatar with the virtual camera according to the received settings, and generates content data including the two-dimensional image of the avatar obtained by this photographing. Accordingly, in the information processing system according to the present embodiment, the user can easily set the orientation, size, etc. of the avatar displayed in the content data.
  • The information processing apparatus 1 according to the present embodiment changes the behavior of the avatar included in the content data to be generated according to the region where the content data is provided or the age of the person to whom the content data is provided. For this purpose, the information processing apparatus 1 accepts from the user settings such as the region where the content data is provided or the age of the person to whom the content data is provided. For example, a country, a prefecture, a state, or the like can be adopted as the region for which the information processing apparatus 1 receives settings. Further, the information processing device 1 may accept an approximate age group such as the 20s or 30s as a setting, or may accept an age range such as 25 to 40 by numerical input. The age setting may also be received by methods other than these.
  • When the language spoken by the avatar is English, the information processing device 1 presents the user with options such as the United States, the United Kingdom, and Australia as regions, and accepts the setting of a region from among these. English pronunciation, accent, etc. differ from region to region, and the information processing device 1 converts the text information into voice so that the avatar speaks with the pronunciation, accent, etc. of the set region.
  • When the language spoken by the avatar is Japanese, the information processing device 1 presents the user with the names of regions such as the Kanto region, the Kansai region, and the Tohoku region as options, and accepts the selection of one of these regions.
  • the information processing apparatus 1 may also present the user with dialect names such as standard Japanese, Kansai dialect, and Tohoku dialect as options for regions.
  • the information processing device 1 converts the text information into voice so that the avatar speaks with pronunciation and accent according to the dialect of the set region.
  • When the conversion from text information to speech data is performed using a learning model generated by machine learning, the information processing apparatus 1 prepares learning models in which pronunciation, accent, and the like have been learned for each region, and can convert the text information into speech data by using a different learning model depending on the region set by the user.
  • the information processing device 1 changes the phrases or words included in the text information output as speech by the avatar according to the set region.
  • The information processing device 1 has, for example, a database that associates phrases or words that can be included in the text information with the expressions used in each region; based on this database and the region set by the user, the phrases or words included in the text information are replaced with expressions suitable for the region, and voice data is generated based on the text information after replacement.
  • the information processing apparatus 1 may change the phrases or words included in the text information to be output as voice according to the set age.
  • Further, the information processing apparatus 1 according to the present embodiment generates content data in which the volume, speed, etc. of the avatar's speech are changed according to the set age.
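A simple realization of the region- and age-dependent adjustments described above is a lookup table of replacement expressions per region together with per-age speech parameters. The sketch below is only illustrative; the table contents, region codes, and function names are assumptions and not part of the embodiment.

```python
# Hypothetical replacement table: canonical phrase -> expression per region.
REGIONAL_PHRASES = {
    "elevator": {"US": "elevator", "UK": "lift", "AU": "lift"},
    "soccer":   {"US": "soccer",   "UK": "football", "AU": "football"},
}

# Hypothetical speech parameters per age group.
AGE_SPEECH_PARAMS = {
    "20s": {"volume": 1.0, "speed": 1.0},
    "60s": {"volume": 1.2, "speed": 0.85},  # louder and slower for illustration
}

def localize_text(text: str, region: str) -> str:
    """Replace known phrases with the expression used in the set region."""
    for phrase, by_region in REGIONAL_PHRASES.items():
        text = text.replace(phrase, by_region.get(region, phrase))
    return text
```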
  • In the information processing system according to the present embodiment, it is possible to make the avatar perform gestures in the content data. However, the same gesture may have different meanings in different countries. Therefore, the information processing apparatus 1 according to the present embodiment generates content data by changing a gesture that the user has set for the avatar to perform into a gesture according to the region set by the user, as in the sketch below.
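Gesture substitution by region can likewise be expressed as a mapping from a (gesture, region) pair to a replacement gesture, with unmapped combinations keeping the gesture set by the user. The table entries here are purely illustrative assumptions.

```python
# Hypothetical table of region-specific gesture replacements.
GESTURE_BY_REGION = {
    ("thumbs_up", "JP"): "bow",        # assumed substitution for illustration only
    ("ok_sign",   "BR"): "thumbs_up",  # assumed substitution for illustration only
}

def adapt_gesture(gesture: str, region: str) -> str:
    """Return the gesture to actually perform in the given region."""
    return GESTURE_BY_REGION.get((gesture, region), gesture)
```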
  • As described above, the information processing apparatus 1 according to the present embodiment accepts from the user settings such as the region or age for which the content data is provided, and generates content data in which the avatar speaks with pronunciation, accent, phrases, words, volume, speed, and the like according to the set region or age, or in which the avatar performs gestures according to the set region, age, and the like. As a result, the user can expect to easily generate, for example, content data for a different region or age based on content data generated for a specific region or age.
  • Note that, although the information processing apparatus 1 according to the present embodiment accepts the setting of the region or age to which the content data is to be provided, settings of attributes other than these may also be received and reflected in the generation of the content data.
  • The content data generated by the information processing apparatus 1 according to the present embodiment includes, for example, an image (moving image) in which the presentation material is used as a background image and the avatar is placed in front of this background image, and a voice uttered by the avatar.
  • the user can appropriately set the position of the avatar on the screen displayed by reproducing the content data.
  • the information processing apparatus 1 according to the present embodiment can set the sound image of the voice uttered by the avatar according to the display position of the avatar.
  • A sound image is, for example, the position or direction at which a user who plays back the content data and listens to the sound perceives the source of that sound to be located.
  • For example, the position where the avatar is displayed on the screen on which the content data is played back is divided into three positions, the left side, the center, and the right side, and the information processing apparatus 1 sets the sound image of the voice uttered by the avatar to one of three positions, left, center, or right, according to the display position.
  • The information processing apparatus 1 may set the sound image by, for example, adjusting the left and right output levels of the stereo sound.
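Adjusting the left and right output levels is the simplest form of such positioning, for example a constant-power pan between the two channels. The sketch below assumes NumPy and a mono array of voice samples; it is one possible realization, not the embodiment's implementation.

```python
import numpy as np

def pan_stereo(mono: np.ndarray, position: str) -> np.ndarray:
    """Pan a mono voice to 'left', 'center' or 'right' using constant-power gains."""
    pan = {"left": 0.0, "center": 0.5, "right": 1.0}[position]
    angle = pan * np.pi / 2
    left_gain, right_gain = np.cos(angle), np.sin(angle)
    return np.stack([mono * left_gain, mono * right_gain], axis=1)  # shape (N, 2)
```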
  • FIG. 14 is a schematic diagram for explaining the outline of the sound image localization technique. In this technique, four FIR (Finite Impulse Response) filters 121 to 124 are used. The sound of the right channel (R) of the stereo sound is input to the two FIR filters 121 and 122, and the sound of the left channel (L) is input to the two FIR filters 123 and 124.
  • The right channel (R) sound processed by the FIR filter 121 and the left channel (L) sound processed by the FIR filter 123 are added together and output as the sound of a new right channel (R'). Similarly, the right channel (R) sound processed by the FIR filter 122 and the left channel (L) sound processed by the FIR filter 124 are added together and output as the sound of a new left channel (L').
  • By appropriately setting the parameters of the FIR filters 121 to 124, the information processing device 1 can adjust the position of the sound image associated with the uttered voice of the avatar. Further, the information processing apparatus 1 may create and store a plurality of sets of parameters for the FIR filters 121 to 124 in association with, for example, a plurality of positions at which the avatar can be displayed, and read out and use the set of parameters corresponding to the set display position of the avatar.
  • the parameters of the FIR filters 121-124 can be determined using, for example, head-related transfer functions. Since the sound image localization technique using the head-related transfer function is an existing technique, detailed description thereof will be omitted.
  • The information processing device 1 generates content data including the above two outputs, that is, the new left channel (L') and right channel (R') sounds. As a result, a user who plays back this content data can hear the uttered voice of the avatar with a sound image corresponding to the display position of the avatar.
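The 2x2 FIR structure of FIG. 14 can be written directly with convolution: each output channel is the sum of one filtered copy of R and one filtered copy of L. In practice the filter coefficients would be derived from head-related transfer functions; in this sketch h121 to h124 are simply placeholder coefficient arrays supplied by the caller.

```python
import numpy as np

def localize(right: np.ndarray, left: np.ndarray,
             h121: np.ndarray, h122: np.ndarray,
             h123: np.ndarray, h124: np.ndarray):
    """Apply the four FIR filters and mix as in FIG. 14.
    R' = FIR121(R) + FIR123(L),  L' = FIR122(R) + FIR124(L).
    Assumes right and left have the same length."""
    r_new = np.convolve(right, h121)[:len(right)] + np.convolve(left, h123)[:len(left)]
    l_new = np.convolve(right, h122)[:len(right)] + np.convolve(left, h124)[:len(left)]
    return l_new, r_new
```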
  • ⁇ Avatar's facial expression, voice, and gesture interlocking> For example, as shown in FIG. 5, in the information processing system according to this embodiment, the user can set the facial expression of the avatar.
  • The information processing apparatus 1 according to the present embodiment may adjust the pitch, volume, etc. of the voice uttered by the avatar according to the facial expression of the avatar set by the user.
  • For example, when a "smiling face" is set as the facial expression, the information processing device 1 raises the pitch and volume of the voice. Further, for example, when an "angry face" is set as the facial expression, the information processing device 1 lowers the pitch of the voice and raises the volume.
  • The information processing device 1 stores in a database, for example, the correspondence between the facial expression of the avatar and the amounts by which the pitch and volume of the voice are increased or decreased, and acquires the increase/decrease amounts corresponding to the set facial expression from this database. The information processing device 1 can then adjust the pitch and volume according to the facial expression by applying the increase/decrease amounts obtained from the database to the default values of the pitch and volume of the avatar's utterance.
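The correspondence between facial expression and pitch/volume offsets can be kept in a small table and applied to the default utterance parameters. This is a minimal sketch; the offset values and units are illustrative assumptions.

```python
# Hypothetical offsets applied to the default pitch and volume per facial expression.
EXPRESSION_OFFSETS = {
    "smiling": {"pitch": +2.0, "volume": +3.0},   # e.g. semitones / dB, for illustration
    "angry":   {"pitch": -2.0, "volume": +3.0},
    "sad":     {"pitch": -1.0, "volume": -3.0},
}

def adjust_voice(default_pitch: float, default_volume: float, expression: str):
    """Return (pitch, volume) adjusted for the set facial expression."""
    off = EXPRESSION_OFFSETS.get(expression, {"pitch": 0.0, "volume": 0.0})
    return default_pitch + off["pitch"], default_volume + off["volume"]
```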
  • The information processing device 1 may also determine the facial expression of the avatar based on features of the text information, for example when the user selects automatic setting for the facial expression of the avatar. For example, the information processing device 1 determines whether or not the text information spoken by the avatar contains a specific word, keyword, or the like, and if such a word or keyword is contained, determines the facial expression associated with that word or keyword as the facial expression of the avatar. For example, if the text information includes words such as "happy" or "delicious", the information processing device 1 can set the avatar's facial expression to "smiling" and raise the pitch and volume of the avatar's utterance.
  • the information processing device 1 has a database in which specific words or keywords that can be included in text information, for example, are associated with facial expressions of avatars.
  • Similarly, the information processing apparatus 1 may automatically set gestures so that the avatar performs an associated gesture when uttering a specific word or keyword. For example, when the text information includes the word "Wow", the information processing device 1 can cause the avatar to make a gesture of opening its mouth and eyes wide and moving its hands. Further, for example, when the text information includes the word "No", the information processing device 1 can cause the avatar to make a gesture of shaking its head sideways.
  • the information processing device 1 has a database in which specific words or keywords that can be included in text information, for example, are associated with avatar gestures.
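Keyword-driven automatic setting of facial expressions and gestures can be sketched as a dictionary lookup over the words appearing in a text. The entries mirror the examples in the description; everything else (names, gesture labels) is an assumption.

```python
KEYWORD_EXPRESSIONS = {"happy": "smiling", "delicious": "smiling"}
KEYWORD_GESTURES    = {"wow": "eyes_wide_hands_up", "no": "shake_head"}

def annotate_text(text: str):
    """Return the facial expression and gestures suggested by keywords in the text."""
    lowered = text.lower()
    expression = next((e for k, e in KEYWORD_EXPRESSIONS.items() if k in lowered), None)
    gestures = [g for k, g in KEYWORD_GESTURES.items() if k in lowered]
    return expression, gestures

# Example: annotate_text("Wow, this is delicious!") -> ("smiling", ["eyes_wide_hands_up"])
```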
  • The information processing device 1 may also determine the facial expression of the avatar based on the text information by using, for example, a learning model that has undergone machine learning in advance so as to estimate the emotion of input text information. For example, by performing supervised learning of the learning model using learning data (teacher data) that associates text information with emotions, a learning model that estimates an emotion for input text information can be generated.
  • The information processing device 1 can input the text information corresponding to the utterance content of the avatar into this learning model, acquire the emotion estimation result output by the learning model, and determine the facial expression associated with the estimated emotion as the facial expression of the avatar.
  • the information processing apparatus 1 may input all text information prepared for generating one piece of content data to the learning model.
  • Alternatively, each set of sentences up to the point where an interval is inserted may be input to the learning model, or text information in any other unit or amount may be input to the learning model.
  • Machine learning processing for generating a learning model may be performed by a device different from the information processing device 1 .
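The supervised learning model described above can be prototyped, for example, as a bag-of-words text classifier. The sketch below uses scikit-learn as one possible realization; the teacher data, emotion labels, and the expression mapping are placeholders, not taken from the embodiment.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder teacher data: pairs of text and emotion label.
texts = ["I am so happy today", "This is terrible news", "Nothing special happened"]
emotions = ["joy", "anger", "neutral"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, emotions)

# At content-generation time, estimate the emotion of the avatar's next utterance
# and map it to a facial expression.
estimated = model.predict(["We are delighted to announce the results"])[0]
facial_expression = {"joy": "smiling", "anger": "angry", "neutral": "neutral"}[estimated]
```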
  • the information processing apparatus 1 generates content data in which the pitch or volume of the avatar's utterance is adjusted in accordance with the settings related to the facial expression of the avatar.
  • the information processing device 1 also estimates the emotion of the person who speaks the text information based on the text information corresponding to the utterance content of the avatar, and sets the facial expression of the avatar according to the estimated emotion.
  • The information processing apparatus 1 also determines features of the sentences included in the text information, such as whether or not a specific word or keyword is included, and generates content data in which the avatar speaks at a pitch and volume according to the determined features.
  • The information processing device 1 also stores in a database the correspondence between words that can be included in the text information and facial expressions or gestures of the avatar, and generates content data in which the avatar makes the corresponding facial expression or gesture when uttering those words.
  • the information processing apparatus 1 can be expected to link the expression of the avatar displayed in the content data, the pitch and volume of the uttered voice, the gesture, the content of the utterance, and the like.
  • The content data generated by the information processing apparatus 1 described above is so-called moving image content, that is, content that is simply viewed by the viewer.
  • the information processing apparatus 1 according to the present embodiment may receive information input from the viewer in the middle of the content, for example, and generate interactive content in which the content is switched according to the input information.
  • For example, a test for checking proficiency is conducted while a moving image such as a lecture is being output, and the moving image to be output next is switched according to the score of this test.
  • In this case, the information processing apparatus 1 accepts from the user, for example, an operation of creating test questions and an operation of creating answers to the test, and creates content that outputs the questions to the viewer and receives answers from the viewer.
  • the information processing apparatus 1 also receives from the user settings such as a scoring method based on responses received from viewers and switching conditions for switching a plurality of contents according to the scoring results.
  • FIG. 15 is a schematic diagram for explaining an example of a content switching setting method.
  • In this example, content switching is set on the speech editing screen. A line spoken by the avatar is set as the first item of the content, and "<Proficiency check test>" is set as the second item.
  • When the content reaches this second item, a proficiency check test prepared in advance is carried out.
  • the proficiency level confirmation test to be carried out at this time is created in advance by the user, for example, on a test content creation screen separately displayed by the information processing apparatus 1, or the like.
  • On the test content creation screen, for example, when a four-choice question is given as a test, the user inputs the question text, the sentences of the four options, and the answer indicating which of the options is correct, and the information processing apparatus 1 generates test content based on the received information.
  • the user can give an arbitrary name to the test content, and in this example, the name "proficiency level confirmation test" is given.
  • The information processing apparatus 1 may also receive from the user settings such as the points for each question and a formula for calculating the total score, and generate the test content accordingly.
  • In the third item following the "<Proficiency check test>" item, a content switching condition such as "Branch if score < 80 goto No.11" is set.
  • the score of the proficiency check test is stored in the variable score, and if the score is less than 80, it is set to branch to the 11th item of the content.
  • the description method of the branch condition shown in FIG. 15 is an example, and content switching may be set in any format.
  • If the score of the proficiency check test set as the third item is 80 points or more, content corresponding to the items from the fourth item onward is output, in which the avatar speaks lines such as "Let's move on.". If the score of the proficiency check test is less than 80 points, the 4th to 10th items are not output, and content corresponding to the 11th item is output, in which the avatar speaks a line such as "Start a repair course".
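A branch description such as "Branch if score < 80 goto No.11" can be handled by a small parser that extracts the variable, the threshold and the jump target, and returns the index of the next item to output. The grammar assumed below covers only the single example shown in FIG. 15 and is an illustrative sketch, not the embodiment's actual format.

```python
import re

BRANCH_RE = re.compile(r"Branch if (\w+)\s*<\s*(\d+)\s*goto No\.(\d+)", re.IGNORECASE)

def next_item(current_index: int, item_text: str, variables: dict) -> int:
    """Return the index of the next content item, applying a branch if one is set."""
    m = BRANCH_RE.fullmatch(item_text.strip())
    if m:
        var, threshold, target = m.group(1), int(m.group(2)), int(m.group(3))
        if variables.get(var, 0) < threshold:
            return target          # e.g. jump to item No.11
    return current_index + 1       # otherwise continue with the following item

# Example: the viewer scored 65 on the proficiency check test.
print(next_item(3, "Branch if score < 80 goto No.11", {"score": 65}))  # -> 11
```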
  • <Material> FIGS. 16 to 44 are materials related to the information processing system according to this embodiment.
  • As described above, the information processing apparatus 1 according to the present embodiment acquires presentation materials created in advance, acquires text information related to the presentation voice, receives settings related to the presenter's avatar, and generates content data in which the avatar is displayed together with the presentation materials and the avatar utters a voice corresponding to the text information. Accordingly, the information processing apparatus 1 can be expected to support the user's generation of content data for presentation.
  • the information processing apparatus 1 acquires the voice information related to the presentation of the presenter, and acquires the text information by converting the acquired voice information. Accordingly, the information processing apparatus 1 can be expected to reduce the user's burden of creating text information.
  • the information processing apparatus 1 also receives settings related to the pronunciation of words included in the text information, and generates content data in which the avatar utters the words with the pronunciation according to the received settings. At this time, the information processing apparatus 1 displays text information, accepts word selection from the user, and displays the phonetic notation of the selected word. The information processing device 1 also accepts corrections to the displayed phonetic notation and causes the avatar to utter words in the corrected phonetic notation. As a result, the information processing apparatus 1 can be expected to facilitate the user's operation of setting the pronunciation of words.
  • The information processing apparatus 1 also accepts, for a plurality of texts included in the acquired text information, the setting of an interval at which the voice corresponding to each text is output, or the setting of the avatar that speaks each text, and generates content data in which the avatar speaks the texts according to the received settings.
  • the information processing apparatus 1 also receives settings for facial expressions or gestures of the avatar when uttering text, and generates content data in which the avatar speaks with facial expressions or gestures according to the received settings. Accordingly, the information processing apparatus 1 can be expected to facilitate the user's avatar setting operation.
  • The information processing apparatus 1 also displays a content editing screen that includes a content editing area 112 (first area) in which the avatar is superimposed and displayed on a background image based on the presentation material, an avatar setting area 113 (second area) that displays setting items related to the avatar displayed in the content editing area 112, and a background image selection area 111 (third area) that displays a plurality of images included in the presentation material.
  • the avatar setting area 113 includes, for example, setting items for gestures performed by the avatar, setting items for the position of the avatar, setting items for the orientation of the avatar, setting items for the size of the avatar, and the like.
  • the avatar setting area 113 may be provided with setting items such as the pitch, speed, depth, or volume of the voice uttered by the avatar.
  • the information processing apparatus 1 can be expected to facilitate the user's setting operation regarding the background image and the avatar based on the presentation material.
  • The information processing apparatus 1 receives an input of a caption character string to be displayed together with a background image based on the presentation material, and generates content data in which the received character string is displayed together with the background image of the presentation material. Thereby, the information processing apparatus 1 can generate content data to which various information is added in addition to presentation materials and utterances of avatars.
  • A computer program can be deployed to be executed on a single computer, or on multiple computers located at one site or distributed across multiple sites and interconnected by a communication network.

Abstract

Provided is a computer program, an information processing method, and an information processing device, by which support for generating contents data for presentation can be expected. The computer program according to the present embodiment causes a computer to perform processing for acquiring presentation materials, acquiring text information for a presentation voice, receiving settings for the avatar of a presenter, and generating contents data in which the avatar is displayed with the presentation materials and the voice corresponding to the text information is uttered by the avatar. The computer program may acquire voice information for the presentation of the presenter and convert the voice information to acquire the text information. The computer program may receive settings for the pronunciations of words included in the text information and generate the contents data in which the avatar utters the words with the pronunciations corresponding to the received settings.

Description

コンピュータプログラム、情報処理方法及び情報処理装置Computer program, information processing method and information processing apparatus
 本発明は、発表のためのコンテンツデータを生成するコンピュータプログラム、情報処理方法及び情報処理装置に関する。 The present invention relates to a computer program, an information processing method, and an information processing apparatus for generating content data for presentation.
 特許文献1においては、集合住宅の大規模修繕工事に係るプレゼンテーションシステムが提案されている。このプレゼンテーションシステムは、テキスト及び静止画の組み合わせからなる図解資料データと、施工対象物件における現実の事前調査状況を記録した調査状況動画データと、施工対象物件に類似する疑似施工物件における現実の施工状況を記録した疑似体験動画データとを有し、図解資料データに基づき、図解資料信号を生成して表示装置に発信し、調査状況動画データ及び疑似体験動画データに基づき、それぞれ調査状況動画信号と疑似体験動画信号を生成して表示装置に発信する。 Patent Document 1 proposes a presentation system related to large-scale repair work for collective housing. This presentation system consists of illustrated material data consisting of a combination of text and still images, survey situation video data that records the actual preliminary survey situation of the construction target property, and the actual construction situation of a pseudo-construction property similar to the construction target property. Based on the illustrated material data, an illustrated material signal is generated and transmitted to the display device, and based on the investigation situation video data and the simulated experience video data, the investigation situation video signal and the simulated An experience video signal is generated and transmitted to the display device.
特開2021-68265号公報Japanese Patent Application Laid-Open No. 2021-68265
 例えば顧客に対するプレゼンテーションのように、発表者が複数の聴衆に対する発表を行う場合、発表者が複数ページの画像で構成した発表用の資料を予め作成し、作成した発表資料をディスプレイ又はプロジェクタ等により順に表示し、表示されたページに関する情報等を発表者が説明することが従来行われていた。近年では、発表の際に静止画像を表示するのみでなく、動画像及び音声等の出力を行うなど、発表者は様々な工夫を凝らした発表を行っている。しかしこのような発表は、発表者が静止画像、動画像及び音声等の様々なデータを予め作成する必要があり、誰もが容易に行うことができるものではない。 For example, when a presenter makes a presentation to a plurality of audiences, such as a presentation to a customer, the presenter prepares in advance presentation materials composed of images of multiple pages, and sequentially displays the created presentation materials on a display or projector. Conventionally, the presenter explained the information on the displayed page and the like. In recent years, presenters have been making presentations with various ingenuity, such as outputting moving images and sound in addition to displaying still images at the time of presentation. Such a presentation, however, requires the presenter to prepare various data such as still images, moving images, and voices in advance, which is not something that anyone can easily do.
 本発明は、斯かる事情に鑑みてなされたものであって、その目的とするところは、発表のためのコンテンツデータの生成を支援することが期待できるコンピュータプログラム、情報処理方法及び情報処理装置を提供することにある。 The present invention has been made in view of such circumstances, and its object is to provide a computer program, an information processing method, and an information processing apparatus that can be expected to support the generation of content data for presentation. to provide.
 一実施形態に係るコンピュータプログラムは、コンピュータに、発表資料を取得し、発表音声に係るテキスト情報を取得し、発表者のアバターに係る設定を受け付け、前記発表資料と共に前記アバターが表示され且つ前記テキスト情報に対応する音声を前記アバターが発話するコンテンツデータを生成する処理を実行させる。 A computer program according to one embodiment acquires presentation materials in a computer, acquires text information related to presentation audio, receives settings related to a presenter's avatar, displays the avatar together with the presentation materials, and displays the text information. A process of generating content data in which the avatar utters a voice corresponding to information is executed.
 一実施形態による場合は、発表のためのコンテンツデータの生成を支援することが期待できる。 According to one embodiment, it can be expected to support generation of content data for presentation.
FIG. 1 is a schematic diagram for explaining an overview of an information processing system according to an embodiment.
FIG. 2 is a schematic diagram for explaining an overview of the information processing system according to the embodiment.
FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to the embodiment.
FIG. 4 is a flowchart showing the procedure of content data generation processing performed by the information processing apparatus according to the embodiment.
FIG. 5 is a schematic diagram showing an example of an utterance editing screen.
FIG. 6 is a schematic diagram for explaining a pronunciation correcting operation.
FIG. 7 is a schematic diagram showing an example of a pronunciation correction dialog box.
FIG. 8 is a schematic diagram showing an example of a content editing screen.
FIG. 9 is a schematic diagram showing an example of a content editing screen provided with a caption setting area.
FIG. 10 is a schematic diagram showing another example of the utterance editing screen.
FIGS. 11 to 13 are schematic diagrams for explaining examples of a camerawork setting method.
FIG. 14 is a schematic diagram for explaining an overview of sound image localization technology.
FIG. 15 is a schematic diagram for explaining an example of a content switching setting method.
FIGS. 16 to 44 are materials related to the information processing system according to the present embodiment.
 本発明の実施形態に係る情報処理システムの具体例を、以下に図面を参照しつつ説明する。なお、本発明はこれらの例示に限定されるものではなく、請求の範囲によって示され、請求の範囲と均等の意味及び範囲内でのすべての変更が含まれることが意図される。 A specific example of the information processing system according to the embodiment of the present invention will be described below with reference to the drawings. The present invention is not limited to these exemplifications, but is indicated by the scope of the claims, and is intended to include all modifications within the meaning and scope of equivalents to the scope of the claims.
<システム概要>
 図1及び図2は、本実施の形態に係る情報処理システムの概要を説明するための模式図である。本実施の形態に係る情報処理システムでは、プレゼンテーション等の発表を行う発表者は、従来と同様に、例えばいわゆるプレゼンテーションソフトウェアを利用して発表資料を予め作成する。この発表資料には複数ページの画像等が含まれ、これらを順に表示すること(いわゆるスライドショー)により発表が行われる。図示の例では、複数の参加者がネットワークを介して参加するオンライン会議において発表資料を用いた発表者によるオンラインプレゼンテーションが行われている。本実施の形態に係る情報処理システムでは、この発表資料中に記載された文章から音声情報が抽出される(図1参照)。また情報処理システムでは、発表者が行ったオンラインプレゼンテーションの音声を録音することで音声情報の取得が行われてもよい(図2参照)。なお音声のみでなく、映像と音声とが共に記録されてもよい。また発表者による発表はオンラインプレゼンテーションでなくてよく、例えばオフラインでの発表の音声を録音機器にて録音してもよい。
<System overview>
1 and 2 are schematic diagrams for explaining an outline of an information processing system according to this embodiment. In the information processing system according to the present embodiment, a presenter who gives a presentation or the like prepares presentation materials in advance using, for example, so-called presentation software, as in the conventional art. This presentation material includes images of a plurality of pages, etc., and the presentation is made by displaying these in order (so-called slide show). In the illustrated example, an online presentation is given by a presenter using presentation materials in an online conference in which a plurality of participants participate via a network. In the information processing system according to the present embodiment, voice information is extracted from the text written in the presentation material (see FIG. 1). Further, in the information processing system, audio information may be acquired by recording the audio of the online presentation given by the presenter (see FIG. 2). It should be noted that not only audio but also video and audio may be recorded together. Also, the presentation by the presenter does not have to be an online presentation, and for example, the voice of the offline presentation may be recorded with a recording device.
 本実施の形態に係る情報処理システムでは、発表者の発表の音声を録音した音声情報を利用することができ、この場合に情報処理システムは音声認識処理により音声情報をテキスト情報に変換する。本実施の形態に係る情報処理システムは、発表者が作成した発表資料と、発表資料に含まれる文章等のテキスト情報及び/又は音声情報を変換したテキスト情報とを基に、発表者のアバターが発表資料を用いた発表を行う動画像を含むコンテンツデータを生成する。生成されたコンテンツデータを例えばディスプレイに表示する、プロジェクタで投影する、又は、動画像配信サイト等にて配信することにより、発表者は同じ内容の発表を繰り返して行う必要がなくなる。 In the information processing system according to the present embodiment, voice information obtained by recording the voice of the presenter's presentation can be used. In this case, the information processing system converts the voice information into text information by voice recognition processing. In the information processing system according to the present embodiment, based on the presentation material created by the presenter and text information such as sentences included in the presentation material and/or text information converted from voice information, the presenter's avatar Generate content data including a moving image of a presentation using presentation materials. By displaying the generated content data on a display, projecting it with a projector, or distributing it on a moving image distribution site or the like, the presenter does not have to repeatedly present the same content.
 また上記の例では、発表者の発表を録音した音声情報をテキスト情報に変換しているが、これに限るものではない。発表者は、発表の際に話す台詞をテキスト情報として作成してもよい。この場合に情報処理システムは、発表者が予め作成した発表資料及びテキスト情報を取得し、これらを基にコンテンツデータを生成する。即ち、情報処理システムがコンテンツデータの生成に用いるテキスト情報は、音声認識により音声情報から変換されたものであるか否かが問われない。発表者は、テキスト情報を予め生成することにより、自らが発表を行うことなく、自らのアバターが発表を行うコンテンツデータを生成することができる。 Also, in the above example, the voice information recorded by the presenter is converted into text information, but it is not limited to this. The presenter may create text information of lines to be spoken at the time of presentation. In this case, the information processing system acquires presentation materials and text information prepared in advance by the presenter, and generates content data based on these. That is, it does not matter whether text information used by the information processing system to generate content data is converted from voice information by voice recognition. By generating text information in advance, the presenter can generate content data in which his or her avatar presents without presenting themselves.
 本実施の形態に係る情報処理システムは、上記の発表資料及びテキスト情報を取得し、テキスト情報を合成音声により読み上げた音声情報を生成する。また情報処理システムは、例えば予め用意された複数のアバターのデータから選択された発表者のアバターに、生成した音声情報に対応した口の動き及びジェスチャー等を行わせることによって、アバターが発表に関する台詞を発話する態様の映像データを生成する。また情報処理システムは、取得した発表資料に含まれる複数ページのスライド等の画像を背景画像データとして用い、この背景画像にアバターの映像を重畳し、音声情報を加えることによって、コンテンツデータを生成する。コンテンツデータは、例えば動画像ファイルとして出力され、適宜のディスプレイ装置もしくはプロジェクタ等による表示、又は、動画像配信サイト等での配信に用いることができる。 The information processing system according to the present embodiment acquires the above presentation materials and text information, and generates voice information by reading out the text information using synthesized voice. In addition, the information processing system allows the avatar of the presenter, which is selected from data of a plurality of avatars prepared in advance, to perform mouth movements and gestures corresponding to the generated voice information, so that the avatar can express lines related to the presentation. generates video data in a mode of uttering In addition, the information processing system uses images such as multiple pages of slides included in the acquired presentation material as background image data, superimposes the avatar video on this background image, and adds audio information to generate content data. . The content data is output as, for example, a moving image file, and can be used for display on an appropriate display device, projector, or the like, or for distribution on a moving image distribution site or the like.
 本実施の形態に係る情報処理システムは、このコンテンツデータの生成に関して例えばアバターの外観、アバターが行うジェスチャー、アバターの発話として出力される音声の特徴、又は、音声出力される単語の発音等の種々の設定を発表者から受け付け、この設定を反映したコンテンツデータを生成する。これにより情報処理システムは、発表者の好み及び目的等に適したコンテンツデータの生成を支援することが期待できる。 In the information processing system according to the present embodiment, regarding the generation of this content data, for example, the appearance of the avatar, the gestures performed by the avatar, the characteristics of the voice output as the utterance of the avatar, or the pronunciation of the words output as voice. setting is received from the presenter, and content data reflecting this setting is generated. As a result, the information processing system can be expected to support generation of content data suitable for the presenter's preference and purpose.
<装置構成>
 図3は、本実施の形態に係る情報処理装置の一構成例を示すブロック図である。本実施の形態に係る情報処理装置1は、処理部11、記憶部(ストレージ)12、通信部(トランシーバ)13、表示部(ディスプレイ)14及び操作部15等を備えて構成されている。本実施の形態に係る情報処理装置1は、例えばパーソナルコンピュータ又はタブレット型端末装置等の汎用的な情報処理装置を用いて構成され得る。なお本実施の形態においては、1つの情報処理装置1にて処理が行われるものとして説明を行うが、複数の情報処理装置が分散して処理を行ってもよい。また以下において情報処理装置1を利用するユーザには、発表者を想定して説明を行うが、これに限るものではなく、情報処理装置1を利用してコンテンツデータを生成する作業を行うユーザは発表者以外であってもよい。
<Device configuration>
FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to this embodiment. The information processing apparatus 1 according to the present embodiment includes a processing unit 11, a storage unit (storage) 12, a communication unit (transceiver) 13, a display unit (display) 14, an operation unit 15, and the like. The information processing device 1 according to the present embodiment can be configured using a general-purpose information processing device such as a personal computer or a tablet terminal device. In this embodiment, one information processing apparatus 1 performs the processing, but a plurality of information processing apparatuses may perform the processing in a distributed manner. In the following description, the user who uses the information processing device 1 is assumed to be a presenter, but the presenter is not limited to this. It may be someone other than the presenter.
 処理部11は、CPU(Central Processing Unit)、MPU(Micro-Processing Unit)、GPU(Graphics Processing Unit)又は量子プロセッサ等の演算処理装置、ROM(Read Only Memory)及びRAM(Random Access Memory)等を用いて構成されている。処理部11は、記憶部12に記憶されたプログラム12aを読み出して実行することにより、発表資料及びテキスト情報等を取得する処理、ユーザによる各種の設定を受け付ける処理、並びに、取得した情報及び受け付けた設定に基づいてコンテンツデータを生成する処理等の種々の処理を行う。 The processing unit 11 includes an arithmetic processing unit such as a CPU (Central Processing Unit), MPU (Micro-Processing Unit), GPU (Graphics Processing Unit) or quantum processor, ROM (Read Only Memory), RAM (Random Access Memory), etc. It is configured using By reading and executing the program 12a stored in the storage unit 12, the processing unit 11 performs processing for acquiring presentation materials, text information, etc., processing for accepting various settings by the user, and processing for acquired information and received information. Various processes such as a process of generating content data based on the settings are performed.
 記憶部12は、例えばハードディスク等の大容量の記憶装置を用いて構成されている。記憶部12は、処理部11が実行する各種のプログラム、及び、処理部11の処理に必要な各種のデータを記憶する。本実施の形態において記憶部12は、処理部11が実行するプログラム12aを記憶する。また記憶部12には、情報処理装置1が生成したコンテンツデータを記憶するコンテンツデータ記憶部12bが設けられている。 The storage unit 12 is configured using a large-capacity storage device such as a hard disk. The storage unit 12 stores various programs executed by the processing unit 11 and various data required for processing by the processing unit 11 . In the present embodiment, the storage unit 12 stores a program 12a executed by the processing unit 11. FIG. The storage unit 12 is also provided with a content data storage unit 12b that stores content data generated by the information processing apparatus 1 .
 本実施の形態においてプログラム(コンピュータプログラム、プログラム製品)12aは、メモリカード又は光ディスク等の記録媒体99に記録された態様で提供され、情報処理装置1は記録媒体99からプログラム12aを読み出して記憶部12に記憶する。ただし、プログラム12aは、例えば情報処理装置1の製造段階において記憶部12に書き込まれてもよい。また例えばプログラム12aは、遠隔のサーバ装置等が配信するものを情報処理装置1が通信にて取得してもよい。例えばプログラム12aは、記録媒体99に記録されたものを書込装置が読み出して情報処理装置1の記憶部12に書き込んでもよい。プログラム12aは、ネットワークを介した配信の態様で提供されてもよく、記録媒体99に記録された態様で提供されてもよい。 In the present embodiment, the program (computer program, program product) 12a is provided in a form recorded in a recording medium 99 such as a memory card or an optical disk, and the information processing apparatus 1 reads out the program 12a from the recording medium 99 and stores it in the storage unit. 12. However, the program 12a may be written in the storage unit 12 at the manufacturing stage of the information processing device 1, for example. Further, for example, the program 12a may be distributed by a remote server device or the like and acquired by the information processing device 1 through communication. For example, the program 12 a may be recorded in the recording medium 99 and read by a writing device and written in the storage unit 12 of the information processing device 1 . The program 12 a may be provided in the form of distribution via a network, or may be provided in the form of being recorded on the recording medium 99 .
 コンテンツデータ記憶部12bは、発表資料及びテキスト情報等の情報に基づいて情報処理装置1が生成したコンテンツデータを記憶する。コンテンツデータは例えばMPEG-4形式の動画像のファイルとしてコンテンツデータ記憶部12bに記憶される。コンテンツデータ記憶部12bは、この動画像のファイルと共に、例えば発表のタイトル、発表者名、発表日時又は発表内容の概要等の種々の情報を記憶してよい。 The content data storage unit 12b stores content data generated by the information processing device 1 based on information such as presentation materials and text information. The content data is stored in the content data storage unit 12b as a moving image file in MPEG-4 format, for example. The content data storage unit 12b may store various information such as the title of the presentation, the name of the presenter, the date and time of the presentation, or an overview of the contents of the presentation, together with the moving image file.
 通信部13は、例えばインターネット、LAN(Local Area Network)又は携帯電話通信網等を含むネットワークNを介して、種々の装置との間で通信を行う。情報処理装置1は、通信部13にて他の装置との通信を行うことにより、例えばプログラム12aの取得(ダウンロード)、オンラインプレゼンテーションの実施、及び、生成したコンテンツデータの配信等の処理を行うことができる。通信部13は、処理部11から与えられたデータを他の装置へ送信すると共に、他の装置から受信したデータを処理部11へ与える。 The communication unit 13 communicates with various devices via a network N including, for example, the Internet, a LAN (Local Area Network), or a mobile phone communication network. The information processing device 1 performs processing such as acquisition (downloading) of the program 12a, implementation of an online presentation, distribution of generated content data, etc., by communicating with other devices through the communication unit 13. can be done. The communication unit 13 transmits the data given from the processing unit 11 to other devices, and gives the data received from the other devices to the processing unit 11 .
 表示部14は、液晶ディスプレイ等を用いて構成されており、処理部11の処理に基づいて種々の画像及び文字等を表示する。操作部15は、ユーザの操作を受け付け、受け付けた操作を処理部11へ通知する。例えば操作部15は、機械式のボタン又は表示部14の表面に設けられたタッチパネル等の入力デバイスによりユーザの操作を受け付ける。また例えば操作部15は、マウス及びキーボード等の入力デバイスであってよく、これらの入力デバイスは情報処理装置1に対して取り外すことが可能な構成であってもよい。 The display unit 14 is configured using a liquid crystal display or the like, and displays various images, characters, etc. based on the processing of the processing unit 11. The operation unit 15 receives a user's operation and notifies the processing unit 11 of the received operation. For example, the operation unit 15 receives a user's operation using an input device such as mechanical buttons or a touch panel provided on the surface of the display unit 14 . Further, for example, the operation unit 15 may be an input device such as a mouse and a keyboard, and these input devices may be detachable from the information processing apparatus 1 .
 また本実施の形態に係る情報処理装置1には、記憶部12に記憶されたプログラム12aを処理部11が読み出して実行することにより、情報取得部11a、アバターデータ生成部11b、音声データ生成部11c、背景データ生成部11d、コンテンツデータ生成部11e及び表示処理部11f等が、ソフトウェア的な機能部として処理部11に実現される。なお本図においては、処理部11の機能部として、コンテンツデータの生成に関連する機能部を図示し、これ以外の処理に関する機能部は図示を省略している。 Further, in the information processing apparatus 1 according to the present embodiment, the program 12a stored in the storage unit 12 is read out and executed by the processing unit 11, whereby the information acquisition unit 11a, the avatar data generation unit 11b, the voice data generation unit 11c, the background data generation unit 11d, the content data generation unit 11e, the display processing unit 11f, and the like are implemented in the processing unit 11 as software functional units. In this figure, as the functional units of the processing unit 11, functional units related to content data generation are illustrated, and functional units related to other processes are omitted.
 情報取得部11aは、コンテンツデータの生成に必要な発表資料及びテキスト情報等の情報を取得する処理を行う。例えば情報取得部11aは、ユーザが予め作成した発表資料の情報を取得する。本実施の形態においてユーザは、例えば既存のプレゼンテーションソフトウェア等を利用して、発表内容をまとめた文章、グラフ及びイラスト等を含む複数ページの画像(スライド)を発表資料として予め作成する。発表資料の作成は、情報処理装置1にて行われてもよく、他の装置にて行われてもよい。本実施の形態においては、ユーザが予め作成した発表資料のデータを情報処理装置1の記憶部12に記憶し、情報取得部11aは、記憶部12に記憶された発表資料を読み出すことで、発表資料を取得する。 The information acquisition unit 11a performs processing for acquiring information such as presentation materials and text information necessary for generating content data. For example, the information acquisition unit 11a acquires information on presentation materials prepared in advance by the user. In this embodiment, the user prepares in advance a multi-page image (slides) including sentences summarizing the content of the presentation, graphs, illustrations, etc., as a presentation material using, for example, existing presentation software. The presentation material may be created by the information processing device 1 or by another device. In the present embodiment, the data of the presentation material prepared in advance by the user is stored in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a reads out the presentation material stored in the storage unit 12 to perform the presentation. Get materials.
 また例えば情報取得部11aは、生成するコンテンツデータにおいて発表者のアバターが発話する台詞に相当するテキスト情報を取得する。本実施の形態においては、上記の発表資料に記載された文章、文字又は単語等がアバターの台詞として用いられ得る。この場合には、例えば発表資料に含まれる文章等のいずれをアバターの台詞として用いるかを、この発表資料を作成するプレゼンテーションソフトウェアのコメント機能等を利用して作成者が予め設定する。情報取得部11aは、発表資料に付されたコメント等を認識して、発表資料に含まれる文章等をアバターの台詞とするためのテキスト情報として抽出することができる。 Also, for example, the information acquisition unit 11a acquires text information corresponding to the lines spoken by the presenter's avatar in the generated content data. In the present embodiment, sentences, characters, words, or the like described in the presentation material can be used as the avatar's lines. In this case, for example, the creator sets in advance which of the sentences included in the presentation material is to be used as the speech of the avatar using the comment function or the like of the presentation software for creating this presentation material. The information acquisition unit 11a can recognize comments and the like attached to the presentation material and extract sentences and the like included in the presentation material as text information for making the avatar's lines.
 また例えば情報取得部11aは、発表資料に基づいて発表者が実際に発表を行った際にこの発表者が話した台詞を、コンテンツデータにおいてアバターが発話する台詞とすることもできる。この場合には、発表者がオンラインプレゼンテーション等で発表を行う際に録画又は録音が行われ、発表者が話した台詞を含む音声情報が予め用意される。ユーザはこの音声情報を情報処理装置1の記憶部12に記憶し、情報取得部11aは、記憶部12に記憶された音声情報を読み出すことで、音声情報を取得する。音声情報を取得した情報取得部11aは、例えばこの音声情報に対していわゆる音声認識処理を行い、音声情報をテキスト情報に変換することによって、テキスト情報を取得する。なお音声認識処理による音声情報からテキスト情報への変換は、既存の技術であるため、詳細な説明を省略する。情報処理装置1は、音声認識処理を自ら行ってもよく、音声認識処理を行う他の装置に音声情報を送信し、他の装置が音声認識処理により変換したテキスト情報を取得してもよい。 Also, for example, the information acquisition unit 11a can use the lines spoken by the presenter when the presenter actually made a presentation based on the presentation material as the lines spoken by the avatar in the content data. In this case, when the presenter gives an online presentation or the like, the presentation is video-recorded or recorded, and voice information including the lines spoken by the presenter is prepared in advance. The user stores this voice information in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a acquires the voice information by reading out the voice information stored in the storage unit 12. FIG. The information acquisition unit 11a that has acquired the voice information acquires text information by, for example, performing so-called voice recognition processing on this voice information and converting the voice information into text information. Since the conversion from voice information to text information by voice recognition processing is an existing technology, detailed description thereof will be omitted. The information processing device 1 may perform the speech recognition processing itself, or may transmit the speech information to another device that performs the speech recognition processing, and acquire the text information converted by the speech recognition processing by the other device.
 またテキスト情報の取得方法は、上述のような発表資料に含まれる文章等からのテキスト情報の抽出及び発表者が実際に発表した音声情報に基づくテキスト情報の取得等の方法の他に、例えばアバターの台詞となるテキスト情報をユーザが直接的に作成したものを取得するという方法も採用されてよい。この場合にユーザは、例えばテキストエディタ又は文章作成ソフトウェア等を用いて、アバターの台詞に相当する文章を作成し、テキスト情報として記憶部12に記憶する。テキスト情報の作成は、情報処理装置1にて行われてもよく、他の装置にて行われてもよい。ユーザは作成したテキスト情報を記憶部12に記憶し、情報取得部11aは、記憶部12に記憶されたテキスト情報を読み出すことで、テキスト情報を取得することができる。 In addition to the above-described methods of extracting text information from sentences included in presentation materials and obtaining text information based on voice information actually announced by the presenter, the method of acquiring text information includes, for example, avatar It is also possible to adopt a method of obtaining text information that is directly created by the user. In this case, the user creates sentences corresponding to the avatar's dialogue using, for example, a text editor or sentence creation software, and stores the sentences in the storage unit 12 as text information. The text information may be created by the information processing device 1 or by another device. The user stores the created text information in the storage unit 12, and the information acquisition unit 11a can acquire the text information by reading the text information stored in the storage unit 12. FIG.
 アバターデータ生成部11bは、コンテンツデータに登場する発表者のアバターに関するデータを生成する処理を行う。アバターデータ生成部11bは、例えばデータベースに記憶された複数のアバターに関する情報を表示部14に一覧表示して、ユーザからアバターの選択を受け付ける。アバターデータ生成部11bは、選択されたアバターのデータをデータベースから取得し、取得したデータを基にこのアバターの外観を示すプレビュー画面を表示部14に表示する。アバターデータ生成部11bは、このプレビュー画面にてアバターの色又は形状等の編集操作をユーザから受け付けて、編集されたアバターをコンテンツデータに登場させるアバターとする。またアバターデータ生成部11bは、生成するコンテンツデータにおいてアバターを表示する位置、アバターの向き又はアバターが行う動き(ジェスチャー)等の種々の設定をユーザから受け付けて、受け付けた設定をプレビュー画面のアバターに反映させる。アバターデータ生成部11bは、アバターの形状等のデータと、このアバターの表示位置等の設定とを含むアバターデータを生成して、記憶部12に記憶する。なお、コンテンツデータには複数のアバターが登場してよく、この場合にアバターデータ生成部11bは、複数のアバターについてアバターデータを生成する。 The avatar data generation unit 11b performs processing for generating data related to the presenter's avatar appearing in the content data. The avatar data generation unit 11b displays, for example, a list of information about a plurality of avatars stored in the database on the display unit 14, and receives selection of an avatar from the user. The avatar data generation unit 11b acquires data of the selected avatar from the database, and displays a preview screen showing the appearance of the avatar on the display unit 14 based on the acquired data. The avatar data generation unit 11b accepts an editing operation such as the color or shape of the avatar from the user on this preview screen, and uses the edited avatar as an avatar to appear in the content data. In addition, the avatar data generation unit 11b accepts various settings from the user such as the position at which the avatar is displayed, the direction of the avatar, and the movements (gestures) performed by the avatar in the content data to be generated. To reflect. The avatar data generation unit 11 b generates avatar data including data such as the shape of the avatar and settings such as the display position of the avatar, and stores the data in the storage unit 12 . A plurality of avatars may appear in the content data, and in this case, the avatar data generating section 11b generates avatar data for the plurality of avatars.
 音声データ生成部11cは、コンテンツデータにおいてアバターが発話する音声のデータを生成する処理を行う。音声データ生成部11cは、情報取得部11aが取得したテキスト情報を基にいわゆるテキスト読み上げの処理を行うことによって、テキスト情報を音声データに変換する。テキスト読み上げ処理は、既存の技術であるため、詳細な説明は省略する。情報処理装置1は、テキスト読み上げ処理を自ら行ってもよく、テキスト読み上げ処理を行う他の装置にテキスト情報を送信し、他の装置がテキスト読み上げ処理により変換した音声データを取得してもよい。 The audio data generation unit 11c performs processing for generating audio data spoken by the avatar in the content data. The voice data generation unit 11c performs so-called text-to-speech processing based on the text information acquired by the information acquisition unit 11a, thereby converting the text information into voice data. Since the text-to-speech processing is an existing technology, detailed description is omitted. The information processing apparatus 1 may perform the text-to-speech process by itself, or may transmit text information to another apparatus that performs the text-to-speech process, and acquire voice data converted by the text-to-speech process by the other apparatus.
 音声データ生成部11cは、例えば情報取得部11aが取得したテキスト情報に含まれる一又は複数のテキストを順に表示部14に表示し、音声データへ変換するテキストの選択を受け付ける。また音声データ生成部11cは、生成する音声データについて、例えばテキスト読み上げのピッチ、速度、深さ(声の太さ)、声の高さ、声音、声質又は声量等に関する設定をユーザから受け付けて、受け付けた設定を反映した音声データを生成する。音声データ生成部11cは、生成した音声データを例えばスピーカ又はイヤホン等の音声出力装置から出力する。また音声データ生成部11cは、アバターが複数存在する場合に、アバターとテキストとの対応付けの設定を受け付けると共に、アバター毎に速度又は声質等の設定を受け付ける。 For example, the voice data generation unit 11c sequentially displays one or more texts included in the text information acquired by the information acquisition unit 11a on the display unit 14, and accepts selection of texts to be converted into voice data. In addition, the voice data generation unit 11c receives, from the user, settings related to, for example, the pitch, speed, depth (thickness of voice), pitch, voice, voice quality, or volume of voice data to be generated. Generates audio data that reflects the accepted settings. The audio data generator 11c outputs the generated audio data from an audio output device such as a speaker or an earphone. In addition, when there are a plurality of avatars, the voice data generation unit 11c accepts settings for association between avatars and texts, and also accepts settings such as speed or voice quality for each avatar.
 In the present embodiment, the audio data generation unit 11c also accepts settings related to pronunciation for, for example, a word or a short phrase included in the text information, and corrects the pronunciation of the target word in the audio data. For example, the text information may include the word "湯川", and the audio data generated by the audio data generation unit 11c may pronounce this word as "Yugawa" even though "Yukawa" is the correct pronunciation. In such a case, the user selects "湯川" in the displayed text and performs an operation of setting "Yukawa" as the correct pronunciation of this word. Upon accepting this setting, the audio data generation unit 11c generates audio data in which the pronunciation of every occurrence of "湯川" in the text information is changed from "Yugawa" to "Yukawa".
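 For illustration only, a minimal sketch of how such a pronunciation override might be applied to the text before it is handed to a text-to-speech engine is shown below in Python; the function name and dictionary are assumptions and do not appear in the embodiment.

# Hypothetical sketch: replace registered surface forms with the readings
# set by the user before passing the text to a text-to-speech engine.
pronunciation_overrides = {"湯川": "ユカワ"}  # surface form -> corrected reading

def apply_pronunciation_overrides(text: str, overrides: dict[str, str]) -> str:
    # Every occurrence of each registered word is replaced, mirroring the
    # behaviour in which all instances of "湯川" are read as "Yukawa".
    for surface, reading in overrides.items():
        text = text.replace(surface, reading)
    return text

speech_text = apply_pronunciation_overrides("こんにちは。私の名前は湯川です。", pronunciation_overrides)
# speech_text is then converted to audio data by the text-to-speech processing.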
 Note that this example deals with a word written in ideographic characters (kanji) in the text information and sets its pronunciation in phonetic characters (katakana or hiragana), but the pronunciation settings accepted by the audio data generation unit 11c are not limited to this. The audio data generation unit 11c may accept pronunciation settings expressed in, for example, phonemic characters (romaji) or phonetic symbols. The audio data generation unit 11c may also accept settings related to the pronunciation of a word, such as the position of the accent.
 The background data generation unit 11d performs processing for generating data of images serving as the background of the avatar in the content data. In the present embodiment, a plurality of images (slides) included in the presentation material acquired by the information acquisition unit 11a are used as background images of the avatar, and content data in which the avatar gives a presentation using the presentation material is generated. The background data generation unit 11d accepts settings such as the display order and the timing of switching the display for the plurality of images included in the presentation material acquired by the information acquisition unit 11a. The background data generation unit 11d generates background data including the plurality of background images and settings such as the timing at which they are displayed, and stores the background data in the storage unit 12.
 The background data generation unit 11d also performs processing for adding a caption character string, such as a title or subtitles, to a background image based on the presentation material. The background data generation unit 11d accepts input of a character string to be displayed as a caption from the user, as well as settings such as the position and orientation at which the caption character string is displayed, the size and font of the character string, and the timing at which the caption character string is displayed. The background data generation unit 11d stores the caption character strings and the settings related to them in the background data.
 The content data generation unit 11e generates, as content data, data of a moving image in which, for example, the avatar gives a presentation using the presentation material, based on the avatar data generated by the avatar data generation unit 11b, the audio data generated by the audio data generation unit 11c, and the background data generated by the background data generation unit 11d. The content data generation unit 11e can generate the content data by placing the avatar included in the avatar data on the background image included in the background data, at the position and in the orientation specified by the settings, and outputting the voice included in the audio data at the appropriate timing. The content data generation unit 11e stores the generated content data in the content data storage unit 12b of the storage unit 12.
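 For illustration only, the following Python sketch shows one possible way the three kinds of generated data could be kept together as a single content timeline; all class and field names are assumptions and are not part of the disclosed embodiment.

from dataclasses import dataclass, field

@dataclass
class AvatarData:            # shape plus display settings for one avatar
    name: str
    position: tuple[float, float]
    orientation: float
    gesture: str = "none"

@dataclass
class BackgroundData:        # one slide image and when to start showing it
    image_path: str
    start_time: float        # seconds from the start of the content

@dataclass
class SpeechSegment:         # one utterance produced by text-to-speech
    speaker: str
    audio_path: str
    start_time: float

@dataclass
class ContentData:           # the combined presentation content
    avatars: list[AvatarData] = field(default_factory=list)
    backgrounds: list[BackgroundData] = field(default_factory=list)
    speech: list[SpeechSegment] = field(default_factory=list)

def build_content(avatars, backgrounds, speech) -> ContentData:
    # Combining here simply means keeping the three streams together so that a
    # renderer can draw the current slide, overlay the avatars, and play each
    # utterance at its start time.
    return ContentData(list(avatars), list(backgrounds), list(speech))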
 The display processing unit 11f performs processing for displaying various information such as images and characters on the display unit 14. In the present embodiment, the display processing unit 11f performs processing such as displaying a screen for accepting settings related to avatars, a screen for accepting settings related to voice, a screen for accepting settings related to the background, and displaying the generated content data. In addition to performing these displays on the display unit 14 of the information processing apparatus 1, the display processing unit 11f may also transmit data for display to another apparatus via the communication unit 13 so that the display is performed on the display unit or the like of the other apparatus.
<Content data generation processing>
 FIG. 4 is a flowchart showing the procedure of the content data generation processing performed by the information processing apparatus 1 according to the present embodiment. The information acquisition unit 11a of the processing unit 11 of the information processing apparatus 1 according to the present embodiment acquires a presentation material file or the like created in advance by the user by reading it from the storage unit 12 (step S1).
 The information acquisition unit 11a also acquires text information for the avatar's speech (step S2). At this time, the information acquisition unit 11a acquires, for example, sentences and the like included in the presentation material acquired in step S1 as text information for the avatar's speech, based on comments or the like set in advance in the presentation material. Alternatively, the information acquisition unit 11a may acquire a file or the like of voice information recorded when the presenter gave a presentation, and obtain the text information by performing speech recognition processing to convert the voice information into text information. The information acquisition unit 11a may also acquire text information created in advance by the user, for example by writing the avatar's lines.
 Next, the display processing unit 11f of the processing unit 11 displays, on the display unit 14, a speech editing screen for making settings for uttering the text information, based on the text information acquired in step S2 (step S3). The audio data generation unit 11c of the processing unit 11 accepts editing related to the spoken voice by accepting the user's operations on the operation unit 15 while the speech editing screen is displayed (step S4). The audio data generation unit 11c generates audio data based on the text information, reflecting the edits accepted in step S4 (step S5).
 FIG. 5 is a schematic diagram showing an example of the speech editing screen. The information processing apparatus 1 according to the present embodiment displays the illustrated speech editing screen based on the text information acquired in step S2. On the speech editing screen, for example, a title character string "Speech editing" is shown at the top, a button labeled "Output all audio" and a button labeled "Add text" are arranged side by side below it, and below these a setting table 101 is provided in which a plurality of texts included in the text information and a plurality of setting items related to each text are arranged in a matrix.
 The setting table 101 is a table in which a plurality of texts are arranged as a list in the vertical direction and a plurality of setting items are arranged in the horizontal direction. The setting table 101 has, for example, the items "Number", "Text", "Interval (seconds)", "Speaker", and "Expression" in order from the left, and an icon area is provided at the right end. "Number" is numerical information indicating the order in which the texts are spoken by the avatars in the finally generated content data. "Text" is the text (sentences, lines, etc.) spoken by the avatar, and is character string information of one or more characters. In this example, the first text is set to "Hello. I am Dr. Value. Today, together with our new member College, I would like to introduce the results of our research so far. Nice to meet you, College.", and the second text is set to "Dr. Value, nice to meet you. I am very happy to become a member of the research team.".
 The information processing apparatus 1 can obtain the information displayed in "Number" and "Text" of the illustrated setting table 101 by, for example, appropriately dividing the sentences included in the acquired text information into a plurality of texts based on punctuation marks and the like, and numbering them in order. The division of the sentences included in the text information into a plurality of texts may be performed, for example, during speech recognition processing, may be performed in advance by the user, or may be performed, for example, when the information processing apparatus 1 acquires the text information. When the text is divided during speech recognition processing, for example, when there is an interval exceeding a predetermined time between utterances, the utterances before and after the interval can be divided into two texts. When the user divides the text, the division can be performed, for example, by the user checking the text information in a text editor or the like and inserting line breaks, tabs, or the like at appropriate places.
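 As an illustrative sketch only, the division and numbering could be implemented as follows in Python; the splitting rule shown is just one possible interpretation of dividing "based on punctuation marks and the like" and user-inserted line breaks or tabs.

import re

def split_into_texts(raw_text: str) -> list[tuple[int, str]]:
    # Split after sentence-ending punctuation and at explicit line breaks or
    # tabs inserted by the user, then number the resulting pieces in order.
    pieces = [p.strip() for p in re.split(r"(?<=[。．.!?！？])\s*|[\n\t]+", raw_text) if p.strip()]
    return [(i + 1, piece) for i, piece in enumerate(pieces)]

rows = split_into_texts("こんにちは。\nはじめまして。")
# rows == [(1, "こんにちは。"), (2, "はじめまして。")], corresponding to rows of the setting table 101.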
 The "Interval (seconds)" column of the setting table 101 is an item for setting, in seconds, the interval placed between the utterance of this text and the utterance of the preceding text. In this example, 0.5 seconds is set by the information processing apparatus 1 as the default value. "Speaker" is an item for setting which avatar speaks this text. In this example, "Dr. Value" is set to speak the first text and "College" is set to speak the second text. The information processing apparatus 1 can accept the "Speaker" setting by accepting a selection of one avatar from among avatars registered in advance, for example by means of a pull-down menu. "Expression" is an item for setting the facial expression of the avatar to which this text is assigned when the avatar speaks it. In this example, expressions such as "natural" and "smiling" are set. The information processing apparatus 1 can accept the "Expression" setting by accepting a selection of one expression from among expressions registered in advance, for example by means of a pull-down menu.
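 For illustration only, one row of the setting table 101 could be represented by a simple record such as the following; the type and field names are assumptions made for this sketch.

from dataclasses import dataclass

@dataclass
class UtteranceRow:
    number: int          # order in which the text is spoken
    text: str            # the sentence or line the avatar speaks
    interval_sec: float  # pause before this utterance (default 0.5)
    speaker: str         # name of the avatar, e.g. "Dr. Value"
    expression: str      # facial expression, e.g. "natural" or "smiling"

row = UtteranceRow(1, "こんにちは。私はバリュー博士です。", 0.5, "Dr. Value", "natural")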
 The information processing apparatus 1 displays, for example, an icon resembling a speaker and an icon resembling a trash can in the icon area at the right end of the setting table 101, in association with each text. The speaker-shaped icon is for accepting an operation to output the corresponding text as voice. When an operation on this icon is accepted, the information processing apparatus 1 performs voice output of the corresponding text only. The trash can icon is for accepting an operation to delete the text. When an operation on this icon is accepted, the information processing apparatus 1 deletes the corresponding text and its settings.
 The "Output all audio" button provided at the top of the speech editing screen is a button for performing voice output of all the texts. When the information processing apparatus 1 accepts an operation on the "Output all audio" button, it performs voice output of all the texts included in the setting table 101 in order from the first to the last. The "Add text" button is a button for adding an arbitrary text. When the information processing apparatus 1 accepts an operation on the "Add text" button, it displays, for example, a dialog box (not shown) for adding text, and accepts input of the text to be added as well as settings such as the order in which the text is spoken, the interval, the speaker, and the expression. The information processing apparatus 1 adds the text by inserting the text and settings accepted in this dialog box into the setting table 101 at the appropriate position.
 The information processing apparatus 1 according to the present embodiment also accepts a pronunciation correction operation from the user for a word or a short phrase included in the text information. FIG. 6 is a schematic diagram for explaining the pronunciation correction operation. FIG. 6 shows a speech editing screen in which the text "Hello. My name is Yukawa." ("こんにちは。私の名前は湯川です。") is set in the setting table 101. The user performs an operation of selecting the word "湯川" included in this text, for example using an input device such as a mouse. In response to this operation, the information processing apparatus 1 displays, for example, a button labeled "Correct pronunciation". When an operation on this button is accepted, the information processing apparatus 1 displays, for example, a pronunciation correction dialog box and accepts settings related to pronunciation from the user.
 FIG. 7 is a schematic diagram showing an example of the pronunciation correction dialog box. The upper part of FIG. 7 shows the state before pronunciation correction, and the lower part shows the state after pronunciation correction. In the pronunciation correction dialog box of this example, for example, the title character string "Correct pronunciation" is displayed at the top, a text box labeled "Target text" and a text box labeled "Pronunciation" are arranged one above the other below it, and further below these a button labeled "Audio output" and a button labeled "Done" are arranged side by side.
 The information processing apparatus 1 displays the word selected by the user on the speech editing screen in the "Target text" text box. In this example, "湯川" selected on the speech editing screen shown in FIG. 6 is displayed in the "Target text" text box. The information processing apparatus 1 also displays the pronunciation used when the target text is spoken in the "Pronunciation" text box, in phonetic notation such as katakana or hiragana. In the example in the upper part of FIG. 7, the katakana "ユガワ" (Yugawa) is displayed as the phonetic notation, indicating that under the current setting the word "湯川" is spoken with the pronunciation "Yugawa".
 The "Audio output" button of the pronunciation correction dialog box is a button for outputting only the word shown in "Target text" as voice. When an operation on the "Audio output" button is accepted, the information processing apparatus 1 performs voice output in which only the word shown in "Target text" is read aloud with the pronunciation set in "Pronunciation". In the example in the upper part of FIG. 7, voice output is performed with the pronunciation "Yugawa".
 The information processing apparatus 1 accepts correction of the pronunciation of the target text by accepting the user's correction of the phonetic notation displayed in the "Pronunciation" text box. In the example in the lower part of FIG. 7, the user corrects "ユガワ" (Yugawa), displayed in the text box as the current pronunciation of "湯川", to "ユカワ" (Yukawa) using an input device such as a keyboard. When the user operates the "Audio output" button in the example in the lower part of FIG. 7, the information processing apparatus 1 performs voice output with the pronunciation "Yukawa".
 The "Done" button of the pronunciation correction dialog box is a button for applying the user's pronunciation correction and closing the dialog box. When an operation on the "Done" button is accepted, the information processing apparatus 1 stores the word in the "Target text" box in association with the pronunciation set in the "Pronunciation" text box, and generates audio data in which the set pronunciation is applied to every occurrence of the same word included in the text information.
 Having performed the processing related to editing of the spoken voice in steps S3 to S5 of the flowchart shown in FIG. 4, the display processing unit 11f of the information processing apparatus 1 displays, on the display unit 14, a content editing screen for setting the avatar, background, and the like to be displayed in the content data, based on the presentation material acquired in step S1, the text information acquired in step S2, and the like (step S6). The avatar data generation unit 11b and the background data generation unit 11d of the processing unit 11 accept editing related to the avatar and the background by accepting the user's operations on the operation unit 15 while the content editing screen is displayed (step S7). The avatar data generation unit 11b generates avatar video data reflecting the edits accepted in step S7 (step S8). The background data generation unit 11d generates background image data reflecting the edits accepted in step S7 (step S9).
 FIG. 8 is a schematic diagram showing an example of the content editing screen. The information processing apparatus 1 according to the present embodiment displays the illustrated content editing screen based on the presentation material acquired in step S1, the text information acquired in step S2, and the like. On the content editing screen, for example, a title character string "Content editing" is shown at the top, and below it a background image selection area 111, a content editing area 112, and an avatar setting area 113 are arranged side by side in the horizontal direction.
 The background image selection area 111 of the content editing screen is an area for accepting the user's selection of a background image. The information processing apparatus 1 displays the plurality of slides included in the presentation material as a list in the background image selection area 111, each as a background image. In the content data, the plurality of background images listed in the background image selection area 111 are displayed in the order in which they are arranged in this area. The information processing apparatus 1 accepts a selection of one background image from among the background images listed in the background image selection area 111 and displays the selected background image in the content editing area 112. The information processing apparatus 1 also accepts operations such as adding and deleting background images and changing their display order, and performs the corresponding addition, deletion, reordering, and the like of the listed background images according to the accepted operations.
 The avatar setting area 113 of the content editing screen is an area for accepting settings related to one or more avatars appearing in the content data. Within the avatar setting area 113, an avatar selection area, a text display area, a setting acceptance area, and the like are arranged vertically. The information processing apparatus 1 displays a list of images, names, and the like of one or more avatars created in advance in the avatar selection area of the avatar setting area 113. In this example, the two avatars "Dr. Value" and "College" are displayed in the avatar selection area, and the avatar "Dr. Value" is selected. The information processing apparatus 1 displays the avatar selected in the avatar selection area in the content editing area 112. Although not illustrated, avatars are created on an avatar creation screen or the like. The avatars need not be created by the user; for example, the user may acquire and use avatars provided for a fee or free of charge. Since the method of creating an avatar is an existing technology, a detailed description is omitted.
 The text display area of the avatar setting area 113 is an area in which the text spoken by the selected avatar is displayed. The information processing apparatus 1 displays one or more texts included in the text information in the text display area, based on the text information acquired in step S2 or the text information as edited on the speech editing screen described above. The information processing apparatus 1 selects the plurality of texts included in the text information in output order and displays them in the text display area, and the user can change the text displayed in the text display area as appropriate.
 The setting acceptance area of the avatar setting area 113 is an area that accepts settings for a plurality of setting items related to the avatar selected in the avatar selection area. In this example, "Gesture", "Size", "Orientation", "Position", and the like are shown as setting items related to the avatar. The user can input or select a setting value for each setting item by various methods, such as direct numerical input or selection from a pull-down menu. The illustrated setting items are merely an example and are not limiting; the information processing apparatus 1 may provide various setting items other than those illustrated and accept settings related to the avatar through them. The information processing apparatus 1 may also provide, in the setting acceptance area, setting items related to the voice with which the avatar speaks the text, such as the pitch, speed, depth, or volume of the avatar's voice, and accept these settings from the user.
 The content editing area 112 of the content editing screen displays the avatar selected in the avatar selection area of the avatar setting area 113 superimposed on the background image selected in the background image selection area 111. As a result, an image reproducing one scene of the content data to be finally generated is displayed in the content editing area 112. The user can change the position, orientation, and the like of the avatar displayed in the content editing area 112, for example by mouse operation or touch operation, and the information processing apparatus 1 accepts changes to settings such as the position and orientation of the avatar in response to these user operations. When a setting change is accepted in the content editing area 112, the information processing apparatus 1 changes the setting value of the corresponding setting item provided in the setting acceptance area of the avatar setting area 113. Conversely, when a setting change is accepted in the setting acceptance area of the avatar setting area 113, the information processing apparatus 1 changes the display mode of the avatar displayed in the content editing area 112 according to the accepted setting.
 In the present embodiment, the user can also add a caption to the background image by performing a predetermined operation in the content editing area 112. For example, when the user designates a point in the content editing area 112 using a function such as a right-click menu of a mouse and performs a caption-adding operation, the information processing apparatus 1 displays a text box in the content editing area 112 for entering the caption character string, and displays a caption setting area in place of the avatar setting area 113 of the content editing screen.
 FIG. 9 is a schematic diagram showing an example of the content editing screen provided with a caption setting area 114. The caption setting area 114 of the content editing screen is an area for accepting settings related to the caption character string entered in the content editing area 112. The caption setting area 114 is provided with setting items such as "Font type", "Size", and "Position". In the illustrated example, the character string "Introduction to Business in the Digital Age" is entered as a caption in the text box indicated by the dashed rectangular frame in the content editing area 112. The information processing apparatus 1 accepts settings for this caption in each setting item of the caption setting area 114, and displays the caption in a display mode according to the accepted settings. The information processing apparatus 1 stores information such as the entered caption character string and the caption settings together with, for example, the background image data.
 Having performed the processing related to content editing in steps S6 to S9 of the flowchart shown in FIG. 4, the content data generation unit 11e of the processing unit 11 of the information processing apparatus 1 accepts, for example, a content data generation operation by the user, generates content data (step S10), stores the generated content data in the content data storage unit 12b of the storage unit 12, and ends the processing. At this time, the content data generation unit 11e integrates the audio data generated in step S5, the avatar video data generated in step S8, and the background image data generated in step S9 to generate content data in which the avatar superimposed on the background image speaks.
 In the present embodiment, the background image, various parts images placed on the background image, and the avatars can be arranged in a superimposed manner on the same layer or on different layers. The information processing apparatus 1 acquires, for example, image files of various parts to be included in the content, together with the presentation material, the text information, and the like. The information processing apparatus 1 arranges these various parts together with the avatar in appropriate positions and order, for example by accepting the user's operations on the content editing screen. The user can perform editing such as placing a lectern in front of the avatar or placing decorative parts between the avatar and the background, expressing depth and the like on the screen and enhancing the sense of presence of the avatar's presentation.
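 A minimal sketch of such layered composition, assuming a simple back-to-front (painter's algorithm) renderer, is shown below; all names and the z-order convention are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class LayerItem:
    name: str      # e.g. "background", "lectern", "avatar"
    image: str     # path to the image to draw
    z_order: int   # smaller values are drawn first (further back)

def render_scene(items: list[LayerItem]) -> list[str]:
    # Draw items back to front; an avatar given a z_order behind the background
    # is simply never visible, which realizes the hidden-narrator arrangement
    # described in the next paragraph.
    ordered = sorted(items, key=lambda item: item.z_order)
    return [item.name for item in ordered]  # stand-in for actual drawing calls

scene = render_scene([
    LayerItem("avatar", "avatar.png", 2),
    LayerItem("background", "slide1.png", 0),
    LayerItem("lectern", "desk.png", 1),
])
# scene == ["background", "lectern", "avatar"]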
 The user can also hide the avatar on the screen, for example by performing an operation that places the avatar behind the background. As a result, when the user wants to draw the audience's attention to, for example, the background image rather than the avatar, the lines of the hidden avatar can be output as narration, allowing the user to create more effective content.
 In the present embodiment, the text information acquired together with the presentation material may contain many texts that become the avatars' lines. In the present embodiment, the association between the many lines included in the text information and the avatars that speak them is made, for example, on the content editing screen. However, the information processing apparatus 1 may assign all the lines to one avatar selected in advance, for example when acquiring the presentation material and the text information. When the information processing apparatus 1 performs such batch assignment of lines, the user can generate content data without performing operations to assign each line to an avatar on the subsequent content editing screen or the like. After such batch assignment, however, the information processing apparatus 1 may still accept editing operations from the user, such as reassigning lines assigned to a first avatar to a second avatar on the content editing screen or the like.
 The information processing apparatus 1 that has generated the content data can reproduce the content data, for example using a video playback application program, and display it on the display unit 14. The information processing apparatus 1 may also upload the content data to, for example, a video distribution site.
<Another example of the speech editing screen>
 FIG. 10 is a schematic diagram showing another example of the speech editing screen. The information processing apparatus 1 according to the present embodiment may display the speech editing screen shown in FIG. 10 instead of, for example, the speech editing screen shown in FIG. 5. The speech editing screen shown in FIG. 10 is suitable for creating content data in which two avatars converse with each other.
 On the speech editing screen shown in FIG. 10, a title character string "Speech editing screen" is displayed at the top of the screen, and below this title character string the names of the two avatars, "Dr. Value" and "College", are displayed side by side. On the speech editing screen of this example, the left side of the screen is used as an area for displaying information such as the utterances of "Dr. Value", and the right side of the screen is used as an area for displaying information such as the utterances of "College".
 On the speech editing screen of this example, the text information spoken by each avatar is placed in a rectangular frame, and the plurality of pieces of text information are displayed in chronological order of utterance from the top to the bottom of the screen. The text information related to "Dr. Value" is displayed toward the left side of the screen, and the text information related to "College" is displayed toward the right side of the screen. The user can scroll through the chronologically ordered pieces of text information, for example by performing a vertical sliding operation, and can thereby check pieces of text information that do not fit on one screen. The user can also perform operations to freely edit the text information contained within each rectangular frame.
 On the speech editing screen of this example, one or more icons are provided within the rectangular frame containing each avatar's text information, for example in the lower right corner of the frame. In this figure, these icons are shown in simplified form as square figures. These icons accept various operations from the user, such as accepting settings for the corresponding text information, accepting an operation to output the corresponding text information as voice, or accepting an operation to delete the corresponding text information.
 On the speech editing screen of this example, a rectangular frame elongated in the horizontal direction, indicating the time setting of the interval placed between utterances, is displayed between the text information of two chronologically consecutive utterances. In this example, a rectangular frame bearing the character string "Interval: 0.5 seconds" is displayed between the text information "Hello. ... Nice to meet you." of "Dr. Value" and the text information "Dr. Value, ... happy ..." of "College". This indicates that an interval of 0.5 seconds, that is, a period during which neither avatar speaks, is placed between the utterance of "Dr. Value" and the utterance of "College". The user can set the interval time as desired by correcting the numerical value in the rectangular frame.
 In this manner, the information processing apparatus 1 according to the present embodiment displays a speech editing screen in which a plurality of pieces of text information spoken by the avatars are arranged in chronological order in the vertical direction of the screen, and the text information spoken by the two avatars is separated between the left and right sides of the screen. The user can thereby be expected to easily generate content data such as a moving image in which, for example, two avatars give a presentation while conversing.
 In this example, the case where two avatars speak has been described, but the configuration is not limited to this, and a similar configuration can be applied when three or more avatars speak. For example, when three avatars speak, the speech editing screen can be divided into three areas on the left, in the center, and on the right, each area can be associated with one of the avatars, and the text information of the utterances can be displayed in chronological order.
 The information processing apparatus 1 may also display a speech editing screen in which the plurality of pieces of text information spoken by the avatars are arranged in chronological order in the horizontal direction of the screen, and the text information spoken by the two avatars is separated between the upper and lower parts of the screen.
<Camera work settings>
 The information processing apparatus 1 according to the present embodiment may use a three-dimensional model, that is, an object of a three-dimensional character reproduced in a three-dimensional virtual space, as the avatar displayed in the content data. The information processing apparatus 1 reads the data of a three-dimensional avatar model created in advance or newly created by the user, and reproduces this avatar in the three-dimensional virtual space. The information processing apparatus 1 can generate content data by acquiring two-dimensional images obtained by photographing the avatar with a virtual camera appropriately placed in the three-dimensional virtual space.
 In order to capture the images (moving images) of the avatar to be included in the content data, the information processing apparatus 1 accepts from the user settings related to the position of the virtual camera in the three-dimensional virtual space, that is, the camera work. The information processing apparatus 1 accepts from the user settings such as, for example, the front-back, left-right, and up-down position (x, y, and z coordinates) in the three-dimensional virtual space, the orientation of the virtual camera from this position, and the changes of this position and orientation over time. The information processing apparatus 1 places the virtual camera in the three-dimensional virtual space according to the accepted settings, moves the virtual camera to photograph the avatar, and acquires two-dimensional images to be included in the content data.
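 As an illustrative sketch only, such camera-work settings could be stored as time-stamped keyframes and interpolated while the avatar is filmed; the structure and the choice of linear interpolation below are assumptions and not part of the embodiment.

from dataclasses import dataclass

@dataclass
class CameraKeyframe:
    time: float                            # seconds from the start of the shot
    position: tuple[float, float, float]   # x, y, z in the virtual space
    yaw_pitch: tuple[float, float]         # orientation of the virtual camera

def camera_at(keyframes: list[CameraKeyframe], t: float) -> CameraKeyframe:
    # Linear interpolation between the two keyframes surrounding time t,
    # assuming strictly increasing keyframe times.
    keyframes = sorted(keyframes, key=lambda k: k.time)
    for a, b in zip(keyframes, keyframes[1:]):
        if a.time <= t <= b.time:
            w = (t - a.time) / (b.time - a.time)
            lerp = lambda u, v: tuple(x + (y - x) * w for x, y in zip(u, v))
            return CameraKeyframe(t, lerp(a.position, b.position), lerp(a.yaw_pitch, b.yaw_pitch))
    return keyframes[0] if t < keyframes[0].time else keyframes[-1]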
 FIG. 11 is a schematic diagram for explaining an example of a camera work setting method. For example, when a predetermined operation, such as a right-click of the mouse, is performed on the avatar displayed in the content editing area 112 of the content editing screen shown in FIG. 8, the information processing apparatus 1 displays the shot selection menu dialog box shown in the upper part of FIG. 11. In the shot selection menu, for example, selection items such as "Head shot", "Upper-body shot", and "Full-body shot" are displayed one above the other. The user can select any one of these selection items.
 When, for example, "Head shot" is selected in the shot selection menu, the information processing apparatus 1 places the virtual camera at a position close to the avatar in the three-dimensional virtual space so as to display the avatar's head and its surrounding parts, as shown on the lower left of FIG. 11. Similarly, when, for example, "Upper-body shot" or "Full-body shot" is selected, the information processing apparatus 1 places the virtual camera at a position suitable for capturing the avatar's upper body or whole body, as shown in the lower center or lower right of FIG. 11.
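 One possible way to realize this menu is a simple mapping from the selected shot type to a camera distance and height relative to the avatar; the preset values and names below are illustrative assumptions.

# Hypothetical shot presets: distance from the avatar and camera height,
# expressed in the coordinate system of the virtual space.
SHOT_PRESETS = {
    "head":       {"distance": 0.6, "height": 1.6},   # frame the head and shoulders
    "upper_body": {"distance": 1.5, "height": 1.3},   # frame from the waist up
    "full_body":  {"distance": 3.0, "height": 1.0},   # frame the whole avatar
}

def place_camera_for_shot(shot: str, avatar_position: tuple[float, float, float]) -> dict:
    preset = SHOT_PRESETS[shot]
    x, y, z = avatar_position
    # Place the camera in front of the avatar at the preset distance and height,
    # looking back toward the avatar.
    return {"position": (x, preset["height"], z + preset["distance"]),
            "look_at": (x, preset["height"], z)}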
 図12は、カメラワークの設定方法の一例を説明するための模式図である。例えば図8に示したコンテンツ編集画面においてコンテンツ編集領域112に表示されたアバターに対する所定の操作、例えばマウスの右クリック操作などが行われた場合、情報処理装置1は、図12の上段に示す方向選択メニューのダイアログボックスを表示する。方向選択メニューには、例えば、「左側」、「正面」及び「右側」等の選択項目が上下に並べて表示される。ユーザは、これら複数の選択項目の中からいずれか1つを選択することができる。 FIG. 12 is a schematic diagram for explaining an example of a camerawork setting method. For example, when a predetermined operation, such as a right-click operation of a mouse, is performed on the avatar displayed in the content editing area 112 on the content editing screen shown in FIG. Show the selection menu dialog box. In the direction selection menu, for example, selection items such as "left", "front" and "right" are displayed vertically. The user can select any one of these multiple selection items.
 When, for example, "Left" is selected in the direction selection menu, the information processing apparatus 1 places the virtual camera on the left side of the avatar and photographs the avatar with the virtual camera, as shown on the lower left of FIG. 12. Similarly, when, for example, "Front" or "Right" is selected, the information processing apparatus 1 places the virtual camera in front of or to the right of the avatar, as shown in the lower center or lower right of FIG. 12. In this example, the left and right directions used for the setting are the left and right directions as seen from the virtual camera toward the avatar, but this is not a limitation, and the left and right directions as seen from the avatar may be used instead.
 FIG. 13 is a schematic diagram for explaining yet another example of a camera work setting method. For example, when a predetermined operation, such as selecting the avatar with a left mouse click, is performed on the avatar displayed in the content editing area 112 of the content editing screen shown in FIG. 8, the information processing apparatus 1 displays, near the avatar, a slide bar (slider, slider bar, scroll bar, etc.) for making settings related to zooming.
 In the example shown in FIG. 13, a slide bar elongated in the horizontal direction is displayed below the avatar, and the user can perform an operation of sliding the knob of the slide bar in the horizontal direction. When, for example, the knob of the slide bar is slid to the left, the information processing apparatus 1 moves the virtual camera away from the avatar (zooms out), as shown on the left side of FIG. 13. When, for example, the knob of the slide bar is slid to the right, the information processing apparatus 1 moves the virtual camera closer to the avatar (zooms in), as shown on the right side of FIG. 13.
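 A minimal sketch of mapping the slider position to the camera distance is shown below; the value range and the linear mapping are assumptions made for illustration.

def camera_distance_from_slider(slider: float,
                                min_distance: float = 0.5,
                                max_distance: float = 5.0) -> float:
    # slider is 0.0 at the left end (zoomed out) and 1.0 at the right end (zoomed in).
    slider = max(0.0, min(1.0, slider))
    # Sliding to the right decreases the distance to the avatar, i.e. zooms in.
    return max_distance - slider * (max_distance - min_distance)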
 In this manner, the information processing apparatus 1 according to the present embodiment accepts settings such as the position and orientation of the virtual camera that photographs the avatar placed in the three-dimensional virtual space, photographs the avatar with the virtual camera using camera work according to the accepted settings, and generates content data including the two-dimensional images of the avatar obtained by the photographing. In the information processing system according to the present embodiment, this allows the user to easily set the orientation, size, and the like of the avatar displayed in the content data.
<Avatar according to region or age>
 The information processing apparatus 1 according to the present embodiment changes the behavior of the avatar included in the generated content data according to the region to which the content data is provided, the age of the people to whom it is provided, and the like. For this purpose, the information processing apparatus 1 accepts from the user settings such as the region to which the content data is provided or the age of the people to whom it is provided. The region for which the information processing apparatus 1 accepts a setting may be, for example, a country, a prefecture, or a state. The information processing apparatus 1 may accept an approximate age group as the setting, such as the twenties or the thirties, may accept an age range by numerical input, such as 25 to 40 years old, or may accept the age setting by other methods.
 When the language spoken by the avatar is English, the information processing apparatus 1 presents the user with options such as the United States, the United Kingdom, and Australia as regions, and accepts the region setting from among them. English pronunciation, accent, and the like differ from region to region, and the information processing apparatus 1 converts the text information into voice so that the avatar speaks with the pronunciation, accent, and the like corresponding to the set region.
 When the language spoken by the avatar is Japanese, the information processing apparatus 1 presents the user with names of regions, such as the Kanto region, the Kansai region, and the Tohoku region, as options, and accepts the selection of a region from among them. The information processing apparatus 1 may also present the user with dialect names, such as standard Japanese, the Kansai dialect, and the Tohoku dialect, as region options. The information processing apparatus 1 converts the text information into voice so that the avatar speaks with the pronunciation, accent, and the like corresponding to the dialect of the set region.
 When the conversion from text information to audio data is performed using a learning model generated by machine learning, for example, the information processing apparatus 1 prepares learning models that have each learned the pronunciation, accent, and the like of a particular region, and can convert the text information into audio data by switching between the learning models according to the region set by the user.
 In Japanese dialects, for example, the names of things themselves may differ. The information processing apparatus 1 therefore changes the phrases, words, and the like included in the text information that the avatar outputs as speech, according to the set region. The information processing apparatus 1 has, for example, a database that associates phrases or words that may be included in the text information with their expressions in each region, replaces the phrases, words, and the like included in the text information with expressions suited to the region based on this database and the region set by the user, and generates audio data based on the text information after the replacement.
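 A minimal sketch of such region-dependent word substitution is shown below, with a small in-memory table standing in for the database; the entries and names are illustrative assumptions.

# Hypothetical mapping: standard-Japanese word -> expression used in each region.
REGIONAL_EXPRESSIONS = {
    "とても": {"関西": "めっちゃ"},
    "ありがとう": {"関西": "おおきに"},
}

def localize_text(text: str, region: str) -> str:
    # Replace each registered word with the expression for the chosen region,
    # leaving it unchanged when no regional expression is registered.
    for standard, by_region in REGIONAL_EXPRESSIONS.items():
        if region in by_region:
            text = text.replace(standard, by_region[region])
    return text

localized = localize_text("とてもありがとう。", "関西")  # -> "めっちゃおおきに。"
# The localized text is then converted into audio data by text-to-speech processing.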
 Also, for example, young people and elderly people may use different names or expressions for the same thing. The information processing apparatus 1 may therefore change the phrases, words, and the like included in the text information to be output as voice, according to the set age group.
 Depending on the age of the users who view the content data, there are also differences in the volume and speed of speech that they can comfortably follow. For example, when the viewing users are elderly, it is preferable for the avatar to speak loudly and slowly. The information processing apparatus 1 according to the present embodiment therefore generates the content data while changing the volume, speed, and the like of the avatar's speech according to the set age group.
 In the information processing system according to the present embodiment, the avatar can also be made to perform gestures in the content data. However, the same gesture may have different meanings in different countries. The information processing apparatus 1 according to the present embodiment therefore generates the content data after changing the gesture that the user has set the avatar to perform into a gesture corresponding to the region set by the user.
 In this manner, the information processing apparatus 1 according to the present embodiment accepts from the user settings such as the region or age group to which the content data is provided, and generates content data in which the avatar speaks with pronunciation, accent, phrases, words, volume, speed, and the like corresponding to the set region or age group, or in which the avatar performs gestures corresponding to the set region or age group. The user can thereby be expected to easily generate content data intended for a different region, age group, and the like, based on, for example, content data generated for a specific region and age group.
 In the present embodiment, the information processing device 1 accepts settings for the region or age group to which the content data is provided, but it may also accept other settings about the intended audience, such as gender, religion, industry, or field, and reflect them in the generation of the content data.
<Sound image according to avatar display position>
 As described above, the content data generated by the information processing device 1 according to the present embodiment includes, for example, an image (moving image) in which a presentation material serves as the background image and an avatar is placed in front of that background image, together with the voice spoken by the avatar. The user can freely set where the avatar is placed on the screen displayed by playing back the content data. The information processing device 1 according to the present embodiment can set the sound image of the voice spoken by the avatar according to the avatar's display position.
 A sound image is, for example, the location or direction at which a user who plays back the content data and hears the sound perceives the sound source to be. In the present embodiment, the positions at which the avatar can be displayed on the playback screen are divided into three, namely left, center, and right, and the information processing device 1 sets the sound image of the avatar's speech to one of these three positions according to the avatar's display position.
 For example, when the avatar's display position is the left side, the information processing device 1 sets the ratio of the output levels of the left channel (L) and the right channel (R) of the stereo audio data included in the content data to R:L = 2:1. The user playing back the content data can thereby be expected to perceive the sound source of the avatar's speech as being on the left side. When the avatar's display position is the center, the information processing device 1 sets the left/right output level ratio of the audio data to R:L = 1:1, and when the display position is the right side, it sets the ratio to R:L = 1:2.
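 A minimal sketch of this level-based panning follows, assuming the two channels are held as NumPy arrays. The per-position ratios are taken as written in the passage above; the gain normalization is an added assumption.

import numpy as np

# Output-level ratios (R, L) per display position, as given in the passage above.
PAN_RATIOS = {"left": (2.0, 1.0), "center": (1.0, 1.0), "right": (1.0, 2.0)}  # (R, L)

def pan_stereo(left: np.ndarray, right: np.ndarray, position: str):
    """Scale the stereo channels so the avatar's voice is localized according to its display position."""
    r_gain, l_gain = PAN_RATIOS[position]
    total = r_gain + l_gain
    return left * (l_gain / total), right * (r_gain / total)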
 In the present embodiment, the information processing device 1 sets the sound image by adjusting the left and right output levels of the stereo audio, but the method of setting the sound image is not limited to this, and any method may be adopted.
 For example, the information processing device 1 can set the sound image according to the avatar's display position by means of a sound image localization technique that uses a head-related transfer function (HRTF). FIG. 14 is a schematic diagram explaining the outline of this technique. The technique uses, for example, FIR (Finite Impulse Response) filters 121 to 124. The right channel (R) of the stereo audio is input to the two FIR filters 121 and 122, and the left channel (L) is input to the two FIR filters 123 and 124. The sum of the right-channel (R) audio processed by the FIR filter 121 and the left-channel (L) audio processed by the FIR filter 123 is output as the new right channel (R'). Likewise, the sum of the right-channel (R) audio processed by the FIR filter 122 and the left-channel (L) audio processed by the FIR filter 124 is output as the new left channel (L').
 By appropriately adjusting the parameters of the FIR filters 121 to 124 according to, for example, the avatar's display position, the information processing device 1 can adjust the position of the sound image of the avatar's speech. The information processing device 1 may also create and store in advance multiple sets of parameters for the FIR filters 121 to 124, each associated with one of the positions at which the avatar can be displayed, and read out and use the parameter set corresponding to the set display position. The parameters of the FIR filters 121 to 124 can be determined using, for example, a head-related transfer function. Since sound image localization using head-related transfer functions is an existing technique, a detailed description is omitted here.
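 A minimal sketch of the four-filter mixing structure of FIG. 14 follows, assuming the FIR coefficients have already been derived from a head-related transfer function for the chosen avatar position. SciPy is used here only as one possible filtering tool; the patent does not name a library.

import numpy as np
from scipy.signal import lfilter

def localize(right: np.ndarray, left: np.ndarray,
             h121: np.ndarray, h122: np.ndarray,
             h123: np.ndarray, h124: np.ndarray):
    """Mix the stereo channels through the FIR filters 121-124 to form the new R' and L' channels."""
    r_new = lfilter(h121, [1.0], right) + lfilter(h123, [1.0], left)   # filters 121 + 123 -> R'
    l_new = lfilter(h122, [1.0], right) + lfilter(h124, [1.0], left)   # filters 122 + 124 -> L'
    return r_new, l_new

# One parameter set (h121..h124) would be stored per avatar display position and
# looked up when the position is set, as described in the text.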
 The information processing device 1 generates content data that includes the two outputs above, that is, the audio of the new left channel (L') and the new right channel (R'). A user who plays back this content data can thus hear the avatar's speech with a sound image that matches the display position.
<Linking the avatar's facial expression, voice, and gestures>
 As shown in FIG. 5, for example, the information processing system according to the present embodiment allows the user to set the avatar's facial expression. The information processing device 1 according to the present embodiment may adjust the pitch, volume, and the like of the voice spoken by the avatar according to the facial expression set by the user.
 For example, when "smile" is set as the avatar's facial expression, the information processing device 1 raises both the pitch and the volume of the voice; when "angry face" is set, it lowers the pitch and raises the volume. The information processing device 1 stores in a database, for example, the correspondence between avatar facial expressions and the amounts by which pitch and volume are increased or decreased, and retrieves these amounts from the database according to the facial expression set by the user. By applying the retrieved amounts to default values defined for the pitch and volume of the avatar's speech, the information processing device 1 can adjust the pitch and volume according to the facial expression.
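 A minimal sketch of this lookup-and-apply step follows; the offset values and default levels are illustrative assumptions, not figures from the patent.

# Pitch/volume offsets per facial expression (illustrative values only).
EXPRESSION_OFFSETS = {"smile": (+2.0, +3.0), "angry": (-2.0, +3.0)}   # (pitch delta, volume delta)
DEFAULT_PITCH, DEFAULT_VOLUME = 0.0, 0.0

def voice_params(expression: str):
    """Apply the expression-specific offsets to the default pitch and volume of the avatar's speech."""
    dp, dv = EXPRESSION_OFFSETS.get(expression, (0.0, 0.0))
    return DEFAULT_PITCH + dp, DEFAULT_VOLUME + dv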
 The information processing device 1 may also determine the avatar's facial expression based on characteristics of the text information, for example when the user selects automatic setting for the avatar's facial expression. For example, the information processing device 1 determines whether the text information spoken by the avatar contains a specific word, keyword, or the like, and if so, sets the facial expression associated with that word or keyword as the avatar's facial expression. For example, if the text information contains a word such as "happy" or "delicious", the information processing device 1 can set the avatar's facial expression to "smile" and raise the pitch and volume of the avatar's speech. The information processing device 1 holds, for example, a database that associates specific words or keywords that may appear in text information with avatar facial expressions.
 When the text information contains a specific word, keyword, or the like, the information processing device 1 may also automatically set a gesture so that the avatar performs the associated gesture when uttering that word or keyword, as sketched below. For example, if the text information contains the word "Wow", the information processing device 1 can make the avatar open its mouth and eyes wide and move its hands; if the text information contains the word "No", it can make the avatar shake its head. The information processing device 1 holds, for example, a database that associates specific words or keywords that may appear in text information with avatar gestures.
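 The keyword-driven assignment of expressions and gestures described in the last two paragraphs can be sketched as a simple dictionary lookup; the keyword lists and motion labels below are assumptions chosen for illustration.

# Illustrative keyword tables; the actual database contents are not specified in the patent.
KEYWORD_EXPRESSIONS = {"うれしい": "smile", "美味しい": "smile"}
KEYWORD_GESTURES = {"Wow": "eyes_wide_hands_up", "No": "shake_head"}

def annotate_line(text: str):
    """Return (expression, gesture) hints for one line of narration text, or None where no keyword matches."""
    expression = next((e for k, e in KEYWORD_EXPRESSIONS.items() if k in text), None)
    gesture = next((g for k, g in KEYWORD_GESTURES.items() if k in text), None)
    return expression, gesture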
 The information processing device 1 may also determine the avatar's facial expression from the text information by using a learning model that has been trained in advance by machine learning to estimate an emotion from input text. For example, by performing supervised learning on the model with training data (teacher data) that associates text with emotions, a learning model that estimates an emotion for input text can be generated. The information processing device 1 inputs the text information corresponding to the avatar's utterance into this learning model, obtains the emotion estimation result output by the model, and sets the facial expression associated with that emotion as the avatar's facial expression. In this case, the information processing device 1 may input into the learning model, for example, all of the text information prepared for generating one piece of content data, each sentence contained in the text information, each group of sentences up to an interval, or text information in any other unit or amount. The machine learning process that generates the learning model may also be performed by a device other than the information processing device 1.
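 As one way to realize the supervised emotion estimation described above, the following sketch trains a small text classifier; the model family, the placeholder training pairs, and the emotion-to-expression mapping are all assumptions for illustration, not details from the patent.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder (text, emotion) training pairs; real teacher data would be far larger.
train_texts = ["とてもうれしい結果になりました", "残念ながら目標に届きませんでした"]
train_labels = ["joy", "sadness"]

emotion_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),   # character n-grams avoid the need for tokenization
    LogisticRegression(),
)
emotion_model.fit(train_texts, train_labels)

EXPRESSION_FOR_EMOTION = {"joy": "smile", "sadness": "sad"}

def expression_for(text: str) -> str:
    """Estimate the emotion of a narration sentence and return the associated avatar expression."""
    return EXPRESSION_FOR_EMOTION[emotion_model.predict([text])[0]]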
 As described above, the information processing device 1 according to the present embodiment generates content data in which the pitch or volume of the avatar's speech is adjusted according to the settings for the avatar's facial expression. The information processing device 1 also estimates, from the text information corresponding to the avatar's utterance, the emotion of a person who would speak that text, and sets the avatar's facial expression according to the estimated emotion. The information processing device 1 further determines characteristics of the sentences contained in the text information, such as whether they contain a specific word or keyword, and generates content data in which the avatar speaks at a pitch or volume corresponding to the determined characteristics. The information processing device 1 also stores in a database the correspondence between words that may appear in the text information and avatar facial expressions or gestures, and generates content data in which the avatar makes the corresponding facial expression or gesture when uttering such a word. In these ways, the information processing device 1 according to the present embodiment can be expected to link the avatar's facial expression displayed in the content data, the pitch and volume of its speech, its gestures, and the content of its utterances.
<Content switching>
 The content data generated by the information processing device 1 described above is so-called moving-image content that the viewer simply watches. However, the information processing device 1 according to the present embodiment may also accept information input from the viewer partway through the content, for example, and generate interactive content whose continuation switches according to the input information.
 In interactive content, for example, a test for checking proficiency is given partway through the output of a moving image such as a lecture, and the moving image output next is switched according to the test score or the like. For this purpose, the information processing device 1 accepts from the user, for example, operations for creating the test questions and their answers, and creates content that presents the questions to the viewer and accepts the viewer's answers. The information processing device 1 also accepts from the user settings such as the scoring method based on the viewer's answers and the switching conditions for switching between multiple pieces of content according to the scoring result.
 FIG. 15 is a schematic diagram explaining an example of how content switching is set up. In the example shown in FIG. 15, content switching is configured on the speech editing screen. On the speech editing screen shown, the line spoken by the avatar, "Hello. Now let's take a proficiency test.", is set as the first item of the content, and "<Proficiency check test>" is set as the second item. In the generated content data, the proficiency check test created in advance is therefore carried out after the avatar utters the first line.
 The proficiency check test carried out at this point is created in advance by the user, for example on a test content creation screen displayed separately by the information processing device 1. Although not illustrated, on the test content creation screen, when a four-choice question is set as a test, for example, the device accepts from the user the question text, the text of the four choices, and the answer indicating which choice is correct, and the information processing device 1 generates the test content based on the accepted information. The user can give the test content any name; in this example it is named "Proficiency check test". When the test contains multiple questions, the information processing device 1 may also accept from the user settings such as the points allotted to each question and the formula for calculating the total score, and generate the test content accordingly.
 On the speech editing screen shown, a content-switching condition such as "Branch if score<80 goto No.11" is set as the third item following the "<Proficiency check test>" item. In this example, the score of the proficiency check test is stored in the variable score, and the content is set to branch to its 11th item when the score is less than 80. The way the branch condition is written in FIG. 15 is only an example, and content switching may be specified in any format.
 In this example, if the score of the proficiency check test set as the third item is 80 points or higher, the content corresponding to the following fourth item is output, in which the avatar says "Then let's move on.". If the score is below 80 points, the fourth through tenth items are not output, and the content corresponding to the 11th item is output instead, in which the avatar says "We will now start the remedial course.".
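 A minimal sketch of evaluating such a branch item during playback follows. The item syntax mirrors the example above, but the parser itself is an assumption about one possible implementation, not the format actually used by the system.

import re

BRANCH_RE = re.compile(r"Branch if (\w+)\s*(<=|>=|<|>)\s*(\d+)\s+goto No\.(\d+)")

def next_item(branch_item: str, variables: dict, fallthrough: int) -> int:
    """Evaluate a branch item and return the index of the next content item to play."""
    m = BRANCH_RE.match(branch_item)
    if not m:
        return fallthrough
    name, op, threshold, target = m.group(1), m.group(2), int(m.group(3)), int(m.group(4))
    value = variables.get(name, 0)
    taken = {"<": value < threshold, ">": value > threshold,
             "<=": value <= threshold, ">=": value >= threshold}[op]
    return target if taken else fallthrough

# With a test score of 65 the playback jumps to item 11; with 90 it falls through to item 4.
print(next_item("Branch if score<80 goto No.11", {"score": 65}, fallthrough=4))   # -> 11
print(next_item("Branch if score<80 goto No.11", {"score": 90}, fallthrough=4))   # -> 4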
<Materials>
 FIGS. 16 to 44 are materials related to the information processing system according to the present embodiment.
<Summary>
 The information processing device 1 according to the present embodiment configured as above acquires presentation materials created in advance, acquires text information related to the presentation speech, accepts settings for the presenter's avatar, and generates content data in which the avatar is displayed together with the presentation materials and speaks the voice corresponding to the text information. The information processing device 1 can thereby be expected to support the user in generating content data for a presentation.
 The information processing device 1 according to the present embodiment also acquires voice information of the presenter's presentation and obtains the text information by converting the acquired voice information. The information processing device 1 can thereby be expected to reduce the user's burden of creating the text information.
 The information processing device 1 according to the present embodiment also accepts settings for the pronunciation of words contained in the text information and generates content data in which the avatar utters those words with the accepted pronunciation. In doing so, the information processing device 1 displays the text information, accepts the user's selection of a word, and displays the phonetic notation of the selected word. The information processing device 1 also accepts corrections to the displayed phonetic notation and has the avatar utter the word with the corrected phonetic notation. The information processing device 1 can thereby be expected to make it easier for the user to set the pronunciation of words.
 The information processing device 1 according to the present embodiment also accepts, for the multiple texts contained in the acquired text information, settings for the interval at which the voice corresponding to each text is output or for which avatar speaks each text, and generates content data in which the avatar speaks the texts according to the accepted settings. The information processing device 1 also accepts settings for the avatar's facial expression or gesture when speaking a text and generates content data in which the avatar speaks with the facial expression or gesture according to the accepted settings. The information processing device 1 can thereby be expected to make the user's avatar setting operations easier.
 The information processing device 1 according to the present embodiment also displays a content editing screen that includes a content editing area 112 (first area) in which the avatar is displayed superimposed on a background image based on the presentation materials, an avatar setting area 113 (second area) that displays setting items for the avatar shown in the content editing area 112, and a background image selection area 111 (third area) that displays the multiple images contained in the presentation materials. The avatar setting area 113 provides, for example, setting items for the gestures the avatar performs, the avatar's position, the avatar's orientation, and the avatar's size. The avatar setting area 113 may also provide setting items such as the pitch, speed, depth, or volume of the voice spoken by the avatar. The information processing device 1 can thereby be expected to make it easier for the user to configure the background image based on the presentation materials and the avatar.
 The information processing device 1 according to the present embodiment also accepts input of a caption character string to be displayed together with the background image based on the presentation materials, and generates content data in which the accepted character string is displayed together with that background image. The information processing device 1 can thereby generate content data to which various information is added in addition to the presentation materials and the avatar's speech.
 The screen configurations and the like shown in FIGS. 5 to 9 in the present embodiment are merely examples and are not limiting. The texts, images, and the like shown on these screens are likewise examples and are not limiting.
 The embodiments disclosed herein are illustrative in all respects and should not be considered restrictive. The scope of the present invention is indicated not by the meaning described above but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.
 The computer program may be executed on a single computer, or may be deployed to be executed on multiple computers located at one site or distributed across multiple sites and interconnected by a communication network.
 The matters described in the embodiments can be combined with one another. The independent and dependent claims recited in the claims can also be combined with one another in any and all combinations, regardless of how they are referenced. Furthermore, although the claims use a format in which a claim refers to two or more other claims (multi-claim format), this is not limiting; a format in which a multi-claim refers to at least one other multi-claim (multi-multi-claim) may also be used.
 1 information processing device
 11 processing unit
 11a information acquisition unit
 11b avatar data generation unit
 11c audio data generation unit
 11d background data generation unit
 11e content data generation unit
 11f display processing unit
 12 storage unit
 12a program
 12b content data storage unit
 13 communication unit
 14 display unit
 15 operation unit
 99 recording medium
 101 setting table
 111 background image selection area
 112 content editing area
 113 avatar setting area
 114 caption setting area

Claims (23)

  1.  A computer program causing a computer to execute a process of:
     acquiring a presentation material;
     acquiring text information related to presentation speech;
     accepting a setting related to an avatar of a presenter; and
     generating content data in which the avatar is displayed together with the presentation material and the avatar utters a voice corresponding to the text information.
  2.  The computer program according to claim 1, wherein voice information related to the presenter's presentation is acquired, and the text information is acquired by converting the voice information.
  3.  The computer program according to claim 1, wherein a setting related to the pronunciation of a word included in the text information is accepted, and the content data is generated in which the avatar utters the word with a pronunciation according to the accepted setting.
  4.  The computer program according to claim 3, wherein the text information is output, a selection of a word included in the output text information is accepted, and a phonetic notation of the selected word is output.
  5.  The computer program according to claim 4, wherein a correction to the phonetic notation is accepted, and the word is output as speech with the corrected phonetic notation.
  6.  The computer program according to claim 1, wherein the text information includes a plurality of utterance texts, a setting of an interval for outputting the voice corresponding to each utterance text or a setting of an avatar that utters each utterance text is accepted, and the content data is generated according to the accepted setting.
  7.  The computer program according to claim 6, wherein a setting of a facial expression or gesture of the avatar when uttering the utterance text is accepted, and the content data is generated in which the avatar utters the utterance text with a facial expression or gesture according to the accepted setting.
  8.  The computer program according to claim 1, wherein a screen is displayed that includes a first area in which the presentation material and the avatar are displayed, a second area in which setting items related to the avatar displayed in the first area are displayed, and a third area in which a plurality of images included in the presentation material are displayed as candidates to be displayed in the first area.
  9.  The computer program according to claim 8, wherein the second area displays a setting item for a gesture performed by the avatar, a setting item related to the position of the avatar, a setting item related to the orientation of the avatar, or a setting item related to the size of the avatar.
  10.  The computer program according to claim 1, wherein a setting of the pitch, speed, depth, or volume of the voice uttered by the avatar is accepted, and the content data is generated in which the avatar utters a voice according to the accepted setting.
  11.  The computer program according to claim 1, wherein input of character information to be displayed together with the presentation material is accepted, and the content data is generated in which the accepted character information is displayed together with the presentation material.
  12.  The computer program according to claim 1, wherein a text information editing screen is displayed that shows a plurality of pieces of text information uttered by the avatar arranged in chronological order in the vertical direction of the screen, and shows the pieces of text information uttered by two avatars separated to the left and right sides of the screen.
  13.  The computer program according to claim 1, wherein a setting related to a shooting position of a virtual camera with respect to an avatar arranged in a three-dimensional virtual space is accepted, and the content data is generated including an image of the avatar captured by the virtual camera according to the accepted setting.
  14.  The computer program according to claim 1, wherein a setting related to a region or age group is accepted, and the content data is generated in which the avatar performs a gesture according to the region or age group.
  15.  The computer program according to claim 1, wherein a setting related to a region or age group is accepted, and the content data is generated in which the avatar speaks with words, an accent, or a speed according to the region or age group.
  16.  The computer program according to claim 1, wherein the content data is generated with the left and right output levels of stereo sound adjusted according to the display position of the avatar.
  17.  The computer program according to claim 1, wherein a sound image related to the avatar's speech is set according to the display position of the avatar.
  18.  The computer program according to claim 1, wherein a setting related to the facial expression of the avatar is accepted, and the content data is generated in which the avatar speaks with a voice pitch or volume according to the facial expression.
  19.  The computer program according to claim 18, wherein an emotion is estimated based on the text information, and the facial expression of the avatar is set according to the estimated emotion.
  20.  The computer program according to claim 1, wherein a feature of a sentence included in the text information is determined, and the content data is generated in which the avatar speaks with a voice pitch or volume according to the feature.
  21.  The computer program according to claim 1, wherein a correspondence between a predetermined word and a facial expression or gesture of the avatar is stored, and the content data is generated in which the avatar makes the corresponding facial expression or gesture when uttering the predetermined word.
  22.  An information processing method in which an information processing device:
     acquires a presentation material;
     acquires text information related to presentation speech;
     accepts a setting related to an avatar of a presenter; and
     generates content data in which the avatar is displayed together with the presentation material and the avatar utters a voice corresponding to the text information.
  23.  An information processing device comprising a processing unit, wherein the processing unit:
     acquires a presentation material;
     acquires text information related to presentation speech;
     accepts a setting related to an avatar of a presenter; and
     generates content data in which the avatar is displayed together with the presentation material and the avatar utters a voice corresponding to the text information.
PCT/JP2023/007458 2022-03-01 2023-03-01 Computer program, information processing method, and information processing device WO2023167212A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022030928 2022-03-01
JP2022-030928 2022-03-01

Publications (1)

Publication Number Publication Date
WO2023167212A1 true WO2023167212A1 (en) 2023-09-07

Family

ID=87883865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/007458 WO2023167212A1 (en) 2022-03-01 2023-03-01 Computer program, information processing method, and information processing device

Country Status (1)

Country Link
WO (1) WO2023167212A1 (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09325787A (en) * 1996-05-30 1997-12-16 Internatl Business Mach Corp <Ibm> Voice synthesizing method, voice synthesizing device, method and device for incorporating voice command in sentence
JP2008180942A (en) * 2007-01-25 2008-08-07 Xing Inc Karaoke system
WO2012147274A1 (en) * 2011-04-26 2012-11-01 Necカシオモバイルコミュニケーションズ株式会社 Input assistance device, input asssistance method, and program
JP2019179064A (en) * 2018-03-30 2019-10-17 日本放送協会 Voice synthesizing device, voice model learning device, and program therefor
JP2019208138A (en) * 2018-05-29 2019-12-05 住友電気工業株式会社 Utterance recognition device and computer program
US20200034025A1 (en) * 2018-07-26 2020-01-30 Lois Jean Brady Systems and methods for multisensory semiotic communications
WO2020095784A1 (en) * 2018-11-06 2020-05-14 日本電気株式会社 Display control device, display control method, and nontemporary computer-readable medium in which program is stored
JP2020076912A (en) * 2018-11-09 2020-05-21 稔高 小田原 Arithmetic calculation practice support device and arithmetic calculation practice support program
JP2020112895A (en) * 2019-01-08 2020-07-27 ソフトバンク株式会社 Control program of information processing apparatus, control method of information processing apparatus, and information processing apparatus
JP2021018472A (en) * 2019-07-17 2021-02-15 株式会社デンソー Information processing system


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23763469

Country of ref document: EP

Kind code of ref document: A1