WO2023167212A1 - Computer program, information processing method, and information processing device - Google Patents

Computer program, information processing method, and information processing device

Info

Publication number
WO2023167212A1
Authority
WO
WIPO (PCT)
Prior art keywords
avatar
information processing
information
text
computer program
Prior art date
Application number
PCT/JP2023/007458
Other languages
French (fr)
Japanese (ja)
Inventor
公之 茶谷
雅丈 豊田
タン マウンマウン
康貴 朝倉
直樹 千葉
Original Assignee
株式会社KPMG Ignition Tokyo
Priority date
Filing date
Publication date
Application filed by 株式会社KPMG Ignition Tokyo
Publication of WO2023167212A1 publication Critical patent/WO2023167212A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output

Definitions

  • the present invention relates to a computer program, an information processing method, and an information processing apparatus for generating content data for presentation.
  • Patent Document 1 proposes a presentation system related to large-scale repair work for collective housing.
  • This presentation system uses illustrated material data consisting of a combination of text and still images, survey situation video data recording the actual preliminary survey of the construction target property, and simulated experience video data recording the actual construction situation of a pseudo-construction property similar to the construction target property.
  • an illustrated material signal is generated based on the illustrated material data and transmitted to the display device, and a survey situation video signal and a simulated experience video signal are generated based on the survey situation video data and the simulated experience video data and transmitted to the display device.
  • when a presenter makes a presentation to a plurality of audiences, such as a presentation to a customer, the presenter conventionally prepares in advance presentation materials composed of images of multiple pages, sequentially displays the created presentation materials on a display or projector, and explains the information on the displayed page and the like. In recent years, presenters have devised various techniques, such as outputting moving images and sound in addition to displaying still images at the time of presentation. Such a presentation, however, requires the presenter to prepare various data such as still images, moving images, and voices in advance, which is not something that anyone can easily do.
  • the present invention has been made in view of such circumstances, and its object is to provide a computer program, an information processing method, and an information processing apparatus that can be expected to support the generation of content data for presentation.
  • a computer program causes a computer to execute a process of acquiring presentation materials, acquiring text information related to presentation audio, receiving settings related to a presenter's avatar, and generating content data in which the avatar is displayed together with the presentation materials and utters a voice corresponding to the text information.
  • it can be expected to support generation of content data for presentation.
  • FIG. 1 is a schematic diagram for explaining an overview of an information processing system according to an embodiment;
  • FIG. 2 is a schematic diagram for explaining an overview of an information processing system according to an embodiment;
  • FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to an embodiment;
  • FIG. 4 is a flow chart showing the procedure of content data generation processing performed by the information processing apparatus according to the present embodiment;
  • FIG. 5 is a schematic diagram showing an example of an utterance editing screen;
  • FIG. 6 is a schematic diagram for explaining a pronunciation correcting operation;
  • FIG. 7 is a schematic diagram showing an example of a pronunciation correction dialog box;
  • FIG. 8 is a schematic diagram showing an example of a content editing screen;
  • FIG. 9 is a schematic diagram showing an example of a content editing screen provided with a caption setting area;
  • FIG. 10 is a schematic diagram showing another example of the utterance editing screen;
  • FIG. 11 is a schematic diagram for explaining an example of a camerawork setting method;
  • FIG. 12 is a schematic diagram for explaining an example of a camerawork setting method;
  • FIG. 13 is a schematic diagram for explaining an example of a camerawork setting method;
  • FIG. 14 is a schematic diagram for explaining an overview of sound image localization technology;
  • FIG. 15 is a schematic diagram for explaining an example of a content switching setting method;
  • These are materials related to the information processing system according to the present embodiment.
  • <System overview> FIGS. 1 and 2 are schematic diagrams for explaining an outline of the information processing system according to this embodiment.
  • a presenter who gives a presentation or the like prepares presentation materials in advance using, for example, so-called presentation software, as in the conventional art.
  • This presentation material includes images of a plurality of pages, etc., and the presentation is made by displaying these in order (so-called slide show).
  • an online presentation is given by a presenter using presentation materials in an online conference in which a plurality of participants participate via a network.
  • in the information processing system, for example, text information for the avatar's speech is extracted from the text written in the presentation material (see FIG. 1).
  • audio information may be acquired by recording the audio of the online presentation given by the presenter (see FIG. 2). It should be noted that not only audio but also video may be recorded together. Also, the presentation by the presenter does not have to be an online presentation; for example, the voice of an offline presentation may be recorded with a recording device.
  • voice information obtained by recording the voice of the presenter's presentation can be used.
  • the information processing system converts the voice information into text information by voice recognition processing.
  • based on the presentation materials and the text information, the information processing system generates content data including a moving image in which the presenter's avatar gives a presentation using the presentation materials. By displaying the generated content data on a display, projecting it with a projector, or distributing it on a moving image distribution site or the like, the presenter does not have to repeatedly present the same content.
  • the voice information recorded by the presenter is converted into text information, but it is not limited to this.
  • the presenter may create text information of lines to be spoken at the time of presentation.
  • the information processing system acquires presentation materials and text information prepared in advance by the presenter, and generates content data based on these. That is, it does not matter whether text information used by the information processing system to generate content data is converted from voice information by voice recognition.
  • the presenter can generate content data in which his or her avatar gives the presentation, without having to present in person.
  • the information processing system acquires the above presentation materials and text information, and generates voice information by reading out the text information using synthesized voice.
  • the information processing system allows the avatar of the presenter, which is selected from data of a plurality of avatars prepared in advance, to perform mouth movements and gestures corresponding to the generated voice information, so that the avatar can express lines related to the presentation.
  • the information processing system uses images such as multiple pages of slides included in the acquired presentation material as background image data, superimposes the avatar video on this background image, and adds audio information to generate content data.
  • the content data is output as, for example, a moving image file, and can be used for display on an appropriate display device, projector, or the like, or for distribution on a moving image distribution site or the like.
  • for this content data, settings such as the appearance of the avatar, the gestures performed by the avatar, the characteristics of the voice output as the avatar's utterances, or the pronunciation of words output as voice are received from the presenter, and content data reflecting these settings is generated.
  • the information processing system can be expected to support generation of content data suitable for the presenter's preference and purpose.
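To make the overall flow concrete, the following is a minimal Python sketch of the pipeline just described: acquire slides and text, synthesize voice, animate the avatar, and compose the result. All function and field names are illustrative placeholders standing in for the processing described in the embodiment; they are not part of the disclosure.

```python
from dataclasses import dataclass, field

@dataclass
class AvatarSettings:
    # Settings the presenter can choose (appearance, voice, pronunciation).
    appearance: str = "default"
    voice_profile: str = "neutral"
    pronunciation_overrides: dict[str, str] = field(default_factory=dict)

# The three helpers below are trivial stand-ins for the text-to-speech,
# avatar-animation and compositing steps described in this embodiment; real
# implementations would call a speech synthesizer, a 3D renderer and a video
# encoder.
def synthesize_voice(texts: list[str], settings: AvatarSettings) -> list[str]:
    return [f"<audio:{settings.voice_profile}:{t}>" for t in texts]

def animate_avatar(audio_clips: list[str], settings: AvatarSettings) -> list[str]:
    return [f"<avatar:{settings.appearance} lip-synced to {a}>" for a in audio_clips]

def compose_video(slides: list[str], avatar_clips: list[str], audio_clips: list[str]) -> str:
    scenes = [f"{s} + {v} + {a}" for s, v, a in zip(slides, avatar_clips, audio_clips)]
    return "presentation.mp4 <- " + " | ".join(scenes)

settings = AvatarSettings(appearance="Dr. Value", voice_profile="calm")
audio = synthesize_voice(["Hello. My name is Dr. Value."], settings)
video = compose_video(["slide1.png"], animate_avatar(audio, settings), audio)
print(video)
```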
  • FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to this embodiment.
  • the information processing apparatus 1 according to the present embodiment includes a processing unit 11, a storage unit (storage) 12, a communication unit (transceiver) 13, a display unit (display) 14, an operation unit 15, and the like.
  • the information processing device 1 according to the present embodiment can be configured using a general-purpose information processing device such as a personal computer or a tablet terminal device. In this embodiment, one information processing apparatus 1 performs the processing, but a plurality of information processing apparatuses may perform the processing in a distributed manner.
  • the user who uses the information processing device 1 is assumed to be the presenter, but the user is not limited to this and may be someone other than the presenter.
  • the processing unit 11 is configured using an arithmetic processing unit such as a CPU (Central Processing Unit), MPU (Micro-Processing Unit), GPU (Graphics Processing Unit) or quantum processor, together with a ROM (Read Only Memory), RAM (Random Access Memory), and the like. By reading and executing the program 12a stored in the storage unit 12, the processing unit 11 performs various processes such as a process of acquiring presentation materials, text information, and the like, a process of accepting various settings from the user, and a process of generating content data based on the acquired information and the received settings.
  • the storage unit 12 is configured using a large-capacity storage device such as a hard disk.
  • the storage unit 12 stores various programs executed by the processing unit 11 and various data required for processing by the processing unit 11 .
  • the storage unit 12 stores a program 12a executed by the processing unit 11.
  • The storage unit 12 is also provided with a content data storage unit 12b that stores content data generated by the information processing apparatus 1.
  • the program (computer program, program product) 12a is provided in a form recorded in a recording medium 99 such as a memory card or an optical disk, and the information processing apparatus 1 reads out the program 12a from the recording medium 99 and stores it in the storage unit 12.
  • the program 12a may be written in the storage unit 12 at the manufacturing stage of the information processing device 1, for example.
  • the program 12a may be distributed by a remote server device or the like and acquired by the information processing device 1 through communication.
  • the program 12a may be one that is recorded in the recording medium 99 by a writing device and that the information processing device 1 reads out and writes in the storage unit 12.
  • the program 12a may be provided in the form of distribution via a network, or may be provided in the form of being recorded on the recording medium 99.
  • the content data storage unit 12b stores content data generated by the information processing device 1 based on information such as presentation materials and text information.
  • the content data is stored in the content data storage unit 12b as a moving image file in MPEG-4 format, for example.
  • the content data storage unit 12b may store various information such as the title of the presentation, the name of the presenter, the date and time of the presentation, or an overview of the contents of the presentation, together with the moving image file.
  • the communication unit 13 communicates with various devices via a network N including, for example, the Internet, a LAN (Local Area Network), or a mobile phone communication network.
  • the information processing device 1 can perform processing such as acquisition (downloading) of the program 12a, implementation of an online presentation, and distribution of generated content data by communicating with other devices through the communication unit 13.
  • the communication unit 13 transmits the data given from the processing unit 11 to other devices, and gives the data received from the other devices to the processing unit 11 .
  • the display unit 14 is configured using a liquid crystal display or the like, and displays various images, characters, etc. based on the processing of the processing unit 11.
  • the operation unit 15 receives a user's operation and notifies the processing unit 11 of the received operation.
  • the operation unit 15 receives a user's operation using an input device such as mechanical buttons or a touch panel provided on the surface of the display unit 14 .
  • the operation unit 15 may be an input device such as a mouse and a keyboard, and these input devices may be detachable from the information processing apparatus 1 .
  • the program 12a stored in the storage unit 12 is read out and executed by the processing unit 11, whereby the information acquisition unit 11a, the avatar data generation unit 11b, the voice data generation unit 11c, the background data generation unit 11d, the content data generation unit 11e, the display processing unit 11f, and the like are implemented in the processing unit 11 as software functional units.
  • among the functional units of the processing unit 11, functional units related to content data generation are illustrated, and functional units related to other processes are omitted.
  • the information acquisition unit 11a performs processing for acquiring information such as presentation materials and text information necessary for generating content data.
  • the information acquisition unit 11a acquires information on presentation materials prepared in advance by the user.
  • the user prepares in advance a multi-page image (slides) including sentences summarizing the content of the presentation, graphs, illustrations, etc., as a presentation material using, for example, existing presentation software.
  • the presentation material may be created by the information processing device 1 or by another device.
  • the data of the presentation material prepared in advance by the user is stored in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a acquires the presentation material by reading it out from the storage unit 12.
  • the information acquisition unit 11a acquires text information corresponding to the lines spoken by the presenter's avatar in the generated content data.
  • sentences, characters, words, or the like described in the presentation material can be used as the avatar's lines.
  • the creator sets in advance which of the sentences included in the presentation material is to be used as the speech of the avatar using the comment function or the like of the presentation software for creating this presentation material.
  • the information acquisition unit 11a can recognize comments and the like attached to the presentation material and extract sentences and the like included in the presentation material as text information for making the avatar's lines.
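As a concrete illustration of this step, the sketch below reads per-slide narration text out of a presentation file. It assumes the material is a .pptx file and uses the speaker-notes field via the third-party python-pptx package as a stand-in for the comment function mentioned above; the embodiment itself does not specify a file format or library.

```python
# Pull per-slide narration text out of a .pptx presentation (assumed format).
from pptx import Presentation

def extract_narration(path: str) -> list[str]:
    texts = []
    for slide in Presentation(path).slides:
        if slide.has_notes_slide:
            note = slide.notes_slide.notes_text_frame.text.strip()
            if note:
                texts.append(note)  # one narration block per slide
    return texts

if __name__ == "__main__":
    for number, line in enumerate(extract_narration("presentation.pptx"), start=1):
        print(number, line)
```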
  • the information acquisition unit 11a can use the lines spoken by the presenter when the presenter actually made a presentation based on the presentation material as the lines spoken by the avatar in the content data.
  • the presentation is video-recorded or recorded, and voice information including the lines spoken by the presenter is prepared in advance.
  • the user stores this voice information in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a acquires the voice information by reading out the voice information stored in the storage unit 12.
  • the information acquisition unit 11a that has acquired the voice information acquires text information by, for example, performing so-called voice recognition processing on this voice information and converting the voice information into text information.
  • the information processing device 1 may perform the speech recognition processing itself, or may transmit the speech information to another device that performs the speech recognition processing, and acquire the text information converted by the speech recognition processing by the other device.
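As one hedged example of this conversion, the sketch below transcribes a recorded WAV file with the third-party SpeechRecognition package and its free web recognizer; the embodiment leaves the choice of speech recognition engine, local or remote, open.

```python
# Convert recorded presentation audio into text information (one possible approach).
import speech_recognition as sr

def transcribe(path: str, language: str = "ja-JP") -> str:
    recognizer = sr.Recognizer()
    with sr.AudioFile(path) as source:      # expects WAV/AIFF/FLAC input
        audio = recognizer.record(source)   # read the whole file
    return recognizer.recognize_google(audio, language=language)

if __name__ == "__main__":
    print(transcribe("presentation_audio.wav"))
```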
  • as a method of acquiring text information, it is also possible to adopt a method in which the user directly creates text information corresponding to the avatar's lines.
  • the user creates sentences corresponding to the avatar's dialogue using, for example, a text editor or sentence creation software, and stores the sentences in the storage unit 12 as text information.
  • the text information may be created by the information processing device 1 or by another device.
  • the user stores the created text information in the storage unit 12, and the information acquisition unit 11a can acquire the text information by reading the text information stored in the storage unit 12.
  • the avatar data generation unit 11b performs processing for generating data related to the presenter's avatar appearing in the content data.
  • the avatar data generation unit 11b displays, for example, a list of information about a plurality of avatars stored in the database on the display unit 14, and receives selection of an avatar from the user.
  • the avatar data generation unit 11b acquires data of the selected avatar from the database, and displays a preview screen showing the appearance of the avatar on the display unit 14 based on the acquired data.
  • the avatar data generation unit 11b accepts an editing operation such as the color or shape of the avatar from the user on this preview screen, and uses the edited avatar as an avatar to appear in the content data.
  • the avatar data generation unit 11b accepts various settings from the user, such as the position at which the avatar is displayed, the direction of the avatar, and the movements (gestures) performed by the avatar in the content data to be generated, and reflects these settings in the avatar data.
  • the avatar data generation unit 11 b generates avatar data including data such as the shape of the avatar and settings such as the display position of the avatar, and stores the data in the storage unit 12 .
  • a plurality of avatars may appear in the content data, and in this case, the avatar data generating section 11b generates avatar data for the plurality of avatars.
  • the audio data generation unit 11c performs processing for generating audio data spoken by the avatar in the content data.
  • the voice data generation unit 11c performs so-called text-to-speech processing based on the text information acquired by the information acquisition unit 11a, thereby converting the text information into voice data. Since the text-to-speech processing is an existing technology, detailed description is omitted.
  • the information processing apparatus 1 may perform the text-to-speech process by itself, or may transmit text information to another apparatus that performs the text-to-speech process, and acquire voice data converted by the text-to-speech process by the other apparatus.
  • the voice data generation unit 11c sequentially displays one or more texts included in the text information acquired by the information acquisition unit 11a on the display unit 14, and accepts selection of texts to be converted into voice data.
  • the voice data generation unit 11c receives from the user settings related to, for example, the pitch, speed, depth (thickness of voice), voice quality, or volume of the voice data to be generated, and generates audio data that reflects the accepted settings.
  • the audio data generator 11c outputs the generated audio data from an audio output device such as a speaker or an earphone.
  • the voice data generation unit 11c accepts settings for association between avatars and texts, and also accepts settings such as speed or voice quality for each avatar.
  • the voice data generation unit 11c receives settings related to pronunciation for, for example, words or short sentences included in the text information, and corrects the pronunciation of the target words included in the voice data.
  • suppose, for example, that the text information includes the word "Yukawa" and that the voice data generated by the voice data generation unit 11c pronounces this word as "Yugawa".
  • in this case, the user selects "Yukawa" from the displayed sentences and sets "Yukawa" as the correct pronunciation of this word.
  • the voice data generation unit 11c that receives this setting generates voice data in which the pronunciation of all "Yukawa" included in the text information is changed from "Yugawa" to "Yukawa".
  • this example deals with words written in ideograms (Chinese characters) in the text information, and is an example in which the pronunciation is set in phonetic characters (katakana or hiragana).
  • the setting is not limited to this.
  • the voice data generation unit 11c may accept settings for pronunciation using, for example, phoneme characters (romaji) or phonetic symbols.
  • the voice data generation unit 11c may receive settings such as the position of an accent for the pronunciation of words.
  • the background data generation unit 11d performs processing for generating image data that serves as the background of the avatar in the content data.
  • a plurality of images (slides) included in the presentation material acquired by the information acquisition unit 11a are used as the background image of the avatar, and content data is generated in which the avatar makes a presentation using the presentation material.
  • the background data generation unit 11d receives settings such as display order and display switching timing for a plurality of images included in the presentation material acquired by the information acquisition unit 11a.
  • the background data generation unit 11 d generates background data including a plurality of background images and settings such as timings for displaying the background images, and stores the background data in the storage unit 12 .
  • the background data generation unit 11d also performs a process of adding a caption character string such as a title or subtitles to the background image based on the presentation material.
  • the background data generation unit 11d accepts input of a character string to be displayed as a caption character string from the user, and also accepts settings such as the position and direction in which the caption character string is displayed, the size and font of the character string, and the timing at which the caption character string is displayed.
  • the background data generation unit 11d stores the caption character strings and settings related to them in the background data.
  • based on the avatar data generated by the avatar data generation unit 11b, the audio data generated by the audio data generation unit 11c, and the background data generated by the background data generation unit 11d, the content data generation unit 11e generates, as content data, data of a moving image in which, for example, the avatar gives a presentation using the presentation material.
  • the content data generation unit 11e arranges the avatar included in the avatar data at a position, in an orientation, and the like according to the settings with respect to the background image included in the background data, and outputs the voice included in the voice data at an appropriate timing, thereby generating the content data.
  • the content data generation unit 11 e stores the generated content data in the content data storage unit 12 b of the storage unit 12 .
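A rough sketch of this composition step is shown below, assuming moviepy 1.x and pre-rendered per-scene assets (a slide image, an avatar clip with an alpha mask, and a narration file); the file names and layout are illustrative assumptions, not the disclosed implementation.

```python
# Compose each slide image with an overlaid avatar clip and narration audio.
from moviepy.editor import (AudioFileClip, CompositeVideoClip, ImageClip,
                            VideoFileClip, concatenate_videoclips)

def build_scene(slide_png: str, avatar_mov: str, narration_wav: str) -> CompositeVideoClip:
    narration = AudioFileClip(narration_wav)
    background = ImageClip(slide_png).set_duration(narration.duration)
    avatar = (VideoFileClip(avatar_mov, has_mask=True)
              .set_duration(narration.duration)
              .set_position(("right", "bottom")))   # avatar in front of the slide
    return CompositeVideoClip([background, avatar]).set_audio(narration)

scenes = [build_scene(f"slide_{i}.png", f"avatar_{i}.mov", f"voice_{i}.wav")
          for i in range(1, 4)]
concatenate_videoclips(scenes).write_videofile("content.mp4", fps=24)
```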
  • the display processing unit 11f performs processing for displaying various information such as images and characters on the display unit 14.
  • the display processing unit 11f performs processing for displaying, for example, a screen for accepting settings related to avatars, a screen for accepting settings related to voice, a screen for accepting settings related to the background, and the generated content data.
  • the display processing unit 11f not only displays these items on the display unit 14 provided in the information processing device 1, but may also transmit the display data to other devices through the communication unit 13 so that it is displayed on the display units or the like of those devices.
  • FIG. 4 is a flow chart showing the procedure of content data generation processing performed by the information processing apparatus 1 according to the present embodiment.
  • the information acquisition unit 11a of the processing unit 11 of the information processing apparatus 1 according to the present embodiment acquires a presentation material file or the like created in advance by the user by reading it from the storage unit 12 (step S1).
  • the information acquisition unit 11a also acquires text information for the avatar's speech (step S2). At this time, the information acquisition unit 11a acquires sentences and the like included in the presentation material as text information to be uttered by the avatar, based on, for example, comments preset in the presentation material acquired in step S1. The information acquisition unit 11a may also acquire a file or the like of voice information recorded when the presenter gave a presentation, and convert the voice information into text information by voice recognition processing, thereby acquiring the text information. The information acquisition unit 11a may also acquire text information created in advance by the user writing the avatar's lines.
  • the display processing unit 11f of the processing unit 11 displays, on the display unit 14, an utterance editing screen for making settings when uttering the text information based on the text information acquired in step S2 (step S3).
  • the voice data generation unit 11c of the processing unit 11 accepts editing of the spoken voice by accepting the user's operation on the operation unit 15 while the spoken voice editing screen is displayed (step S4).
  • the voice data generating unit 11c reflects the edited content received in step S4 and generates voice data based on the text information (step S5).
  • FIG. 5 is a schematic diagram showing an example of the speech editing screen.
  • the information processing apparatus 1 displays the illustrated speech editing screen based on the text information acquired in step S2.
  • the utterance editing screen, for example, shows a title string of "spoken voice editing" at the top; below it, a button labeled "output all audio" and a button labeled "add text" are arranged on the left and right; and below these, a setting table 101 is provided in which a plurality of texts included in the text information and a plurality of setting items related to each text are arranged in a matrix.
  • the setting table 101 is a table in which a plurality of texts are arranged in a list in the vertical direction and a plurality of setting items are arranged in the horizontal direction.
  • the setting table 101 has, for example, items of "number”, “text”, “interval (seconds)", “speaker” and “expression” in order from the left, and an icon area is provided at the right end.
  • “Number” is numerical information indicating the order in which text is spoken by avatars in the finally generated content data.
  • "Text" is the text (sentences, lines, etc.) uttered by the avatar, and is character string information of one or more characters. In this example, the first text is "Hello. My name is Dr. Value."
  • the information processing apparatus 1 appropriately divides the sentences included in the acquired text information into a plurality of texts based on punctuation marks and the like, and assigns numbers in order, thereby obtaining the information to be displayed in the "number" and "text" columns of the setting table 101 shown in the figure.
  • the division of the sentences included in the text information into a plurality of texts may be performed, for example, in the speech recognition processing, may be performed in advance by the user, or may be performed when the information processing apparatus 1 acquires the text information.
  • when dividing text in the speech recognition processing, for example, if there is an interval exceeding a predetermined time between utterances, the preceding and following utterances can be divided into two texts.
  • when the user divides the text, for example, the user checks the text information with a text editor or the like and inserts a line feed or tab at an appropriate location, thereby dividing the text.
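For illustration, a minimal sketch of such punctuation-based division is given below; the default 0.5-second interval mirrors the setting table described next, and the simple splitting rule is an assumption rather than the disclosed algorithm.

```python
# Split acquired text information into numbered utterance texts on punctuation.
import re

def split_into_texts(raw: str) -> list[dict]:
    # Japanese and Western sentence-ending marks are both treated as split points.
    sentences = [s.strip() for s in re.split(r"(?<=[。．！？.!?])\s*", raw) if s.strip()]
    # Assign sequential numbers and a default 0.5-second interval, mirroring
    # the "number" and "interval" columns of the setting table.
    return [{"number": i, "text": s, "interval_sec": 0.5}
            for i, s in enumerate(sentences, start=1)]

print(split_into_texts("こんにちは。バリュー博士と申します。Nice to meet you!"))
```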
  • "Interval (seconds)" in the setting table 101 is an item for setting, in units of seconds, the interval provided between the utterance of this text and that of the previous text. In this example, 0.5 seconds is set by the information processing apparatus 1 as a default value.
  • “Speaker” is an item for setting which avatar speaks this text. In this example, it is set that "Dr. Value” speaks the first text and "College” speaks the second text.
  • the information processing apparatus 1 can accept the setting of the "speaker” by accepting the selection of one avatar from the pre-registered avatars, for example, using a pull-down menu or the like.
  • “Facial expression” is an item for setting the facial expression of the avatar when the avatar set with this text speaks.
  • facial expressions such as "natural” and “smiling” are set.
  • the information processing apparatus 1 can accept the setting of the "facial expression” by accepting the selection of one facial expression from pre-registered facial expressions, for example, using a pull-down menu or the like.
  • the information processing device 1 displays, for example, an icon resembling a speaker and an icon resembling a trash can in the rightmost icon area of the setting table 101 in association with each text.
  • the icon imitating a speaker is for accepting an operation for outputting the corresponding text by voice.
  • when an operation on this icon is accepted, the information processing apparatus 1 outputs only the corresponding text as voice.
  • the trash can icon is for accepting an operation to delete this text.
  • when an operation on this icon is accepted, the information processing apparatus 1 deletes the corresponding text and its settings.
  • the "output all audio" button provided at the top of the utterance editing screen is a button for outputting all of the text as voice.
  • when the information processing apparatus 1 accepts an operation on the "output all audio" button, it sequentially outputs as voice all the texts included in the setting table 101 from the beginning to the end.
  • the "add text” button is a button for adding arbitrary text.
  • when an operation on the "add text" button is accepted, a dialog box for adding text (not shown) is displayed, and input of the text to be added and settings such as the order in which the text is uttered, the interval, the speaker, and the facial expression are accepted.
  • the information processing apparatus 1 adds the text by inserting the text and setting received in this dialog box into the setting table 101 at an appropriate position.
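The rows of the setting table 101 can be thought of as records like the following sketch, where the field names mirror the columns described above; the data structure itself is an illustrative assumption.

```python
# One row of the setting table 101: utterance order, text, interval, speaker, expression.
from dataclasses import dataclass

@dataclass
class UtteranceRow:
    number: int
    text: str
    interval_sec: float = 0.5       # pause before this utterance
    speaker: str = "Dr. Value"      # which avatar speaks the text
    expression: str = "natural"     # avatar facial expression while speaking

rows = [
    UtteranceRow(1, "Hello. My name is Dr. Value.", 0.5, "Dr. Value", "smiling"),
    UtteranceRow(2, "Nice to meet you.", 0.5, "College", "natural"),
]
for row in rows:
    print(row)
```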
  • FIG. 6 is a schematic diagram for explaining the pronunciation correction operation.
  • FIG. 6 shows an utterance editing screen in which the setting table 101 is set with the text "Hello. My name is Yukawa.” The user uses an input device such as a mouse to select the word "Yukawa" included in this text.
  • when the word is selected, the information processing apparatus 1 displays a button labeled, for example, "pronunciation correction". When an operation on this button is accepted, the information processing apparatus 1 displays, for example, a pronunciation correction dialog box and accepts settings regarding pronunciation from the user.
  • FIG. 7 is a schematic diagram showing an example of a pronunciation correction dialog box.
  • the upper part of FIG. 7 shows the state before pronunciation correction, and the lower part shows the state after pronunciation correction.
  • in the pronunciation correction dialog box of this example, for example, the title string "Pronunciation correction" is displayed at the top, a text box labeled "Target text" and a text box labeled "Pronunciation" are arranged vertically below it, and a button labeled "Voice output" and a button labeled "Complete" are arranged horizontally below the text boxes.
  • the information processing device 1 displays the word selected by the user on the speech editing screen in the "target text” text box.
  • "Yukawa” selected on the utterance editing screen shown in FIG. 6 is displayed in the "target text” text box.
  • the information processing apparatus 1 also displays the pronunciation when the target text is uttered in the "pronunciation” text box in phonetic notation such as katakana or hiragana.
  • the "Voice output" button in the pronunciation correction dialog box is a button for outputting only the word indicated in "Target text" as voice.
  • when an operation on this button is accepted, the information processing apparatus 1 performs voice output by reading out only the word indicated in "Target text" with the pronunciation set in "Pronunciation". In the example in the upper part of FIG. 7, voice output is performed with the pronunciation of "Yugawa".
  • the information processing device 1 accepts the correction of the pronunciation of the target text by accepting the user's correction of the phonetic notation displayed in the "pronunciation" text box.
  • the user corrects "yugawa” displayed in the text box as the current pronunciation of "yukawa” to "yukawa” using an input device such as a keyboard.
  • when the "Voice output" button is operated after this correction, the information processing apparatus 1 performs voice output with the pronunciation of "Yukawa".
  • the "Complete" button in the pronunciation correction dialog box is a button for reflecting the pronunciation correction and closing this dialog box.
  • when an operation on the "Complete" button is accepted, the information processing apparatus 1 stores the word in "Target text" of the pronunciation correction dialog box in association with the pronunciation set in the "Pronunciation" text box, and generates audio data in which the set pronunciation is applied to all of the same words included in the text information.
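A minimal sketch of applying such a registered pronunciation to every occurrence of the word before text-to-speech might look as follows; the simple string substitution is an assumption standing in for whatever reading-control mechanism the speech synthesizer actually offers.

```python
def apply_pronunciations(texts: list[str], overrides: dict[str, str]) -> list[str]:
    # Replace each registered word with its corrected reading so that the
    # downstream text-to-speech step pronounces it as set by the user.
    corrected = []
    for text in texts:
        for word, reading in overrides.items():
            text = text.replace(word, reading)
        corrected.append(text)
    return corrected

# e.g. 湯川 would otherwise be read "Yugawa"; the user registers the reading "ユカワ" (Yukawa).
print(apply_pronunciations(["こんにちは。湯川と申します。"], {"湯川": "ユカワ"}))
```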
  • the display processing unit 11f displays a content editing screen on the display unit 14 (step S6), and the avatar data generation unit 11b and the background data generation unit 11d of the processing unit 11 accept editing of the avatar and the background by accepting the user's operations on the operation unit 15 while the content editing screen is displayed (step S7).
  • the avatar data generation unit 11b generates avatar video data reflecting the editing contents accepted in step S7 (step S8).
  • the background data generation unit 11d also generates background image data reflecting the editing contents accepted in step S7 (step S9).
  • FIG. 8 is a schematic diagram showing an example of the content editing screen.
  • the information processing apparatus 1 displays the illustrated content editing screen based on the presentation material acquired in step S1 and the text information acquired in step S2.
  • the content editing screen shows, for example, a title string of "content editing" at the top, and a background image selection area 111, a content editing area 112, and an avatar setting area 113 are arranged in the horizontal direction below the title string.
  • the background image selection area 111 of the content editing screen is an area for accepting selection of a background image by the user.
  • the information processing apparatus 1 displays a list of a plurality of slides included in the presentation material in the background image selection area 111 as background images.
  • a plurality of background images listed in the background image selection area 111 are displayed in the order in which they are arranged in this area.
  • the information processing apparatus 1 accepts selection of one background image from among the plurality of background images displayed in the background image selection area 111 and displays the selected background image in the content editing area 112 .
  • the information processing apparatus 1 also accepts operations such as addition, deletion, and change of the display order of background images, and performs addition, deletion, order change, and the like for the plurality of background images displayed in a list according to the accepted operations.
  • the avatar setting area 113 of the content editing screen is an area for accepting settings related to one or more avatars appearing in the content data.
  • an avatar selection area, a text display area, a setting reception area, and the like are arranged vertically.
  • the information processing apparatus 1 displays a list of images, names, etc. of one or more avatars created in advance in the avatar selection area of the avatar setting area 113 .
  • two avatars "Dr. Value” and “College” are displayed in the avatar selection area, and the avatar "Dr. Value” is selected.
  • the information processing device 1 displays the avatar selected in the avatar selection area in the content editing area 112 .
  • avatar creation is performed on an avatar creation screen or the like.
  • the avatar need not be created by the user; the user may, for example, acquire and use an avatar provided for a fee or free of charge. Since the method of creating an avatar is an existing technology, detailed description is omitted.
  • the text display area of the avatar setting area 113 is an area where the text spoken by this avatar is displayed. Based on the text information acquired in step S2, or on the text information obtained by editing it on the above-described utterance editing screen, the information processing apparatus 1 displays one or more texts included in the text information in the text display area. The information processing apparatus 1 selects the plurality of texts included in the text information in output order and displays them in the text display area, and the user can change the text displayed in the text display area as appropriate.
  • the setting acceptance area of the avatar setting area 113 is an area that accepts settings for a plurality of setting items related to the avatar selected in the avatar selection area.
  • “gesture”, “size”, “orientation”, “position” and the like are shown as setting items related to the avatar.
  • the user can input or select a setting value for each setting item by various methods such as direct numerical input or selection from a pull-down menu.
  • the illustrated setting items are merely an example, and the information processing apparatus 1 may receive settings related to avatars by providing various setting items other than the illustrated setting items.
  • the information processing apparatus 1 may provide, in the setting reception area, setting items related to the voice with which the avatar utters the text, for example, setting items such as the pitch, speed, depth, or volume of the voice uttered by the avatar, and accept these settings from the user.
  • the content editing area 112 of the content editing screen displays the avatar selected in the avatar selection area of the avatar setting area 113 superimposed on the background image selected in the background image selection area 111 .
  • an image that reproduces one scene of the finally generated content data is displayed.
  • the user can change the position, orientation, and the like of the avatar displayed in the content editing area 112 by, for example, a mouse operation or a touch operation, and the information processing apparatus 1 accepts such changes to settings such as the position and orientation.
  • the information processing apparatus 1 changes the setting value of the corresponding setting item provided in the setting accepting area of the avatar setting area 113 .
  • information processing apparatus 1 changes the display mode of the avatar displayed in content editing area 112 according to the accepted setting.
  • the user can add a caption to the background image by performing a predetermined operation in the content editing area 112 .
  • as the predetermined operation, for example, when the user designates one point in the content editing area 112 using a function such as a right-click menu of a mouse and performs an operation to add a caption, the information processing apparatus 1 accepts input of a caption character string at that point in the content editing area 112.
  • a caption setting area is displayed instead of the avatar setting area 113 of the content editing screen.
  • FIG. 9 is a schematic diagram showing an example of a content editing screen on which a caption setting area 114 is provided.
  • the caption setting area 114 of the content editing screen is an area for receiving settings related to the caption character string input in the content editing area 112 .
  • the caption setting area 114 is provided with setting items such as "font type", "size", and "position".
  • a character string "Introduction to Business in the Digital Age” is entered as a caption in a text box indicated by a dashed rectangular frame in the content editing area 112 .
  • the information processing apparatus 1 receives settings for the caption in each setting item of the caption setting area 114, and displays the caption in a display mode according to the received settings.
  • the information processing apparatus 1 stores information such as the input caption character string and caption settings together with, for example, the background image data.
  • having performed the processing related to content editing in steps S6 to S9 of the flowchart shown in FIG. 4, the content data generation unit 11e of the processing unit 11 of the information processing apparatus 1 generates content data (step S10), stores the generated content data in the content data storage section 12b of the storage section 12, and ends the process.
  • the content data generation unit 11e integrates the audio data generated in step S5, the avatar video data generated in step S8, and the background image data generated in step S9, and generates content data in which the avatar superimposed on the background image speaks.
  • the information processing apparatus 1 acquires, for example, presentation materials, text information, and the like, as well as image files and the like of various parts to be included in the content.
  • the information processing apparatus 1 arranges these various parts together with the avatar in an appropriate position and order by accepting a user's operation on the content editing screen, for example.
  • the user can perform editing such as placing a lectern in front of the avatar or placing decorative parts between the avatar and the background, and can thereby express depth on the screen and enhance the realism of the content.
  • the user can also hide the avatar on the screen, for example, by placing the avatar behind the background.
  • the lines of the hidden avatar can be output as narration, allowing the user to create more effective content.
  • the text information acquired together with the presentation material may contain a large amount of text that becomes the dialogue of the avatar.
  • In the present embodiment, each of the many lines included in the text information is associated, for example on the content editing screen, with the avatar that speaks it.
  • the information processing apparatus 1 may assign all lines to one pre-selected avatar, for example, when acquiring presentation materials and text information.
  • when the information processing apparatus 1 assigns lines collectively in this manner, the user can generate content data without having to perform an operation for assigning lines to avatars on the subsequent content editing screen or the like.
  • in this case, the information processing apparatus 1 may accept an editing operation from the user on the content editing screen or the like, such as reallocating lines allocated to a first avatar to a second avatar.
  • the information processing device 1 that has generated the content data can reproduce the content data using, for example, a video reproduction application program and display it on the display unit 14 .
  • the information processing device 1 may also upload content data to, for example, a video distribution site.
  • FIG. 10 is a schematic diagram showing another example of the speech editing screen.
  • the information processing apparatus 1 according to the present embodiment may display, for example, an utterance editing screen shown in FIG. 10 instead of the utterance editing screen shown in FIG.
  • the speech editing screen shown in FIG. 10 is suitable for creating content data in a format in which two avatars interact.
  • on the utterance editing screen shown in FIG. 10, a title string of "spoken voice editing screen" is displayed at the top of the screen, and the names of the two avatars are displayed side by side below it.
  • the left side of the screen is used as an area for displaying information such as the utterance content of "Dr. Value", and the right side of the screen is used as an area for displaying information such as the utterance content of "College".
  • the text information spoken by each avatar is placed in a rectangular frame, and multiple pieces of text information are displayed in chronological order from top to bottom of the screen.
  • the text information about "Dr. Value” is displayed on the left side of the screen
  • the text information about "College” is displayed on the right side of the screen.
  • the user can scroll a plurality of pieces of text information in chronological order by, for example, performing a slide operation in the vertical direction, and can confirm a plurality of pieces of text information that cannot fit on one screen.
  • the user can arbitrarily edit the text information contained within the rectangular frame.
  • one or more icons are provided, for example, in the lower right corner of the rectangular frame in which the text information of each avatar is placed. Note that in this figure, these icons are shown in a simplified form as square figures. These icons are for accepting various operations from the user, such as accepting settings for the corresponding text information, accepting an operation to output the corresponding text information as voice, or accepting an operation to delete the corresponding text information.
  • a rectangular frame that is long in the horizontal direction is displayed between the text information of two utterances that are continuous in time series, indicating the time setting of the interval to be provided between utterances. .
  • in the illustrated example, a rectangular frame containing the character string "interval: 0.5 seconds" is displayed between the text information "Hello. Nice to meet you." of "Dr. Value" and the following text information of "College". This indicates that there is an interval of 0.5 seconds between the utterance of "Dr. Value" and the utterance of "College", that is, a period during which neither avatar speaks.
  • the user can arbitrarily set the interval time by correcting the numerical values in the rectangular frame.
  • the information processing apparatus 1 displays the utterance editing screen in which a plurality of pieces of text information spoken by the avatars are arranged in time series in the vertical direction of the screen and the text information spoken by the two avatars is displayed separately on the left and right sides of the screen. As a result, the user can expect to easily generate content data such as a moving image in which, for example, two avatars give a presentation while talking with each other.
  • alternatively, the information processing device 1 may arrange the plurality of pieces of text information uttered by the avatars in chronological order in the horizontal direction of the screen, and display the text information uttered by the two avatars separately in the upper and lower parts of the screen.
  • Information processing apparatus 1 may use a three-dimensional model, that is, a three-dimensional character object reproduced in a three-dimensional virtual space as an avatar displayed in content data.
  • the information processing apparatus 1 reads data of a three-dimensional model of an avatar created in advance or newly created by a user, and reproduces this avatar in a three-dimensional virtual space.
  • the information processing device 1 can generate content data by acquiring a two-dimensional image by photographing an avatar with a virtual camera appropriately arranged in a three-dimensional virtual space.
  • the information processing device 1 accepts from the user settings related to the position of the virtual camera in the three-dimensional virtual space, that is, camerawork, in order to shoot an image (moving image) of the avatar to be included in the content data.
  • the information processing device 1 can receive from the user settings such as the position of the virtual camera in the three-dimensional virtual space in the front-back, left-right, and up-down directions (x-coordinate, y-coordinate, and z-coordinate), the direction of the virtual camera from this position, and temporal changes of these.
  • the information processing apparatus 1 arranges a virtual camera in the three-dimensional virtual space according to the received settings, moves the virtual camera to photograph the avatar, and acquires a two-dimensional image to be included in the content data.
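As a simple illustration of the camera placement described above, the sketch below keeps a user-specified camera position together with a viewing direction computed toward the avatar; the vector math is generic, and the rendering of the 2D image itself would be done by whichever 3D engine hosts the avatar.

```python
# Compute a virtual-camera pose (position + viewing direction) aimed at the avatar.
import numpy as np

def camera_pose(camera_pos, avatar_pos):
    position = np.asarray(camera_pos, dtype=float)
    target = np.asarray(avatar_pos, dtype=float)
    forward = target - position
    forward /= np.linalg.norm(forward)   # unit vector pointing at the avatar
    return {"position": position, "forward": forward}

# Camera two metres in front of and slightly above the avatar's head height.
print(camera_pose(camera_pos=(0.0, 1.6, 2.0), avatar_pos=(0.0, 1.4, 0.0)))
```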
  • FIG. 11 is a schematic diagram for explaining an example of a camerawork setting method.
  • for example, when a predetermined operation, such as a right-click operation of a mouse, is performed on the avatar displayed in the content editing area 112 of the content editing screen, the information processing apparatus 1 displays a selection menu dialog box.
  • in the selection menu dialog box, for example, selection items such as "head shot", "upper body shot" and "whole body shot" are displayed vertically, and the user can select any one of these selection items.
  • when "head shot" is selected, the information processing apparatus 1 places the virtual camera close to the avatar in the three-dimensional virtual space so that the head of the avatar and its periphery are displayed, as shown in the lower left part of FIG. 11.
  • when "upper body shot" or "whole body shot" is selected, the information processing apparatus 1 places the virtual camera at a position in the virtual space suitable for displaying the upper body or the whole body of the avatar, as shown in the lower center or lower right part of FIG. 11.
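The shot presets could, for example, be mapped to camera framings as in the sketch below; the aim heights and distances are purely illustrative assumptions, not values taken from the disclosure.

```python
# Map the shot-type menu items to an aim height on the avatar and a camera distance.
SHOT_PRESETS = {
    "head shot":       {"aim_height": 1.6, "distance": 0.6},
    "upper body shot": {"aim_height": 1.2, "distance": 1.5},
    "whole body shot": {"aim_height": 0.9, "distance": 3.0},
}

def place_camera_for(shot: str, avatar_base=(0.0, 0.0, 0.0)):
    preset = SHOT_PRESETS[shot]
    x, y, z = avatar_base
    camera_pos = (x, y + preset["aim_height"], z + preset["distance"])
    look_at = (x, y + preset["aim_height"], z)
    return camera_pos, look_at

print(place_camera_for("head shot"))
```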
  • FIG. 12 is a schematic diagram for explaining an example of a camerawork setting method. For example, when a predetermined operation, such as a right-click operation of a mouse, is performed on the avatar displayed in the content editing area 112 on the content editing screen shown in FIG. 8, the information processing apparatus 1 displays a direction selection menu dialog box. In the direction selection menu, for example, selection items such as "left", "front" and "right" are displayed vertically, and the user can select any one of these selection items.
  • a predetermined operation such as a right-click operation of a mouse
  • the information processing device 1 places the virtual camera on the left side of the avatar as shown in the lower left part of FIG. 12, and shoots the avatar with the virtual camera.
  • Similarly, when "front" or "right" is selected, the information processing apparatus 1 places the virtual camera in front of or to the right of the avatar, as shown in the lower center or lower right part of FIG. 12.
  • Note that the left and right directions used for this setting are, for example, the left and right directions as seen from the virtual camera toward the avatar, but they are not limited to this; the left and right directions as seen from the avatar may be used instead.
  • FIG. 13 is a schematic diagram for explaining an example of a camerawork setting method. For example, when a predetermined operation is performed on the avatar displayed in the content editing area 112 of the content editing screen, the information processing apparatus 1 displays a slide bar (slider, slider bar, scroll bar, etc.) for setting the zoom near the avatar. In this example, a horizontally elongated slide bar is displayed below the avatar, and the user can slide the knob of the slide bar horizontally.
  • When the knob of the slide bar is slid in one direction, the information processing apparatus 1 moves the virtual camera away from the avatar (zooms out), as shown on the left side of FIG. 13.
  • When the knob is slid in the opposite direction, the information processing apparatus 1 moves the virtual camera closer to the avatar (zooms in), as shown on the right side of FIG. 13.
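The zoom slide bar can be mapped to the camera-to-avatar distance, for example by interpolating between a nearest and a farthest distance. The function below is a sketch under that assumption; the range values are not taken from the embodiment.

```python
def zoom_to_distance(slider_value: float,
                     min_distance: float = 0.5,
                     max_distance: float = 5.0) -> float:
    """Map a slider position in [0.0, 1.0] to a camera distance.
    0.0 = fully zoomed out (far), 1.0 = fully zoomed in (near)."""
    slider_value = min(max(slider_value, 0.0), 1.0)
    return max_distance - slider_value * (max_distance - min_distance)
```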
  • As described above, the information processing apparatus 1 according to the present embodiment receives settings such as the position and orientation of the virtual camera that photographs the avatar placed in the three-dimensional virtual space, photographs the avatar with the virtual camera according to the received settings, and generates content data including the two-dimensional image of the avatar obtained by this photographing. Accordingly, in the information processing system according to the present embodiment, the user can easily set the orientation, size, etc. of the avatar displayed in the content data.
  • The information processing apparatus 1 according to the present embodiment changes the behavior of the avatar included in the content data to be generated according to the region where the content data is provided or the age of the person to whom the content data is provided. For this purpose, the information processing apparatus 1 accepts from the user settings such as the region where the content data is provided or the age of the person to whom the content data is provided. For example, a country, a prefecture, a state, or the like can be adopted as the region for which the information processing apparatus 1 receives settings. Further, the information processing device 1 may accept an approximate age group such as the 20s or 30s as a setting, or may accept an age range such as 25 to 40 by numerical input. The age setting may also be received by methods other than these.
  • When the language spoken by the avatar is English, the information processing device 1 presents the user with options such as the United States, the United Kingdom, and Australia as regions, and accepts the setting of a region from among these. English pronunciation, accent, etc. differ from region to region, and the information processing device 1 converts the text information into voice so that the avatar speaks with the pronunciation, accent, etc. of the set region.
  • When the language spoken by the avatar is Japanese, the information processing device 1 presents the user with the names of regions such as the Kanto region, the Kansai region, and the Tohoku region as options, and accepts the selection of one of these regions.
  • the information processing apparatus 1 may also present the user with dialect names such as standard Japanese, Kansai dialect, and Tohoku dialect as options for regions.
  • the information processing device 1 converts the text information into voice so that the avatar speaks with pronunciation and accent according to the dialect of the set region.
  • When the conversion from text information to speech data is performed using a learning model generated by machine learning, the information processing apparatus 1 prepares learning models in which pronunciation, accent, and the like have been learned for each region, and can convert the text information into speech data by using a different learning model depending on the region set by the user.
  • the information processing device 1 changes the phrases or words included in the text information output as speech by the avatar according to the set region.
  • The information processing device 1 has, for example, a database that associates phrases or words that can be included in the text information with the expressions used in each region; based on this database and the region set by the user, the phrases or words included in the text information are replaced with expressions suitable for the region, and voice data is generated based on the text information after replacement.
  • the information processing apparatus 1 may change the phrases or words included in the text information to be output as voice according to the set age.
  • Further, the information processing apparatus 1 according to the present embodiment generates content data in which the volume, speed, etc. of the avatar's speech are changed according to the set age.
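A simple realization of the region- and age-dependent adjustments described above is a lookup table of replacement expressions per region together with per-age speech parameters. The sketch below is only illustrative; the table contents, region codes, and function names are assumptions and not part of the embodiment.

```python
# Hypothetical replacement table: canonical phrase -> expression per region.
REGIONAL_PHRASES = {
    "elevator": {"US": "elevator", "UK": "lift", "AU": "lift"},
    "soccer":   {"US": "soccer",   "UK": "football", "AU": "football"},
}

# Hypothetical speech parameters per age group.
AGE_SPEECH_PARAMS = {
    "20s": {"volume": 1.0, "speed": 1.0},
    "60s": {"volume": 1.2, "speed": 0.85},  # louder and slower for illustration
}

def localize_text(text: str, region: str) -> str:
    """Replace known phrases with the expression used in the set region."""
    for phrase, by_region in REGIONAL_PHRASES.items():
        text = text.replace(phrase, by_region.get(region, phrase))
    return text
```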
  • In the information processing system according to the present embodiment, it is possible to make the avatar perform gestures in the content data. However, the same gesture may have different meanings in different countries. Therefore, the information processing apparatus 1 according to the present embodiment generates content data by changing a gesture that the user has set for the avatar to perform into a gesture according to the region set by the user, as in the sketch below.
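Gesture substitution by region can likewise be expressed as a mapping from a (gesture, region) pair to a replacement gesture, with unmapped combinations keeping the gesture set by the user. The table entries here are purely illustrative assumptions.

```python
# Hypothetical table of region-specific gesture replacements.
GESTURE_BY_REGION = {
    ("thumbs_up", "JP"): "bow",        # assumed substitution for illustration only
    ("ok_sign",   "BR"): "thumbs_up",  # assumed substitution for illustration only
}

def adapt_gesture(gesture: str, region: str) -> str:
    """Return the gesture to actually perform in the given region."""
    return GESTURE_BY_REGION.get((gesture, region), gesture)
```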
  • As described above, the information processing apparatus 1 according to the present embodiment accepts from the user settings such as the region or age for which the content data is provided, and generates content data in which the avatar speaks with pronunciation, accent, phrases, words, volume, speed, and the like according to the set region or age, or in which the avatar performs gestures according to the set region, age, and the like. As a result, the user can expect to easily generate, for example, content data for a different region or age based on content data generated for a specific region or age.
  • Note that, although the information processing apparatus 1 according to the present embodiment accepts the setting of the region or age to which the content data is to be provided, settings of attributes other than these may also be received and reflected in the generation of the content data.
  • The content data generated by the information processing apparatus 1 according to the present embodiment includes, for example, an image (moving image) in which the presentation material is used as a background image and the avatar is placed in front of this background image, and a voice uttered by the avatar.
  • the user can appropriately set the position of the avatar on the screen displayed by reproducing the content data.
  • the information processing apparatus 1 according to the present embodiment can set the sound image of the voice uttered by the avatar according to the display position of the avatar.
  • A sound image is, for example, the position or direction at which a user who plays back the content data and listens to the sound perceives the source of that sound to be located.
  • For example, the position where the avatar is displayed on the screen on which the content data is played back is divided into three positions, the left side, the center, and the right side, and the information processing apparatus 1 sets the sound image of the voice uttered by the avatar to one of three positions, left, center, or right, according to the display position.
  • The information processing apparatus 1 may set the sound image by, for example, adjusting the left and right output levels of the stereo sound.
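Adjusting the left and right output levels is the simplest form of such positioning, for example a constant-power pan between the two channels. The sketch below assumes NumPy and a mono array of voice samples; it is one possible realization, not the embodiment's implementation.

```python
import numpy as np

def pan_stereo(mono: np.ndarray, position: str) -> np.ndarray:
    """Pan a mono voice to 'left', 'center' or 'right' using constant-power gains."""
    pan = {"left": 0.0, "center": 0.5, "right": 1.0}[position]
    angle = pan * np.pi / 2
    left_gain, right_gain = np.cos(angle), np.sin(angle)
    return np.stack([mono * left_gain, mono * right_gain], axis=1)  # shape (N, 2)
```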
  • FIG. 14 is a schematic diagram for explaining the outline of the sound image localization technique. In this technique, four FIR (Finite Impulse Response) filters 121 to 124 are used. The sound of the right channel (R) of the stereo sound is input to the two FIR filters 121 and 122, and the sound of the left channel (L) is input to the two FIR filters 123 and 124.
  • The right channel (R) sound processed by the FIR filter 121 and the left channel (L) sound processed by the FIR filter 123 are added together and output as the sound of a new right channel (R'). Similarly, the right channel (R) sound processed by the FIR filter 122 and the left channel (L) sound processed by the FIR filter 124 are added together and output as the sound of a new left channel (L').
  • By appropriately setting the parameters of the FIR filters 121 to 124, the information processing device 1 can adjust the position of the sound image associated with the uttered voice of the avatar. Further, the information processing apparatus 1 may create and store a plurality of sets of parameters for the FIR filters 121 to 124 in association with, for example, a plurality of positions at which the avatar can be displayed, and read out and use the set of parameters corresponding to the set display position of the avatar.
  • the parameters of the FIR filters 121-124 can be determined using, for example, head-related transfer functions. Since the sound image localization technique using the head-related transfer function is an existing technique, detailed description thereof will be omitted.
  • The information processing device 1 generates content data including the above two outputs, that is, the new left channel (L') and right channel (R') sounds. As a result, a user who plays back this content data can hear the uttered voice of the avatar with a sound image corresponding to the display position of the avatar.
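The 2x2 FIR structure of FIG. 14 can be written directly with convolution: each output channel is the sum of one filtered copy of R and one filtered copy of L. In practice the filter coefficients would be derived from head-related transfer functions; in this sketch h121 to h124 are simply placeholder coefficient arrays supplied by the caller.

```python
import numpy as np

def localize(right: np.ndarray, left: np.ndarray,
             h121: np.ndarray, h122: np.ndarray,
             h123: np.ndarray, h124: np.ndarray):
    """Apply the four FIR filters and mix as in FIG. 14.
    R' = FIR121(R) + FIR123(L),  L' = FIR122(R) + FIR124(L).
    Assumes right and left have the same length."""
    r_new = np.convolve(right, h121)[:len(right)] + np.convolve(left, h123)[:len(left)]
    l_new = np.convolve(right, h122)[:len(right)] + np.convolve(left, h124)[:len(left)]
    return l_new, r_new
```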
  • ⁇ Avatar's facial expression, voice, and gesture interlocking> For example, as shown in FIG. 5, in the information processing system according to this embodiment, the user can set the facial expression of the avatar.
  • The information processing apparatus 1 according to the present embodiment may adjust the pitch, volume, etc. of the voice uttered by the avatar according to the facial expression of the avatar set by the user.
  • For example, when a "smiling face" is set as the facial expression, the information processing device 1 raises the pitch and volume of the voice. Further, for example, when an "angry face" is set as the facial expression, the information processing device 1 lowers the pitch of the voice and raises the volume.
  • The information processing device 1 stores in a database, for example, the correspondence between the facial expression of the avatar and the amounts by which the pitch and volume of the voice are increased or decreased, and acquires the increase/decrease amounts corresponding to the set facial expression from this database. The information processing device 1 can then adjust the pitch and volume according to the facial expression by applying the increase/decrease amounts obtained from the database to the default values of the pitch and volume of the avatar's utterance.
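The correspondence between facial expression and pitch/volume offsets can be kept in a small table and applied to the default utterance parameters. This is a minimal sketch; the offset values and units are illustrative assumptions.

```python
# Hypothetical offsets applied to the default pitch and volume per facial expression.
EXPRESSION_OFFSETS = {
    "smiling": {"pitch": +2.0, "volume": +3.0},   # e.g. semitones / dB, for illustration
    "angry":   {"pitch": -2.0, "volume": +3.0},
    "sad":     {"pitch": -1.0, "volume": -3.0},
}

def adjust_voice(default_pitch: float, default_volume: float, expression: str):
    """Return (pitch, volume) adjusted for the set facial expression."""
    off = EXPRESSION_OFFSETS.get(expression, {"pitch": 0.0, "volume": 0.0})
    return default_pitch + off["pitch"], default_volume + off["volume"]
```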
  • The information processing device 1 may also determine the facial expression of the avatar based on features of the text information, for example when the user selects automatic setting for the facial expression of the avatar. For example, the information processing device 1 determines whether or not the text information spoken by the avatar contains a specific word, keyword, or the like, and if such a word or keyword is contained, determines the facial expression associated with that word or keyword as the facial expression of the avatar. For example, if the text information includes words such as "happy" or "delicious", the information processing device 1 can set the avatar's facial expression to "smiling" and raise the pitch and volume of the avatar's utterance.
  • the information processing device 1 has a database in which specific words or keywords that can be included in text information, for example, are associated with facial expressions of avatars.
  • Similarly, the information processing apparatus 1 may automatically set gestures so that the avatar performs an associated gesture when uttering a specific word or keyword. For example, when the text information includes the word "Wow", the information processing device 1 can cause the avatar to make a gesture of opening its mouth and eyes wide and moving its hands. Further, for example, when the text information includes the word "No", the information processing device 1 can cause the avatar to make a gesture of shaking its head sideways.
  • the information processing device 1 has a database in which specific words or keywords that can be included in text information, for example, are associated with avatar gestures.
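Keyword-driven automatic setting of facial expressions and gestures can be sketched as a dictionary lookup over the words appearing in a text. The entries mirror the examples in the description; everything else (names, gesture labels) is an assumption.

```python
KEYWORD_EXPRESSIONS = {"happy": "smiling", "delicious": "smiling"}
KEYWORD_GESTURES    = {"wow": "eyes_wide_hands_up", "no": "shake_head"}

def annotate_text(text: str):
    """Return the facial expression and gestures suggested by keywords in the text."""
    lowered = text.lower()
    expression = next((e for k, e in KEYWORD_EXPRESSIONS.items() if k in lowered), None)
    gestures = [g for k, g in KEYWORD_GESTURES.items() if k in lowered]
    return expression, gestures

# Example: annotate_text("Wow, this is delicious!") -> ("smiling", ["eyes_wide_hands_up"])
```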
  • The information processing device 1 may also determine the facial expression of the avatar based on the text information by using, for example, a learning model that has undergone machine learning in advance so as to estimate the emotion of input text information. For example, by performing supervised learning of the learning model using learning data (teacher data) that associates text information with emotions, a learning model that estimates an emotion for input text information can be generated.
  • The information processing device 1 can input the text information corresponding to the utterance content of the avatar into this learning model, acquire the emotion estimation result output by the learning model, and determine the facial expression associated with the estimated emotion as the facial expression of the avatar.
  • the information processing apparatus 1 may input all text information prepared for generating one piece of content data to the learning model.
  • Alternatively, each set of sentences up to the point where an interval is inserted may be input to the learning model, or text information in any other unit or amount may be input to the learning model.
  • Machine learning processing for generating a learning model may be performed by a device different from the information processing device 1 .
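The supervised learning model described above can be prototyped, for example, as a bag-of-words text classifier. The sketch below uses scikit-learn as one possible realization; the teacher data, emotion labels, and the expression mapping are placeholders, not taken from the embodiment.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Placeholder teacher data: pairs of text and emotion label.
texts = ["I am so happy today", "This is terrible news", "Nothing special happened"]
emotions = ["joy", "anger", "neutral"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, emotions)

# At content-generation time, estimate the emotion of the avatar's next utterance
# and map it to a facial expression.
estimated = model.predict(["We are delighted to announce the results"])[0]
facial_expression = {"joy": "smiling", "anger": "angry", "neutral": "neutral"}[estimated]
```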
  • the information processing apparatus 1 generates content data in which the pitch or volume of the avatar's utterance is adjusted in accordance with the settings related to the facial expression of the avatar.
  • the information processing device 1 also estimates the emotion of the person who speaks the text information based on the text information corresponding to the utterance content of the avatar, and sets the facial expression of the avatar according to the estimated emotion.
  • The information processing apparatus 1 also determines features of the sentences included in the text information, such as whether or not a specific word or keyword is included, and generates content data in which the avatar speaks at a pitch and volume according to the determined features.
  • The information processing device 1 also stores in a database the correspondence between words that can be included in the text information and facial expressions or gestures of the avatar, and generates content data in which the avatar makes the corresponding facial expression or gesture when uttering those words.
  • the information processing apparatus 1 can be expected to link the expression of the avatar displayed in the content data, the pitch and volume of the uttered voice, the gesture, the content of the utterance, and the like.
  • The content data generated by the information processing apparatus 1 described above is so-called moving image content, that is, content that is simply viewed by the viewer.
  • the information processing apparatus 1 according to the present embodiment may receive information input from the viewer in the middle of the content, for example, and generate interactive content in which the content is switched according to the input information.
  • For example, a test for checking proficiency is conducted while a moving image such as a lecture is being output, and the moving image to be output next is switched according to the score of this test.
  • In this case, the information processing apparatus 1 accepts from the user, for example, an operation of creating test questions and an operation of creating answers to the test, and creates content that outputs the questions to the viewer and receives answers from the viewer.
  • the information processing apparatus 1 also receives from the user settings such as a scoring method based on responses received from viewers and switching conditions for switching a plurality of contents according to the scoring results.
  • FIG. 15 is a schematic diagram for explaining an example of a content switching setting method.
  • In this example, content switching is set on the speech editing screen. A line spoken by the avatar is set as the first item of the content, and "<Proficiency check test>" is set as the second item.
  • When the content reaches this second item, a proficiency check test prepared in advance is carried out.
  • the proficiency level confirmation test to be carried out at this time is created in advance by the user, for example, on a test content creation screen separately displayed by the information processing apparatus 1, or the like.
  • On the test content creation screen, for example, when a four-choice question is given as a test, the user inputs the question text, the sentences of the four options, and the answer indicating which of the options is correct, and the information processing apparatus 1 generates test content based on the received information.
  • the user can give an arbitrary name to the test content, and in this example, the name "proficiency level confirmation test" is given.
  • The information processing apparatus 1 may also receive from the user settings such as the points for each question and a formula for calculating the total score, and generate the test content accordingly.
  • In the third item following the "<Proficiency check test>" item, a content switching condition such as "Branch if score < 80 goto No.11" is set.
  • the score of the proficiency check test is stored in the variable score, and if the score is less than 80, it is set to branch to the 11th item of the content.
  • the description method of the branch condition shown in FIG. 15 is an example, and content switching may be set in any format.
  • If the score of the proficiency check test set as the third item is 80 points or more, content corresponding to the items from the fourth item onward is output, in which the avatar speaks lines such as "Let's move on.". If the score of the proficiency check test is less than 80 points, the 4th to 10th items are not output, and content corresponding to the 11th item is output, in which the avatar speaks a line such as "Start a repair course".
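A branch description such as "Branch if score < 80 goto No.11" can be handled by a small parser that extracts the variable, the threshold and the jump target, and returns the index of the next item to output. The grammar assumed below covers only the single example shown in FIG. 15 and is an illustrative sketch, not the embodiment's actual format.

```python
import re

BRANCH_RE = re.compile(r"Branch if (\w+)\s*<\s*(\d+)\s*goto No\.(\d+)", re.IGNORECASE)

def next_item(current_index: int, item_text: str, variables: dict) -> int:
    """Return the index of the next content item, applying a branch if one is set."""
    m = BRANCH_RE.fullmatch(item_text.strip())
    if m:
        var, threshold, target = m.group(1), int(m.group(2)), int(m.group(3))
        if variables.get(var, 0) < threshold:
            return target          # e.g. jump to item No.11
    return current_index + 1       # otherwise continue with the following item

# Example: the viewer scored 65 on the proficiency check test.
print(next_item(3, "Branch if score < 80 goto No.11", {"score": 65}))  # -> 11
```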
  • <Material> FIGS. 16 to 44 are materials related to the information processing system according to this embodiment.
  • As described above, the information processing apparatus 1 according to the present embodiment acquires presentation materials created in advance, acquires text information related to the presentation voice, receives settings related to the presenter's avatar, and generates content data in which the avatar is displayed together with the presentation materials and the avatar utters a voice corresponding to the text information. Accordingly, the information processing apparatus 1 can be expected to support the user's generation of content data for presentation.
  • the information processing apparatus 1 acquires the voice information related to the presentation of the presenter, and acquires the text information by converting the acquired voice information. Accordingly, the information processing apparatus 1 can be expected to reduce the user's burden of creating text information.
  • the information processing apparatus 1 also receives settings related to the pronunciation of words included in the text information, and generates content data in which the avatar utters the words with the pronunciation according to the received settings. At this time, the information processing apparatus 1 displays text information, accepts word selection from the user, and displays the phonetic notation of the selected word. The information processing device 1 also accepts corrections to the displayed phonetic notation and causes the avatar to utter words in the corrected phonetic notation. As a result, the information processing apparatus 1 can be expected to facilitate the user's operation of setting the pronunciation of words.
  • The information processing apparatus 1 also accepts, for a plurality of texts included in the acquired text information, the setting of an interval at which the voice corresponding to each text is output, or the setting of the avatar that speaks each text, and generates content data in which the avatar speaks the texts according to the received settings.
  • the information processing apparatus 1 also receives settings for facial expressions or gestures of the avatar when uttering text, and generates content data in which the avatar speaks with facial expressions or gestures according to the received settings. Accordingly, the information processing apparatus 1 can be expected to facilitate the user's avatar setting operation.
  • The information processing apparatus 1 also displays a content editing screen that includes a content editing area 112 (first area) in which the avatar is superimposed and displayed on a background image based on the presentation material, an avatar setting area 113 (second area) that displays setting items related to the avatar displayed in the content editing area 112, and a background image selection area 111 (third area) that displays a plurality of images included in the presentation material.
  • the avatar setting area 113 includes, for example, setting items for gestures performed by the avatar, setting items for the position of the avatar, setting items for the orientation of the avatar, setting items for the size of the avatar, and the like.
  • the avatar setting area 113 may be provided with setting items such as the pitch, speed, depth, or volume of the voice uttered by the avatar.
  • the information processing apparatus 1 can be expected to facilitate the user's setting operation regarding the background image and the avatar based on the presentation material.
  • The information processing apparatus 1 receives an input of a caption character string to be displayed together with a background image based on the presentation material, and generates content data in which the received character string is displayed together with the background image of the presentation material. Thereby, the information processing apparatus 1 can generate content data to which various information is added in addition to presentation materials and utterances of avatars.
  • A computer program can be deployed to be executed on a single computer, or on multiple computers located at one site or distributed across multiple sites and interconnected by a communication network.

Abstract

Provided is a computer program, an information processing method, and an information processing device, by which support for generating contents data for presentation can be expected. The computer program according to the present embodiment causes a computer to perform processing for acquiring presentation materials, acquiring text information for a presentation voice, receiving settings for the avatar of a presenter, and generating contents data in which the avatar is displayed with the presentation materials and the voice corresponding to the text information is uttered by the avatar. The computer program may acquire voice information for the presentation of the presenter and convert the voice information to acquire the text information. The computer program may receive settings for the pronunciations of words included in the text information and generate the contents data in which the avatar utters the words with the pronunciations corresponding to the received settings.

Description

コンピュータプログラム、情報処理方法及び情報処理装置Computer program, information processing method and information processing apparatus
 本発明は、発表のためのコンテンツデータを生成するコンピュータプログラム、情報処理方法及び情報処理装置に関する。 The present invention relates to a computer program, an information processing method, and an information processing apparatus for generating content data for presentation.
 特許文献1においては、集合住宅の大規模修繕工事に係るプレゼンテーションシステムが提案されている。このプレゼンテーションシステムは、テキスト及び静止画の組み合わせからなる図解資料データと、施工対象物件における現実の事前調査状況を記録した調査状況動画データと、施工対象物件に類似する疑似施工物件における現実の施工状況を記録した疑似体験動画データとを有し、図解資料データに基づき、図解資料信号を生成して表示装置に発信し、調査状況動画データ及び疑似体験動画データに基づき、それぞれ調査状況動画信号と疑似体験動画信号を生成して表示装置に発信する。 Patent Document 1 proposes a presentation system related to large-scale repair work for collective housing. This presentation system consists of illustrated material data consisting of a combination of text and still images, survey situation video data that records the actual preliminary survey situation of the construction target property, and the actual construction situation of a pseudo-construction property similar to the construction target property. Based on the illustrated material data, an illustrated material signal is generated and transmitted to the display device, and based on the investigation situation video data and the simulated experience video data, the investigation situation video signal and the simulated An experience video signal is generated and transmitted to the display device.
特開2021-68265号公報Japanese Patent Application Laid-Open No. 2021-68265
 例えば顧客に対するプレゼンテーションのように、発表者が複数の聴衆に対する発表を行う場合、発表者が複数ページの画像で構成した発表用の資料を予め作成し、作成した発表資料をディスプレイ又はプロジェクタ等により順に表示し、表示されたページに関する情報等を発表者が説明することが従来行われていた。近年では、発表の際に静止画像を表示するのみでなく、動画像及び音声等の出力を行うなど、発表者は様々な工夫を凝らした発表を行っている。しかしこのような発表は、発表者が静止画像、動画像及び音声等の様々なデータを予め作成する必要があり、誰もが容易に行うことができるものではない。 For example, when a presenter makes a presentation to a plurality of audiences, such as a presentation to a customer, the presenter prepares in advance presentation materials composed of images of multiple pages, and sequentially displays the created presentation materials on a display or projector. Conventionally, the presenter explained the information on the displayed page and the like. In recent years, presenters have been making presentations with various ingenuity, such as outputting moving images and sound in addition to displaying still images at the time of presentation. Such a presentation, however, requires the presenter to prepare various data such as still images, moving images, and voices in advance, which is not something that anyone can easily do.
 本発明は、斯かる事情に鑑みてなされたものであって、その目的とするところは、発表のためのコンテンツデータの生成を支援することが期待できるコンピュータプログラム、情報処理方法及び情報処理装置を提供することにある。 The present invention has been made in view of such circumstances, and its object is to provide a computer program, an information processing method, and an information processing apparatus that can be expected to support the generation of content data for presentation. to provide.
 一実施形態に係るコンピュータプログラムは、コンピュータに、発表資料を取得し、発表音声に係るテキスト情報を取得し、発表者のアバターに係る設定を受け付け、前記発表資料と共に前記アバターが表示され且つ前記テキスト情報に対応する音声を前記アバターが発話するコンテンツデータを生成する処理を実行させる。 A computer program according to one embodiment acquires presentation materials in a computer, acquires text information related to presentation audio, receives settings related to a presenter's avatar, displays the avatar together with the presentation materials, and displays the text information. A process of generating content data in which the avatar utters a voice corresponding to information is executed.
 一実施形態による場合は、発表のためのコンテンツデータの生成を支援することが期待できる。 According to one embodiment, it can be expected to support generation of content data for presentation.
FIG. 1 is a schematic diagram for explaining an overview of an information processing system according to an embodiment.
FIG. 2 is a schematic diagram for explaining an overview of the information processing system according to the embodiment.
FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to the embodiment.
FIG. 4 is a flowchart showing the procedure of content data generation processing performed by the information processing apparatus according to the embodiment.
FIG. 5 is a schematic diagram showing an example of an utterance editing screen.
FIG. 6 is a schematic diagram for explaining a pronunciation correcting operation.
FIG. 7 is a schematic diagram showing an example of a pronunciation correction dialog box.
FIG. 8 is a schematic diagram showing an example of a content editing screen.
FIG. 9 is a schematic diagram showing an example of a content editing screen provided with a caption setting area.
FIG. 10 is a schematic diagram showing another example of the utterance editing screen.
FIGS. 11 to 13 are schematic diagrams for explaining examples of a camerawork setting method.
FIG. 14 is a schematic diagram for explaining an overview of sound image localization technology.
FIG. 15 is a schematic diagram for explaining an example of a content switching setting method.
FIGS. 16 to 44 are materials related to the information processing system according to the present embodiment.
 本発明の実施形態に係る情報処理システムの具体例を、以下に図面を参照しつつ説明する。なお、本発明はこれらの例示に限定されるものではなく、請求の範囲によって示され、請求の範囲と均等の意味及び範囲内でのすべての変更が含まれることが意図される。 A specific example of the information processing system according to the embodiment of the present invention will be described below with reference to the drawings. The present invention is not limited to these exemplifications, but is indicated by the scope of the claims, and is intended to include all modifications within the meaning and scope of equivalents to the scope of the claims.
<システム概要>
 図1及び図2は、本実施の形態に係る情報処理システムの概要を説明するための模式図である。本実施の形態に係る情報処理システムでは、プレゼンテーション等の発表を行う発表者は、従来と同様に、例えばいわゆるプレゼンテーションソフトウェアを利用して発表資料を予め作成する。この発表資料には複数ページの画像等が含まれ、これらを順に表示すること(いわゆるスライドショー)により発表が行われる。図示の例では、複数の参加者がネットワークを介して参加するオンライン会議において発表資料を用いた発表者によるオンラインプレゼンテーションが行われている。本実施の形態に係る情報処理システムでは、この発表資料中に記載された文章から音声情報が抽出される(図1参照)。また情報処理システムでは、発表者が行ったオンラインプレゼンテーションの音声を録音することで音声情報の取得が行われてもよい(図2参照)。なお音声のみでなく、映像と音声とが共に記録されてもよい。また発表者による発表はオンラインプレゼンテーションでなくてよく、例えばオフラインでの発表の音声を録音機器にて録音してもよい。
<System overview>
1 and 2 are schematic diagrams for explaining an outline of an information processing system according to this embodiment. In the information processing system according to the present embodiment, a presenter who gives a presentation or the like prepares presentation materials in advance using, for example, so-called presentation software, as in the conventional art. This presentation material includes images of a plurality of pages, etc., and the presentation is made by displaying these in order (so-called slide show). In the illustrated example, an online presentation is given by a presenter using presentation materials in an online conference in which a plurality of participants participate via a network. In the information processing system according to the present embodiment, voice information is extracted from the text written in the presentation material (see FIG. 1). Further, in the information processing system, audio information may be acquired by recording the audio of the online presentation given by the presenter (see FIG. 2). It should be noted that not only audio but also video and audio may be recorded together. Also, the presentation by the presenter does not have to be an online presentation, and for example, the voice of the offline presentation may be recorded with a recording device.
 本実施の形態に係る情報処理システムでは、発表者の発表の音声を録音した音声情報を利用することができ、この場合に情報処理システムは音声認識処理により音声情報をテキスト情報に変換する。本実施の形態に係る情報処理システムは、発表者が作成した発表資料と、発表資料に含まれる文章等のテキスト情報及び/又は音声情報を変換したテキスト情報とを基に、発表者のアバターが発表資料を用いた発表を行う動画像を含むコンテンツデータを生成する。生成されたコンテンツデータを例えばディスプレイに表示する、プロジェクタで投影する、又は、動画像配信サイト等にて配信することにより、発表者は同じ内容の発表を繰り返して行う必要がなくなる。 In the information processing system according to the present embodiment, voice information obtained by recording the voice of the presenter's presentation can be used. In this case, the information processing system converts the voice information into text information by voice recognition processing. In the information processing system according to the present embodiment, based on the presentation material created by the presenter and text information such as sentences included in the presentation material and/or text information converted from voice information, the presenter's avatar Generate content data including a moving image of a presentation using presentation materials. By displaying the generated content data on a display, projecting it with a projector, or distributing it on a moving image distribution site or the like, the presenter does not have to repeatedly present the same content.
 また上記の例では、発表者の発表を録音した音声情報をテキスト情報に変換しているが、これに限るものではない。発表者は、発表の際に話す台詞をテキスト情報として作成してもよい。この場合に情報処理システムは、発表者が予め作成した発表資料及びテキスト情報を取得し、これらを基にコンテンツデータを生成する。即ち、情報処理システムがコンテンツデータの生成に用いるテキスト情報は、音声認識により音声情報から変換されたものであるか否かが問われない。発表者は、テキスト情報を予め生成することにより、自らが発表を行うことなく、自らのアバターが発表を行うコンテンツデータを生成することができる。 Also, in the above example, the voice information recorded by the presenter is converted into text information, but it is not limited to this. The presenter may create text information of lines to be spoken at the time of presentation. In this case, the information processing system acquires presentation materials and text information prepared in advance by the presenter, and generates content data based on these. That is, it does not matter whether text information used by the information processing system to generate content data is converted from voice information by voice recognition. By generating text information in advance, the presenter can generate content data in which his or her avatar presents without presenting themselves.
 本実施の形態に係る情報処理システムは、上記の発表資料及びテキスト情報を取得し、テキスト情報を合成音声により読み上げた音声情報を生成する。また情報処理システムは、例えば予め用意された複数のアバターのデータから選択された発表者のアバターに、生成した音声情報に対応した口の動き及びジェスチャー等を行わせることによって、アバターが発表に関する台詞を発話する態様の映像データを生成する。また情報処理システムは、取得した発表資料に含まれる複数ページのスライド等の画像を背景画像データとして用い、この背景画像にアバターの映像を重畳し、音声情報を加えることによって、コンテンツデータを生成する。コンテンツデータは、例えば動画像ファイルとして出力され、適宜のディスプレイ装置もしくはプロジェクタ等による表示、又は、動画像配信サイト等での配信に用いることができる。 The information processing system according to the present embodiment acquires the above presentation materials and text information, and generates voice information by reading out the text information using synthesized voice. In addition, the information processing system allows the avatar of the presenter, which is selected from data of a plurality of avatars prepared in advance, to perform mouth movements and gestures corresponding to the generated voice information, so that the avatar can express lines related to the presentation. generates video data in a mode of uttering In addition, the information processing system uses images such as multiple pages of slides included in the acquired presentation material as background image data, superimposes the avatar video on this background image, and adds audio information to generate content data. . The content data is output as, for example, a moving image file, and can be used for display on an appropriate display device, projector, or the like, or for distribution on a moving image distribution site or the like.
 本実施の形態に係る情報処理システムは、このコンテンツデータの生成に関して例えばアバターの外観、アバターが行うジェスチャー、アバターの発話として出力される音声の特徴、又は、音声出力される単語の発音等の種々の設定を発表者から受け付け、この設定を反映したコンテンツデータを生成する。これにより情報処理システムは、発表者の好み及び目的等に適したコンテンツデータの生成を支援することが期待できる。 In the information processing system according to the present embodiment, regarding the generation of this content data, for example, the appearance of the avatar, the gestures performed by the avatar, the characteristics of the voice output as the utterance of the avatar, or the pronunciation of the words output as voice. setting is received from the presenter, and content data reflecting this setting is generated. As a result, the information processing system can be expected to support generation of content data suitable for the presenter's preference and purpose.
<装置構成>
 図3は、本実施の形態に係る情報処理装置の一構成例を示すブロック図である。本実施の形態に係る情報処理装置1は、処理部11、記憶部(ストレージ)12、通信部(トランシーバ)13、表示部(ディスプレイ)14及び操作部15等を備えて構成されている。本実施の形態に係る情報処理装置1は、例えばパーソナルコンピュータ又はタブレット型端末装置等の汎用的な情報処理装置を用いて構成され得る。なお本実施の形態においては、1つの情報処理装置1にて処理が行われるものとして説明を行うが、複数の情報処理装置が分散して処理を行ってもよい。また以下において情報処理装置1を利用するユーザには、発表者を想定して説明を行うが、これに限るものではなく、情報処理装置1を利用してコンテンツデータを生成する作業を行うユーザは発表者以外であってもよい。
<Device configuration>
FIG. 3 is a block diagram showing a configuration example of an information processing apparatus according to this embodiment. The information processing apparatus 1 according to the present embodiment includes a processing unit 11, a storage unit (storage) 12, a communication unit (transceiver) 13, a display unit (display) 14, an operation unit 15, and the like. The information processing device 1 according to the present embodiment can be configured using a general-purpose information processing device such as a personal computer or a tablet terminal device. In this embodiment, one information processing apparatus 1 performs the processing, but a plurality of information processing apparatuses may perform the processing in a distributed manner. In the following description, the user who uses the information processing device 1 is assumed to be a presenter, but the presenter is not limited to this. It may be someone other than the presenter.
 処理部11は、CPU(Central Processing Unit)、MPU(Micro-Processing Unit)、GPU(Graphics Processing Unit)又は量子プロセッサ等の演算処理装置、ROM(Read Only Memory)及びRAM(Random Access Memory)等を用いて構成されている。処理部11は、記憶部12に記憶されたプログラム12aを読み出して実行することにより、発表資料及びテキスト情報等を取得する処理、ユーザによる各種の設定を受け付ける処理、並びに、取得した情報及び受け付けた設定に基づいてコンテンツデータを生成する処理等の種々の処理を行う。 The processing unit 11 includes an arithmetic processing unit such as a CPU (Central Processing Unit), MPU (Micro-Processing Unit), GPU (Graphics Processing Unit) or quantum processor, ROM (Read Only Memory), RAM (Random Access Memory), etc. It is configured using By reading and executing the program 12a stored in the storage unit 12, the processing unit 11 performs processing for acquiring presentation materials, text information, etc., processing for accepting various settings by the user, and processing for acquired information and received information. Various processes such as a process of generating content data based on the settings are performed.
 記憶部12は、例えばハードディスク等の大容量の記憶装置を用いて構成されている。記憶部12は、処理部11が実行する各種のプログラム、及び、処理部11の処理に必要な各種のデータを記憶する。本実施の形態において記憶部12は、処理部11が実行するプログラム12aを記憶する。また記憶部12には、情報処理装置1が生成したコンテンツデータを記憶するコンテンツデータ記憶部12bが設けられている。 The storage unit 12 is configured using a large-capacity storage device such as a hard disk. The storage unit 12 stores various programs executed by the processing unit 11 and various data required for processing by the processing unit 11 . In the present embodiment, the storage unit 12 stores a program 12a executed by the processing unit 11. FIG. The storage unit 12 is also provided with a content data storage unit 12b that stores content data generated by the information processing apparatus 1 .
 本実施の形態においてプログラム(コンピュータプログラム、プログラム製品)12aは、メモリカード又は光ディスク等の記録媒体99に記録された態様で提供され、情報処理装置1は記録媒体99からプログラム12aを読み出して記憶部12に記憶する。ただし、プログラム12aは、例えば情報処理装置1の製造段階において記憶部12に書き込まれてもよい。また例えばプログラム12aは、遠隔のサーバ装置等が配信するものを情報処理装置1が通信にて取得してもよい。例えばプログラム12aは、記録媒体99に記録されたものを書込装置が読み出して情報処理装置1の記憶部12に書き込んでもよい。プログラム12aは、ネットワークを介した配信の態様で提供されてもよく、記録媒体99に記録された態様で提供されてもよい。 In the present embodiment, the program (computer program, program product) 12a is provided in a form recorded in a recording medium 99 such as a memory card or an optical disk, and the information processing apparatus 1 reads out the program 12a from the recording medium 99 and stores it in the storage unit. 12. However, the program 12a may be written in the storage unit 12 at the manufacturing stage of the information processing device 1, for example. Further, for example, the program 12a may be distributed by a remote server device or the like and acquired by the information processing device 1 through communication. For example, the program 12 a may be recorded in the recording medium 99 and read by a writing device and written in the storage unit 12 of the information processing device 1 . The program 12 a may be provided in the form of distribution via a network, or may be provided in the form of being recorded on the recording medium 99 .
 コンテンツデータ記憶部12bは、発表資料及びテキスト情報等の情報に基づいて情報処理装置1が生成したコンテンツデータを記憶する。コンテンツデータは例えばMPEG-4形式の動画像のファイルとしてコンテンツデータ記憶部12bに記憶される。コンテンツデータ記憶部12bは、この動画像のファイルと共に、例えば発表のタイトル、発表者名、発表日時又は発表内容の概要等の種々の情報を記憶してよい。 The content data storage unit 12b stores content data generated by the information processing device 1 based on information such as presentation materials and text information. The content data is stored in the content data storage unit 12b as a moving image file in MPEG-4 format, for example. The content data storage unit 12b may store various information such as the title of the presentation, the name of the presenter, the date and time of the presentation, or an overview of the contents of the presentation, together with the moving image file.
 通信部13は、例えばインターネット、LAN(Local Area Network)又は携帯電話通信網等を含むネットワークNを介して、種々の装置との間で通信を行う。情報処理装置1は、通信部13にて他の装置との通信を行うことにより、例えばプログラム12aの取得(ダウンロード)、オンラインプレゼンテーションの実施、及び、生成したコンテンツデータの配信等の処理を行うことができる。通信部13は、処理部11から与えられたデータを他の装置へ送信すると共に、他の装置から受信したデータを処理部11へ与える。 The communication unit 13 communicates with various devices via a network N including, for example, the Internet, a LAN (Local Area Network), or a mobile phone communication network. The information processing device 1 performs processing such as acquisition (downloading) of the program 12a, implementation of an online presentation, distribution of generated content data, etc., by communicating with other devices through the communication unit 13. can be done. The communication unit 13 transmits the data given from the processing unit 11 to other devices, and gives the data received from the other devices to the processing unit 11 .
 表示部14は、液晶ディスプレイ等を用いて構成されており、処理部11の処理に基づいて種々の画像及び文字等を表示する。操作部15は、ユーザの操作を受け付け、受け付けた操作を処理部11へ通知する。例えば操作部15は、機械式のボタン又は表示部14の表面に設けられたタッチパネル等の入力デバイスによりユーザの操作を受け付ける。また例えば操作部15は、マウス及びキーボード等の入力デバイスであってよく、これらの入力デバイスは情報処理装置1に対して取り外すことが可能な構成であってもよい。 The display unit 14 is configured using a liquid crystal display or the like, and displays various images, characters, etc. based on the processing of the processing unit 11. The operation unit 15 receives a user's operation and notifies the processing unit 11 of the received operation. For example, the operation unit 15 receives a user's operation using an input device such as mechanical buttons or a touch panel provided on the surface of the display unit 14 . Further, for example, the operation unit 15 may be an input device such as a mouse and a keyboard, and these input devices may be detachable from the information processing apparatus 1 .
 また本実施の形態に係る情報処理装置1には、記憶部12に記憶されたプログラム12aを処理部11が読み出して実行することにより、情報取得部11a、アバターデータ生成部11b、音声データ生成部11c、背景データ生成部11d、コンテンツデータ生成部11e及び表示処理部11f等が、ソフトウェア的な機能部として処理部11に実現される。なお本図においては、処理部11の機能部として、コンテンツデータの生成に関連する機能部を図示し、これ以外の処理に関する機能部は図示を省略している。 Further, in the information processing apparatus 1 according to the present embodiment, the program 12a stored in the storage unit 12 is read out and executed by the processing unit 11, whereby the information acquisition unit 11a, the avatar data generation unit 11b, the voice data generation unit 11c, the background data generation unit 11d, the content data generation unit 11e, the display processing unit 11f, and the like are implemented in the processing unit 11 as software functional units. In this figure, as the functional units of the processing unit 11, functional units related to content data generation are illustrated, and functional units related to other processes are omitted.
 情報取得部11aは、コンテンツデータの生成に必要な発表資料及びテキスト情報等の情報を取得する処理を行う。例えば情報取得部11aは、ユーザが予め作成した発表資料の情報を取得する。本実施の形態においてユーザは、例えば既存のプレゼンテーションソフトウェア等を利用して、発表内容をまとめた文章、グラフ及びイラスト等を含む複数ページの画像(スライド)を発表資料として予め作成する。発表資料の作成は、情報処理装置1にて行われてもよく、他の装置にて行われてもよい。本実施の形態においては、ユーザが予め作成した発表資料のデータを情報処理装置1の記憶部12に記憶し、情報取得部11aは、記憶部12に記憶された発表資料を読み出すことで、発表資料を取得する。 The information acquisition unit 11a performs processing for acquiring information such as presentation materials and text information necessary for generating content data. For example, the information acquisition unit 11a acquires information on presentation materials prepared in advance by the user. In this embodiment, the user prepares in advance a multi-page image (slides) including sentences summarizing the content of the presentation, graphs, illustrations, etc., as a presentation material using, for example, existing presentation software. The presentation material may be created by the information processing device 1 or by another device. In the present embodiment, the data of the presentation material prepared in advance by the user is stored in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a reads out the presentation material stored in the storage unit 12 to perform the presentation. Get materials.
 また例えば情報取得部11aは、生成するコンテンツデータにおいて発表者のアバターが発話する台詞に相当するテキスト情報を取得する。本実施の形態においては、上記の発表資料に記載された文章、文字又は単語等がアバターの台詞として用いられ得る。この場合には、例えば発表資料に含まれる文章等のいずれをアバターの台詞として用いるかを、この発表資料を作成するプレゼンテーションソフトウェアのコメント機能等を利用して作成者が予め設定する。情報取得部11aは、発表資料に付されたコメント等を認識して、発表資料に含まれる文章等をアバターの台詞とするためのテキスト情報として抽出することができる。 Also, for example, the information acquisition unit 11a acquires text information corresponding to the lines spoken by the presenter's avatar in the generated content data. In the present embodiment, sentences, characters, words, or the like described in the presentation material can be used as the avatar's lines. In this case, for example, the creator sets in advance which of the sentences included in the presentation material is to be used as the speech of the avatar using the comment function or the like of the presentation software for creating this presentation material. The information acquisition unit 11a can recognize comments and the like attached to the presentation material and extract sentences and the like included in the presentation material as text information for making the avatar's lines.
 また例えば情報取得部11aは、発表資料に基づいて発表者が実際に発表を行った際にこの発表者が話した台詞を、コンテンツデータにおいてアバターが発話する台詞とすることもできる。この場合には、発表者がオンラインプレゼンテーション等で発表を行う際に録画又は録音が行われ、発表者が話した台詞を含む音声情報が予め用意される。ユーザはこの音声情報を情報処理装置1の記憶部12に記憶し、情報取得部11aは、記憶部12に記憶された音声情報を読み出すことで、音声情報を取得する。音声情報を取得した情報取得部11aは、例えばこの音声情報に対していわゆる音声認識処理を行い、音声情報をテキスト情報に変換することによって、テキスト情報を取得する。なお音声認識処理による音声情報からテキスト情報への変換は、既存の技術であるため、詳細な説明を省略する。情報処理装置1は、音声認識処理を自ら行ってもよく、音声認識処理を行う他の装置に音声情報を送信し、他の装置が音声認識処理により変換したテキスト情報を取得してもよい。 Also, for example, the information acquisition unit 11a can use the lines spoken by the presenter when the presenter actually made a presentation based on the presentation material as the lines spoken by the avatar in the content data. In this case, when the presenter gives an online presentation or the like, the presentation is video-recorded or recorded, and voice information including the lines spoken by the presenter is prepared in advance. The user stores this voice information in the storage unit 12 of the information processing apparatus 1, and the information acquisition unit 11a acquires the voice information by reading out the voice information stored in the storage unit 12. FIG. The information acquisition unit 11a that has acquired the voice information acquires text information by, for example, performing so-called voice recognition processing on this voice information and converting the voice information into text information. Since the conversion from voice information to text information by voice recognition processing is an existing technology, detailed description thereof will be omitted. The information processing device 1 may perform the speech recognition processing itself, or may transmit the speech information to another device that performs the speech recognition processing, and acquire the text information converted by the speech recognition processing by the other device.
 またテキスト情報の取得方法は、上述のような発表資料に含まれる文章等からのテキスト情報の抽出及び発表者が実際に発表した音声情報に基づくテキスト情報の取得等の方法の他に、例えばアバターの台詞となるテキスト情報をユーザが直接的に作成したものを取得するという方法も採用されてよい。この場合にユーザは、例えばテキストエディタ又は文章作成ソフトウェア等を用いて、アバターの台詞に相当する文章を作成し、テキスト情報として記憶部12に記憶する。テキスト情報の作成は、情報処理装置1にて行われてもよく、他の装置にて行われてもよい。ユーザは作成したテキスト情報を記憶部12に記憶し、情報取得部11aは、記憶部12に記憶されたテキスト情報を読み出すことで、テキスト情報を取得することができる。 In addition to the above-described methods of extracting text information from sentences included in presentation materials and obtaining text information based on voice information actually announced by the presenter, the method of acquiring text information includes, for example, avatar It is also possible to adopt a method of obtaining text information that is directly created by the user. In this case, the user creates sentences corresponding to the avatar's dialogue using, for example, a text editor or sentence creation software, and stores the sentences in the storage unit 12 as text information. The text information may be created by the information processing device 1 or by another device. The user stores the created text information in the storage unit 12, and the information acquisition unit 11a can acquire the text information by reading the text information stored in the storage unit 12. FIG.
 アバターデータ生成部11bは、コンテンツデータに登場する発表者のアバターに関するデータを生成する処理を行う。アバターデータ生成部11bは、例えばデータベースに記憶された複数のアバターに関する情報を表示部14に一覧表示して、ユーザからアバターの選択を受け付ける。アバターデータ生成部11bは、選択されたアバターのデータをデータベースから取得し、取得したデータを基にこのアバターの外観を示すプレビュー画面を表示部14に表示する。アバターデータ生成部11bは、このプレビュー画面にてアバターの色又は形状等の編集操作をユーザから受け付けて、編集されたアバターをコンテンツデータに登場させるアバターとする。またアバターデータ生成部11bは、生成するコンテンツデータにおいてアバターを表示する位置、アバターの向き又はアバターが行う動き(ジェスチャー)等の種々の設定をユーザから受け付けて、受け付けた設定をプレビュー画面のアバターに反映させる。アバターデータ生成部11bは、アバターの形状等のデータと、このアバターの表示位置等の設定とを含むアバターデータを生成して、記憶部12に記憶する。なお、コンテンツデータには複数のアバターが登場してよく、この場合にアバターデータ生成部11bは、複数のアバターについてアバターデータを生成する。 The avatar data generation unit 11b performs processing for generating data related to the presenter's avatar appearing in the content data. The avatar data generation unit 11b displays, for example, a list of information about a plurality of avatars stored in the database on the display unit 14, and receives selection of an avatar from the user. The avatar data generation unit 11b acquires data of the selected avatar from the database, and displays a preview screen showing the appearance of the avatar on the display unit 14 based on the acquired data. The avatar data generation unit 11b accepts an editing operation such as the color or shape of the avatar from the user on this preview screen, and uses the edited avatar as an avatar to appear in the content data. In addition, the avatar data generation unit 11b accepts various settings from the user such as the position at which the avatar is displayed, the direction of the avatar, and the movements (gestures) performed by the avatar in the content data to be generated. To reflect. The avatar data generation unit 11 b generates avatar data including data such as the shape of the avatar and settings such as the display position of the avatar, and stores the data in the storage unit 12 . A plurality of avatars may appear in the content data, and in this case, the avatar data generating section 11b generates avatar data for the plurality of avatars.
 音声データ生成部11cは、コンテンツデータにおいてアバターが発話する音声のデータを生成する処理を行う。音声データ生成部11cは、情報取得部11aが取得したテキスト情報を基にいわゆるテキスト読み上げの処理を行うことによって、テキスト情報を音声データに変換する。テキスト読み上げ処理は、既存の技術であるため、詳細な説明は省略する。情報処理装置1は、テキスト読み上げ処理を自ら行ってもよく、テキスト読み上げ処理を行う他の装置にテキスト情報を送信し、他の装置がテキスト読み上げ処理により変換した音声データを取得してもよい。 The audio data generation unit 11c performs processing for generating audio data spoken by the avatar in the content data. The voice data generation unit 11c performs so-called text-to-speech processing based on the text information acquired by the information acquisition unit 11a, thereby converting the text information into voice data. Since the text-to-speech processing is an existing technology, detailed description is omitted. The information processing apparatus 1 may perform the text-to-speech process by itself, or may transmit text information to another apparatus that performs the text-to-speech process, and acquire voice data converted by the text-to-speech process by the other apparatus.
 音声データ生成部11cは、例えば情報取得部11aが取得したテキスト情報に含まれる一又は複数のテキストを順に表示部14に表示し、音声データへ変換するテキストの選択を受け付ける。また音声データ生成部11cは、生成する音声データについて、例えばテキスト読み上げのピッチ、速度、深さ(声の太さ)、声の高さ、声音、声質又は声量等に関する設定をユーザから受け付けて、受け付けた設定を反映した音声データを生成する。音声データ生成部11cは、生成した音声データを例えばスピーカ又はイヤホン等の音声出力装置から出力する。また音声データ生成部11cは、アバターが複数存在する場合に、アバターとテキストとの対応付けの設定を受け付けると共に、アバター毎に速度又は声質等の設定を受け付ける。 For example, the voice data generation unit 11c sequentially displays one or more texts included in the text information acquired by the information acquisition unit 11a on the display unit 14, and accepts selection of texts to be converted into voice data. In addition, the voice data generation unit 11c receives, from the user, settings related to, for example, the pitch, speed, depth (thickness of voice), pitch, voice, voice quality, or volume of voice data to be generated. Generates audio data that reflects the accepted settings. The audio data generator 11c outputs the generated audio data from an audio output device such as a speaker or an earphone. In addition, when there are a plurality of avatars, the voice data generation unit 11c accepts settings for association between avatars and texts, and also accepts settings such as speed or voice quality for each avatar.
 In the present embodiment, the audio data generation unit 11c also accepts settings related to pronunciation for, for example, a word or a short phrase included in the text information, and corrects the pronunciation of the target word in the audio data. For example, the text information may include the word "湯川", and the audio data generated by the audio data generation unit 11c may pronounce this word as "Yugawa" even though "Yukawa" is the correct pronunciation. In such a case, the user selects "湯川" in the displayed text and performs an operation of setting "Yukawa" as the correct pronunciation of this word. Upon accepting this setting, the audio data generation unit 11c generates audio data in which the pronunciation of every occurrence of "湯川" in the text information is changed from "Yugawa" to "Yukawa".
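 For illustration only, a minimal sketch of how such a pronunciation override might be applied to the text before it is handed to a text-to-speech engine is shown below in Python; the function name and dictionary are assumptions and do not appear in the embodiment.

# Hypothetical sketch: replace registered surface forms with the readings
# set by the user before passing the text to a text-to-speech engine.
pronunciation_overrides = {"湯川": "ユカワ"}  # surface form -> corrected reading

def apply_pronunciation_overrides(text: str, overrides: dict[str, str]) -> str:
    # Every occurrence of each registered word is replaced, mirroring the
    # behaviour in which all instances of "湯川" are read as "Yukawa".
    for surface, reading in overrides.items():
        text = text.replace(surface, reading)
    return text

speech_text = apply_pronunciation_overrides("こんにちは。私の名前は湯川です。", pronunciation_overrides)
# speech_text is then converted to audio data by the text-to-speech processing.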
 Note that this example deals with a word written in ideographic characters (kanji) in the text information and sets its pronunciation in phonetic characters (katakana or hiragana), but the pronunciation settings accepted by the audio data generation unit 11c are not limited to this. The audio data generation unit 11c may accept pronunciation settings expressed in, for example, phonemic characters (romaji) or phonetic symbols. The audio data generation unit 11c may also accept settings related to the pronunciation of a word, such as the position of the accent.
 The background data generation unit 11d performs processing for generating data of images serving as the background of the avatar in the content data. In the present embodiment, a plurality of images (slides) included in the presentation material acquired by the information acquisition unit 11a are used as background images of the avatar, and content data in which the avatar gives a presentation using the presentation material is generated. The background data generation unit 11d accepts settings such as the display order and the timing of switching the display for the plurality of images included in the presentation material acquired by the information acquisition unit 11a. The background data generation unit 11d generates background data including the plurality of background images and settings such as the timing at which they are displayed, and stores the background data in the storage unit 12.
 The background data generation unit 11d also performs processing for adding a caption character string, such as a title or subtitles, to a background image based on the presentation material. The background data generation unit 11d accepts input of a character string to be displayed as a caption from the user, as well as settings such as the position and orientation at which the caption character string is displayed, the size and font of the character string, and the timing at which the caption character string is displayed. The background data generation unit 11d stores the caption character strings and the settings related to them in the background data.
 The content data generation unit 11e generates, as content data, data of a moving image in which, for example, the avatar gives a presentation using the presentation material, based on the avatar data generated by the avatar data generation unit 11b, the audio data generated by the audio data generation unit 11c, and the background data generated by the background data generation unit 11d. The content data generation unit 11e can generate the content data by placing the avatar included in the avatar data on the background image included in the background data, at the position and in the orientation specified by the settings, and outputting the voice included in the audio data at the appropriate timing. The content data generation unit 11e stores the generated content data in the content data storage unit 12b of the storage unit 12.
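 For illustration only, the following Python sketch shows one possible way the three kinds of generated data could be kept together as a single content timeline; all class and field names are assumptions and are not part of the disclosed embodiment.

from dataclasses import dataclass, field

@dataclass
class AvatarData:            # shape plus display settings for one avatar
    name: str
    position: tuple[float, float]
    orientation: float
    gesture: str = "none"

@dataclass
class BackgroundData:        # one slide image and when to start showing it
    image_path: str
    start_time: float        # seconds from the start of the content

@dataclass
class SpeechSegment:         # one utterance produced by text-to-speech
    speaker: str
    audio_path: str
    start_time: float

@dataclass
class ContentData:           # the combined presentation content
    avatars: list[AvatarData] = field(default_factory=list)
    backgrounds: list[BackgroundData] = field(default_factory=list)
    speech: list[SpeechSegment] = field(default_factory=list)

def build_content(avatars, backgrounds, speech) -> ContentData:
    # Combining here simply means keeping the three streams together so that a
    # renderer can draw the current slide, overlay the avatars, and play each
    # utterance at its start time.
    return ContentData(list(avatars), list(backgrounds), list(speech))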
 The display processing unit 11f performs processing for displaying various information such as images and characters on the display unit 14. In the present embodiment, the display processing unit 11f performs processing such as displaying a screen for accepting settings related to avatars, a screen for accepting settings related to voice, a screen for accepting settings related to the background, and displaying the generated content data. In addition to performing these displays on the display unit 14 of the information processing apparatus 1, the display processing unit 11f may also transmit data for display to another apparatus via the communication unit 13 so that the display is performed on the display unit or the like of the other apparatus.
<Content data generation processing>
 FIG. 4 is a flowchart showing the procedure of the content data generation processing performed by the information processing apparatus 1 according to the present embodiment. The information acquisition unit 11a of the processing unit 11 of the information processing apparatus 1 according to the present embodiment acquires a presentation material file or the like created in advance by the user by reading it from the storage unit 12 (step S1).
 The information acquisition unit 11a also acquires text information for the avatar's speech (step S2). At this time, the information acquisition unit 11a acquires, for example, sentences and the like included in the presentation material acquired in step S1 as text information for the avatar's speech, based on comments or the like set in advance in the presentation material. Alternatively, the information acquisition unit 11a may acquire a file or the like of voice information recorded when the presenter gave a presentation, and obtain the text information by performing speech recognition processing to convert the voice information into text information. The information acquisition unit 11a may also acquire text information created in advance by the user, for example by writing the avatar's lines.
 Next, the display processing unit 11f of the processing unit 11 displays, on the display unit 14, a speech editing screen for making settings for uttering the text information, based on the text information acquired in step S2 (step S3). The audio data generation unit 11c of the processing unit 11 accepts editing related to the spoken voice by accepting the user's operations on the operation unit 15 while the speech editing screen is displayed (step S4). The audio data generation unit 11c generates audio data based on the text information, reflecting the edits accepted in step S4 (step S5).
 FIG. 5 is a schematic diagram showing an example of the speech editing screen. The information processing apparatus 1 according to the present embodiment displays the illustrated speech editing screen based on the text information acquired in step S2. On the speech editing screen, for example, a title character string "Speech editing" is shown at the top, a button labeled "Output all audio" and a button labeled "Add text" are arranged side by side below it, and below these a setting table 101 is provided in which a plurality of texts included in the text information and a plurality of setting items related to each text are arranged in a matrix.
 The setting table 101 is a table in which a plurality of texts are arranged as a list in the vertical direction and a plurality of setting items are arranged in the horizontal direction. The setting table 101 has, for example, the items "Number", "Text", "Interval (seconds)", "Speaker", and "Expression" in order from the left, and an icon area is provided at the right end. "Number" is numerical information indicating the order in which the texts are spoken by the avatars in the finally generated content data. "Text" is the text (sentences, lines, etc.) spoken by the avatar, and is character string information of one or more characters. In this example, the first text is set to "Hello. I am Dr. Value. Today, together with our new member College, I would like to introduce the results of our research so far. Nice to meet you, College.", and the second text is set to "Dr. Value, nice to meet you. I am very happy to become a member of the research team.".
 The information processing apparatus 1 can obtain the information displayed in "Number" and "Text" of the illustrated setting table 101 by, for example, appropriately dividing the sentences included in the acquired text information into a plurality of texts based on punctuation marks and the like, and numbering them in order. The division of the sentences included in the text information into a plurality of texts may be performed, for example, during speech recognition processing, may be performed in advance by the user, or may be performed, for example, when the information processing apparatus 1 acquires the text information. When the text is divided during speech recognition processing, for example, when there is an interval exceeding a predetermined time between utterances, the utterances before and after the interval can be divided into two texts. When the user divides the text, the division can be performed, for example, by the user checking the text information in a text editor or the like and inserting line breaks, tabs, or the like at appropriate places.
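 As an illustrative sketch only, the division and numbering could be implemented as follows in Python; the splitting rule shown is just one possible interpretation of dividing "based on punctuation marks and the like" and user-inserted line breaks or tabs.

import re

def split_into_texts(raw_text: str) -> list[tuple[int, str]]:
    # Split after sentence-ending punctuation and at explicit line breaks or
    # tabs inserted by the user, then number the resulting pieces in order.
    pieces = [p.strip() for p in re.split(r"(?<=[。．.!?！？])\s*|[\n\t]+", raw_text) if p.strip()]
    return [(i + 1, piece) for i, piece in enumerate(pieces)]

rows = split_into_texts("こんにちは。\nはじめまして。")
# rows == [(1, "こんにちは。"), (2, "はじめまして。")], corresponding to rows of the setting table 101.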
 The "Interval (seconds)" column of the setting table 101 is an item for setting, in seconds, the interval placed between the utterance of this text and the utterance of the preceding text. In this example, 0.5 seconds is set by the information processing apparatus 1 as the default value. "Speaker" is an item for setting which avatar speaks this text. In this example, "Dr. Value" is set to speak the first text and "College" is set to speak the second text. The information processing apparatus 1 can accept the "Speaker" setting by accepting a selection of one avatar from among avatars registered in advance, for example by means of a pull-down menu. "Expression" is an item for setting the facial expression of the avatar to which this text is assigned when the avatar speaks it. In this example, expressions such as "natural" and "smiling" are set. The information processing apparatus 1 can accept the "Expression" setting by accepting a selection of one expression from among expressions registered in advance, for example by means of a pull-down menu.
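 For illustration only, one row of the setting table 101 could be represented by a simple record such as the following; the type and field names are assumptions made for this sketch.

from dataclasses import dataclass

@dataclass
class UtteranceRow:
    number: int          # order in which the text is spoken
    text: str            # the sentence or line the avatar speaks
    interval_sec: float  # pause before this utterance (default 0.5)
    speaker: str         # name of the avatar, e.g. "Dr. Value"
    expression: str      # facial expression, e.g. "natural" or "smiling"

row = UtteranceRow(1, "こんにちは。私はバリュー博士です。", 0.5, "Dr. Value", "natural")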
 The information processing apparatus 1 displays, for example, an icon resembling a speaker and an icon resembling a trash can in the icon area at the right end of the setting table 101, in association with each text. The speaker-shaped icon is for accepting an operation to output the corresponding text as voice. When an operation on this icon is accepted, the information processing apparatus 1 performs voice output of the corresponding text only. The trash can icon is for accepting an operation to delete the text. When an operation on this icon is accepted, the information processing apparatus 1 deletes the corresponding text and its settings.
 The "Output all audio" button provided at the top of the speech editing screen is a button for performing voice output of all the texts. When the information processing apparatus 1 accepts an operation on the "Output all audio" button, it performs voice output of all the texts included in the setting table 101 in order from the first to the last. The "Add text" button is a button for adding an arbitrary text. When the information processing apparatus 1 accepts an operation on the "Add text" button, it displays, for example, a dialog box (not shown) for adding text, and accepts input of the text to be added as well as settings such as the order in which the text is spoken, the interval, the speaker, and the expression. The information processing apparatus 1 adds the text by inserting the text and settings accepted in this dialog box into the setting table 101 at the appropriate position.
 The information processing apparatus 1 according to the present embodiment also accepts a pronunciation correction operation from the user for a word or a short phrase included in the text information. FIG. 6 is a schematic diagram for explaining the pronunciation correction operation. FIG. 6 shows a speech editing screen in which the text "Hello. My name is Yukawa." ("こんにちは。私の名前は湯川です。") is set in the setting table 101. The user performs an operation of selecting the word "湯川" included in this text, for example using an input device such as a mouse. In response to this operation, the information processing apparatus 1 displays, for example, a button labeled "Correct pronunciation". When an operation on this button is accepted, the information processing apparatus 1 displays, for example, a pronunciation correction dialog box and accepts settings related to pronunciation from the user.
 FIG. 7 is a schematic diagram showing an example of the pronunciation correction dialog box. The upper part of FIG. 7 shows the state before pronunciation correction, and the lower part shows the state after pronunciation correction. In the pronunciation correction dialog box of this example, for example, the title character string "Correct pronunciation" is displayed at the top, a text box labeled "Target text" and a text box labeled "Pronunciation" are arranged one above the other below it, and further below these a button labeled "Audio output" and a button labeled "Done" are arranged side by side.
 The information processing apparatus 1 displays the word selected by the user on the speech editing screen in the "Target text" text box. In this example, "湯川" selected on the speech editing screen shown in FIG. 6 is displayed in the "Target text" text box. The information processing apparatus 1 also displays the pronunciation used when the target text is spoken in the "Pronunciation" text box, in phonetic notation such as katakana or hiragana. In the example in the upper part of FIG. 7, the katakana "ユガワ" (Yugawa) is displayed as the phonetic notation, indicating that under the current setting the word "湯川" is spoken with the pronunciation "Yugawa".
 The "Audio output" button of the pronunciation correction dialog box is a button for outputting only the word shown in "Target text" as voice. When an operation on the "Audio output" button is accepted, the information processing apparatus 1 performs voice output in which only the word shown in "Target text" is read aloud with the pronunciation set in "Pronunciation". In the example in the upper part of FIG. 7, voice output is performed with the pronunciation "Yugawa".
 The information processing apparatus 1 accepts correction of the pronunciation of the target text by accepting the user's correction of the phonetic notation displayed in the "Pronunciation" text box. In the example in the lower part of FIG. 7, the user corrects "ユガワ" (Yugawa), displayed in the text box as the current pronunciation of "湯川", to "ユカワ" (Yukawa) using an input device such as a keyboard. When the user operates the "Audio output" button in the example in the lower part of FIG. 7, the information processing apparatus 1 performs voice output with the pronunciation "Yukawa".
 The "Done" button of the pronunciation correction dialog box is a button for applying the user's pronunciation correction and closing the dialog box. When an operation on the "Done" button is accepted, the information processing apparatus 1 stores the word in the "Target text" box in association with the pronunciation set in the "Pronunciation" text box, and generates audio data in which the set pronunciation is applied to every occurrence of the same word included in the text information.
 Having performed the processing related to editing of the spoken voice in steps S3 to S5 of the flowchart shown in FIG. 4, the display processing unit 11f of the information processing apparatus 1 displays, on the display unit 14, a content editing screen for setting the avatar, background, and the like to be displayed in the content data, based on the presentation material acquired in step S1, the text information acquired in step S2, and the like (step S6). The avatar data generation unit 11b and the background data generation unit 11d of the processing unit 11 accept editing related to the avatar and the background by accepting the user's operations on the operation unit 15 while the content editing screen is displayed (step S7). The avatar data generation unit 11b generates avatar video data reflecting the edits accepted in step S7 (step S8). The background data generation unit 11d generates background image data reflecting the edits accepted in step S7 (step S9).
 FIG. 8 is a schematic diagram showing an example of the content editing screen. The information processing apparatus 1 according to the present embodiment displays the illustrated content editing screen based on the presentation material acquired in step S1, the text information acquired in step S2, and the like. On the content editing screen, for example, a title character string "Content editing" is shown at the top, and below it a background image selection area 111, a content editing area 112, and an avatar setting area 113 are arranged side by side in the horizontal direction.
 The background image selection area 111 of the content editing screen is an area for accepting the user's selection of a background image. The information processing apparatus 1 displays the plurality of slides included in the presentation material as a list in the background image selection area 111, each as a background image. In the content data, the plurality of background images listed in the background image selection area 111 are displayed in the order in which they are arranged in this area. The information processing apparatus 1 accepts a selection of one background image from among the background images listed in the background image selection area 111 and displays the selected background image in the content editing area 112. The information processing apparatus 1 also accepts operations such as adding and deleting background images and changing their display order, and performs the corresponding addition, deletion, reordering, and the like of the listed background images according to the accepted operations.
 The avatar setting area 113 of the content editing screen is an area for accepting settings related to one or more avatars appearing in the content data. Within the avatar setting area 113, an avatar selection area, a text display area, a setting acceptance area, and the like are arranged vertically. The information processing apparatus 1 displays a list of images, names, and the like of one or more avatars created in advance in the avatar selection area of the avatar setting area 113. In this example, the two avatars "Dr. Value" and "College" are displayed in the avatar selection area, and the avatar "Dr. Value" is selected. The information processing apparatus 1 displays the avatar selected in the avatar selection area in the content editing area 112. Although not illustrated, avatars are created on an avatar creation screen or the like. The avatars need not be created by the user; for example, the user may acquire and use avatars provided for a fee or free of charge. Since the method of creating an avatar is an existing technology, a detailed description is omitted.
 The text display area of the avatar setting area 113 is an area in which the text spoken by the selected avatar is displayed. The information processing apparatus 1 displays one or more texts included in the text information in the text display area, based on the text information acquired in step S2 or the text information as edited on the speech editing screen described above. The information processing apparatus 1 selects the plurality of texts included in the text information in output order and displays them in the text display area, and the user can change the text displayed in the text display area as appropriate.
 The setting acceptance area of the avatar setting area 113 is an area that accepts settings for a plurality of setting items related to the avatar selected in the avatar selection area. In this example, "Gesture", "Size", "Orientation", "Position", and the like are shown as setting items related to the avatar. The user can input or select a setting value for each setting item by various methods, such as direct numerical input or selection from a pull-down menu. The illustrated setting items are merely an example and are not limiting; the information processing apparatus 1 may provide various setting items other than those illustrated and accept settings related to the avatar through them. The information processing apparatus 1 may also provide, in the setting acceptance area, setting items related to the voice with which the avatar speaks the text, such as the pitch, speed, depth, or volume of the avatar's voice, and accept these settings from the user.
 The content editing area 112 of the content editing screen displays the avatar selected in the avatar selection area of the avatar setting area 113 superimposed on the background image selected in the background image selection area 111. As a result, an image reproducing one scene of the content data to be finally generated is displayed in the content editing area 112. The user can change the position, orientation, and the like of the avatar displayed in the content editing area 112, for example by mouse operation or touch operation, and the information processing apparatus 1 accepts changes to settings such as the position and orientation of the avatar in response to these user operations. When a setting change is accepted in the content editing area 112, the information processing apparatus 1 changes the setting value of the corresponding setting item provided in the setting acceptance area of the avatar setting area 113. Conversely, when a setting change is accepted in the setting acceptance area of the avatar setting area 113, the information processing apparatus 1 changes the display mode of the avatar displayed in the content editing area 112 according to the accepted setting.
 In the present embodiment, the user can also add a caption to the background image by performing a predetermined operation in the content editing area 112. For example, when the user designates a point in the content editing area 112 using a function such as a right-click menu of a mouse and performs a caption-adding operation, the information processing apparatus 1 displays a text box in the content editing area 112 for entering the caption character string, and displays a caption setting area in place of the avatar setting area 113 of the content editing screen.
 FIG. 9 is a schematic diagram showing an example of the content editing screen provided with a caption setting area 114. The caption setting area 114 of the content editing screen is an area for accepting settings related to the caption character string entered in the content editing area 112. The caption setting area 114 is provided with setting items such as "Font type", "Size", and "Position". In the illustrated example, the character string "Introduction to Business in the Digital Age" is entered as a caption in the text box indicated by the dashed rectangular frame in the content editing area 112. The information processing apparatus 1 accepts settings for this caption in each setting item of the caption setting area 114, and displays the caption in a display mode according to the accepted settings. The information processing apparatus 1 stores information such as the entered caption character string and the caption settings together with, for example, the background image data.
 Having performed the processing related to content editing in steps S6 to S9 of the flowchart shown in FIG. 4, the content data generation unit 11e of the processing unit 11 of the information processing apparatus 1 accepts, for example, a content data generation operation by the user, generates content data (step S10), stores the generated content data in the content data storage unit 12b of the storage unit 12, and ends the processing. At this time, the content data generation unit 11e integrates the audio data generated in step S5, the avatar video data generated in step S8, and the background image data generated in step S9 to generate content data in which the avatar superimposed on the background image speaks.
 In the present embodiment, the background image, various parts images placed on the background image, and the avatars can be arranged in a superimposed manner on the same layer or on different layers. The information processing apparatus 1 acquires, for example, image files of various parts to be included in the content, together with the presentation material, the text information, and the like. The information processing apparatus 1 arranges these various parts together with the avatar in appropriate positions and order, for example by accepting the user's operations on the content editing screen. The user can perform editing such as placing a lectern in front of the avatar or placing decorative parts between the avatar and the background, expressing depth and the like on the screen and enhancing the sense of presence of the avatar's presentation.
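 A minimal sketch of such layered composition, assuming a simple back-to-front (painter's algorithm) renderer, is shown below; all names and the z-order convention are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class LayerItem:
    name: str      # e.g. "background", "lectern", "avatar"
    image: str     # path to the image to draw
    z_order: int   # smaller values are drawn first (further back)

def render_scene(items: list[LayerItem]) -> list[str]:
    # Draw items back to front; an avatar given a z_order behind the background
    # is simply never visible, which realizes the hidden-narrator arrangement
    # described in the next paragraph.
    ordered = sorted(items, key=lambda item: item.z_order)
    return [item.name for item in ordered]  # stand-in for actual drawing calls

scene = render_scene([
    LayerItem("avatar", "avatar.png", 2),
    LayerItem("background", "slide1.png", 0),
    LayerItem("lectern", "desk.png", 1),
])
# scene == ["background", "lectern", "avatar"]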
 The user can also hide the avatar on the screen, for example by performing an operation that places the avatar behind the background. As a result, when the user wants to draw the audience's attention to, for example, the background image rather than the avatar, the lines of the hidden avatar can be output as narration, allowing the user to create more effective content.
 In the present embodiment, the text information acquired together with the presentation material may contain many texts that become the avatars' lines. In the present embodiment, the association between the many lines included in the text information and the avatars that speak them is made, for example, on the content editing screen. However, the information processing apparatus 1 may assign all the lines to one avatar selected in advance, for example when acquiring the presentation material and the text information. When the information processing apparatus 1 performs such batch assignment of lines, the user can generate content data without performing operations to assign each line to an avatar on the subsequent content editing screen or the like. After such batch assignment, however, the information processing apparatus 1 may still accept editing operations from the user, such as reassigning lines assigned to a first avatar to a second avatar on the content editing screen or the like.
 The information processing apparatus 1 that has generated the content data can reproduce the content data, for example using a video playback application program, and display it on the display unit 14. The information processing apparatus 1 may also upload the content data to, for example, a video distribution site.
<Another example of the speech editing screen>
 FIG. 10 is a schematic diagram showing another example of the speech editing screen. The information processing apparatus 1 according to the present embodiment may display the speech editing screen shown in FIG. 10 instead of, for example, the speech editing screen shown in FIG. 5. The speech editing screen shown in FIG. 10 is suitable for creating content data in which two avatars converse with each other.
 On the speech editing screen shown in FIG. 10, a title character string "Speech editing screen" is displayed at the top of the screen, and below this title character string the names of the two avatars, "Dr. Value" and "College", are displayed side by side. On the speech editing screen of this example, the left side of the screen is used as an area for displaying information such as the utterances of "Dr. Value", and the right side of the screen is used as an area for displaying information such as the utterances of "College".
 On the speech editing screen of this example, the text information spoken by each avatar is placed in a rectangular frame, and the plurality of pieces of text information are displayed in chronological order of utterance from the top to the bottom of the screen. The text information related to "Dr. Value" is displayed toward the left side of the screen, and the text information related to "College" is displayed toward the right side of the screen. The user can scroll through the chronologically ordered pieces of text information, for example by performing a vertical sliding operation, and can thereby check pieces of text information that do not fit on one screen. The user can also perform operations to freely edit the text information contained within each rectangular frame.
 On the speech editing screen of this example, one or more icons are provided within the rectangular frame containing each avatar's text information, for example in the lower right corner of the frame. In this figure, these icons are shown in simplified form as square figures. These icons accept various operations from the user, such as accepting settings for the corresponding text information, accepting an operation to output the corresponding text information as voice, or accepting an operation to delete the corresponding text information.
 On the speech editing screen of this example, a rectangular frame elongated in the horizontal direction, indicating the time setting of the interval placed between utterances, is displayed between the text information of two chronologically consecutive utterances. In this example, a rectangular frame bearing the character string "Interval: 0.5 seconds" is displayed between the text information "Hello. ... Nice to meet you." of "Dr. Value" and the text information "Dr. Value, ... happy ..." of "College". This indicates that an interval of 0.5 seconds, that is, a period during which neither avatar speaks, is placed between the utterance of "Dr. Value" and the utterance of "College". The user can set the interval time as desired by correcting the numerical value in the rectangular frame.
 In this manner, the information processing apparatus 1 according to the present embodiment displays a speech editing screen in which a plurality of pieces of text information spoken by the avatars are arranged in chronological order in the vertical direction of the screen, and the text information spoken by the two avatars is separated between the left and right sides of the screen. The user can thereby be expected to easily generate content data such as a moving image in which, for example, two avatars give a presentation while conversing.
 In this example, the case where two avatars speak has been described, but the configuration is not limited to this, and a similar configuration can be applied when three or more avatars speak. For example, when three avatars speak, the speech editing screen can be divided into three areas on the left, in the center, and on the right, each area can be associated with one of the avatars, and the text information of the utterances can be displayed in chronological order.
 The information processing apparatus 1 may also display a speech editing screen in which the plurality of pieces of text information spoken by the avatars are arranged in chronological order in the horizontal direction of the screen, and the text information spoken by the two avatars is separated between the upper and lower parts of the screen.
<Camera work settings>
 The information processing apparatus 1 according to the present embodiment may use a three-dimensional model, that is, an object of a three-dimensional character reproduced in a three-dimensional virtual space, as the avatar displayed in the content data. The information processing apparatus 1 reads the data of a three-dimensional avatar model created in advance or newly created by the user, and reproduces this avatar in the three-dimensional virtual space. The information processing apparatus 1 can generate content data by acquiring two-dimensional images obtained by photographing the avatar with a virtual camera appropriately placed in the three-dimensional virtual space.
 In order to capture the images (moving images) of the avatar to be included in the content data, the information processing apparatus 1 accepts from the user settings related to the position of the virtual camera in the three-dimensional virtual space, that is, the camera work. The information processing apparatus 1 accepts from the user settings such as, for example, the front-back, left-right, and up-down position (x, y, and z coordinates) in the three-dimensional virtual space, the orientation of the virtual camera from this position, and the changes of this position and orientation over time. The information processing apparatus 1 places the virtual camera in the three-dimensional virtual space according to the accepted settings, moves the virtual camera to photograph the avatar, and acquires two-dimensional images to be included in the content data.
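 As an illustrative sketch only, such camera-work settings could be stored as time-stamped keyframes and interpolated while the avatar is filmed; the structure and the choice of linear interpolation below are assumptions and not part of the embodiment.

from dataclasses import dataclass

@dataclass
class CameraKeyframe:
    time: float                            # seconds from the start of the shot
    position: tuple[float, float, float]   # x, y, z in the virtual space
    yaw_pitch: tuple[float, float]         # orientation of the virtual camera

def camera_at(keyframes: list[CameraKeyframe], t: float) -> CameraKeyframe:
    # Linear interpolation between the two keyframes surrounding time t,
    # assuming strictly increasing keyframe times.
    keyframes = sorted(keyframes, key=lambda k: k.time)
    for a, b in zip(keyframes, keyframes[1:]):
        if a.time <= t <= b.time:
            w = (t - a.time) / (b.time - a.time)
            lerp = lambda u, v: tuple(x + (y - x) * w for x, y in zip(u, v))
            return CameraKeyframe(t, lerp(a.position, b.position), lerp(a.yaw_pitch, b.yaw_pitch))
    return keyframes[0] if t < keyframes[0].time else keyframes[-1]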
 FIG. 11 is a schematic diagram for explaining an example of a camera work setting method. For example, when a predetermined operation, such as a right-click of the mouse, is performed on the avatar displayed in the content editing area 112 of the content editing screen shown in FIG. 8, the information processing apparatus 1 displays the shot selection menu dialog box shown in the upper part of FIG. 11. In the shot selection menu, for example, selection items such as "Head shot", "Upper-body shot", and "Full-body shot" are displayed one above the other. The user can select any one of these selection items.
 When, for example, "Head shot" is selected in the shot selection menu, the information processing apparatus 1 places the virtual camera at a position close to the avatar in the three-dimensional virtual space so as to display the avatar's head and its surrounding parts, as shown on the lower left of FIG. 11. Similarly, when, for example, "Upper-body shot" or "Full-body shot" is selected, the information processing apparatus 1 places the virtual camera at a position suitable for capturing the avatar's upper body or whole body, as shown in the lower center or lower right of FIG. 11.
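 One possible way to realize this menu is a simple mapping from the selected shot type to a camera distance and height relative to the avatar; the preset values and names below are illustrative assumptions.

# Hypothetical shot presets: distance from the avatar and camera height,
# expressed in the coordinate system of the virtual space.
SHOT_PRESETS = {
    "head":       {"distance": 0.6, "height": 1.6},   # frame the head and shoulders
    "upper_body": {"distance": 1.5, "height": 1.3},   # frame from the waist up
    "full_body":  {"distance": 3.0, "height": 1.0},   # frame the whole avatar
}

def place_camera_for_shot(shot: str, avatar_position: tuple[float, float, float]) -> dict:
    preset = SHOT_PRESETS[shot]
    x, y, z = avatar_position
    # Place the camera in front of the avatar at the preset distance and height,
    # looking back toward the avatar.
    return {"position": (x, preset["height"], z + preset["distance"]),
            "look_at": (x, preset["height"], z)}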
 図12は、カメラワークの設定方法の一例を説明するための模式図である。例えば図8に示したコンテンツ編集画面においてコンテンツ編集領域112に表示されたアバターに対する所定の操作、例えばマウスの右クリック操作などが行われた場合、情報処理装置1は、図12の上段に示す方向選択メニューのダイアログボックスを表示する。方向選択メニューには、例えば、「左側」、「正面」及び「右側」等の選択項目が上下に並べて表示される。ユーザは、これら複数の選択項目の中からいずれか1つを選択することができる。 FIG. 12 is a schematic diagram for explaining an example of a camerawork setting method. For example, when a predetermined operation, such as a right-click operation of a mouse, is performed on the avatar displayed in the content editing area 112 on the content editing screen shown in FIG. Show the selection menu dialog box. In the direction selection menu, for example, selection items such as "left", "front" and "right" are displayed vertically. The user can select any one of these multiple selection items.
 When, for example, "Left" is selected in the direction selection menu, the information processing apparatus 1 places the virtual camera on the left side of the avatar and photographs the avatar with the virtual camera, as shown on the lower left of FIG. 12. Similarly, when, for example, "Front" or "Right" is selected, the information processing apparatus 1 places the virtual camera in front of or to the right of the avatar, as shown in the lower center or lower right of FIG. 12. In this example, the left and right directions used for the setting are the left and right directions as seen from the virtual camera toward the avatar, but this is not a limitation, and the left and right directions as seen from the avatar may be used instead.
 FIG. 13 is a schematic diagram for explaining yet another example of a camera work setting method. For example, when a predetermined operation, such as selecting the avatar with a left mouse click, is performed on the avatar displayed in the content editing area 112 of the content editing screen shown in FIG. 8, the information processing apparatus 1 displays, near the avatar, a slide bar (slider, slider bar, scroll bar, etc.) for making settings related to zooming.
 In the example shown in FIG. 13, a slide bar elongated in the horizontal direction is displayed below the avatar, and the user can perform an operation of sliding the knob of the slide bar in the horizontal direction. When, for example, the knob of the slide bar is slid to the left, the information processing apparatus 1 moves the virtual camera away from the avatar (zooms out), as shown on the left side of FIG. 13. When, for example, the knob of the slide bar is slid to the right, the information processing apparatus 1 moves the virtual camera closer to the avatar (zooms in), as shown on the right side of FIG. 13.
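 A minimal sketch of mapping the slider position to the camera distance is shown below; the value range and the linear mapping are assumptions made for illustration.

def camera_distance_from_slider(slider: float,
                                min_distance: float = 0.5,
                                max_distance: float = 5.0) -> float:
    # slider is 0.0 at the left end (zoomed out) and 1.0 at the right end (zoomed in).
    slider = max(0.0, min(1.0, slider))
    # Sliding to the right decreases the distance to the avatar, i.e. zooms in.
    return max_distance - slider * (max_distance - min_distance)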
 In this manner, the information processing apparatus 1 according to the present embodiment accepts settings such as the position and orientation of the virtual camera that photographs the avatar placed in the three-dimensional virtual space, photographs the avatar with the virtual camera using camera work according to the accepted settings, and generates content data including the two-dimensional images of the avatar obtained by the photographing. In the information processing system according to the present embodiment, this allows the user to easily set the orientation, size, and the like of the avatar displayed in the content data.
<Avatar according to region or age>
 The information processing apparatus 1 according to the present embodiment changes the behavior of the avatar included in the generated content data according to the region to which the content data is provided, the age of the people to whom it is provided, and the like. For this purpose, the information processing apparatus 1 accepts from the user settings such as the region to which the content data is provided or the age of the people to whom it is provided. The region for which the information processing apparatus 1 accepts a setting may be, for example, a country, a prefecture, or a state. The information processing apparatus 1 may accept an approximate age group as the setting, such as the twenties or the thirties, may accept an age range by numerical input, such as 25 to 40 years old, or may accept the age setting by other methods.
 When the language spoken by the avatar is English, the information processing apparatus 1 presents the user with options such as the United States, the United Kingdom, and Australia as regions, and accepts the region setting from among them. English pronunciation, accent, and the like differ from region to region, and the information processing apparatus 1 converts the text information into voice so that the avatar speaks with the pronunciation, accent, and the like corresponding to the set region.
 When the language spoken by the avatar is Japanese, the information processing apparatus 1 presents the user with names of regions, such as the Kanto region, the Kansai region, and the Tohoku region, as options, and accepts the selection of a region from among them. The information processing apparatus 1 may also present the user with dialect names, such as standard Japanese, the Kansai dialect, and the Tohoku dialect, as region options. The information processing apparatus 1 converts the text information into voice so that the avatar speaks with the pronunciation, accent, and the like corresponding to the dialect of the set region.
 When the conversion from text information to audio data is performed using a learning model generated by machine learning, for example, the information processing apparatus 1 prepares learning models that have each learned the pronunciation, accent, and the like of a particular region, and can convert the text information into audio data by switching between the learning models according to the region set by the user.
 In Japanese dialects, for example, the names of things themselves may differ. The information processing apparatus 1 therefore changes the phrases, words, and the like included in the text information that the avatar outputs as speech, according to the set region. The information processing apparatus 1 has, for example, a database that associates phrases or words that may be included in the text information with their expressions in each region, replaces the phrases, words, and the like included in the text information with expressions suited to the region based on this database and the region set by the user, and generates audio data based on the text information after the replacement.
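 A minimal sketch of such region-dependent word substitution is shown below, with a small in-memory table standing in for the database; the entries and names are illustrative assumptions.

# Hypothetical mapping: standard-Japanese word -> expression used in each region.
REGIONAL_EXPRESSIONS = {
    "とても": {"関西": "めっちゃ"},
    "ありがとう": {"関西": "おおきに"},
}

def localize_text(text: str, region: str) -> str:
    # Replace each registered word with the expression for the chosen region,
    # leaving it unchanged when no regional expression is registered.
    for standard, by_region in REGIONAL_EXPRESSIONS.items():
        if region in by_region:
            text = text.replace(standard, by_region[region])
    return text

localized = localize_text("とてもありがとう。", "関西")  # -> "めっちゃおおきに。"
# The localized text is then converted into audio data by text-to-speech processing.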
 Also, for example, young people and elderly people may use different names or expressions for the same thing. The information processing apparatus 1 may therefore change the phrases, words, and the like included in the text information to be output as voice, according to the set age group.
 Depending on the age of the users who view the content data, there are also differences in the volume and speed of speech that they can comfortably follow. For example, when the viewing users are elderly, it is preferable for the avatar to speak loudly and slowly. The information processing apparatus 1 according to the present embodiment therefore generates the content data while changing the volume, speed, and the like of the avatar's speech according to the set age group.
 In the information processing system according to the present embodiment, the avatar can also be made to perform gestures in the content data. However, the same gesture may have different meanings in different countries. The information processing apparatus 1 according to the present embodiment therefore generates the content data after changing the gesture that the user has set the avatar to perform into a gesture corresponding to the region set by the user.
 In this manner, the information processing apparatus 1 according to the present embodiment accepts from the user settings such as the region or age group to which the content data is provided, and generates content data in which the avatar speaks with pronunciation, accent, phrases, words, volume, speed, and the like corresponding to the set region or age group, or in which the avatar performs gestures corresponding to the set region or age group. The user can thereby be expected to easily generate content data intended for a different region, age group, and the like, based on, for example, content data generated for a specific region and age group.
 In the present embodiment, the information processing device 1 accepts settings for the region or age group to which the content data is provided, but it may also accept other settings about the intended audience, such as gender, religion, industry, or field, and reflect them in the generation of the content data.
<Sound image according to avatar display position>
 As described above, the content data generated by the information processing device 1 according to the present embodiment includes, for example, an image (moving image) in which a presentation material serves as the background image and an avatar is placed in front of that background image, together with the voice spoken by the avatar. The user can freely set where the avatar is placed on the screen displayed by playing back the content data. The information processing device 1 according to the present embodiment can set the sound image of the voice spoken by the avatar according to the avatar's display position.
 A sound image is, for example, the location or direction at which a user who plays back the content data and hears the sound perceives the sound source to be. In the present embodiment, the positions at which the avatar can be displayed on the playback screen are divided into three, namely left, center, and right, and the information processing device 1 sets the sound image of the avatar's speech to one of these three positions according to the avatar's display position.
 For example, when the avatar's display position is the left side, the information processing device 1 sets the ratio of the output levels of the left channel (L) and the right channel (R) of the stereo audio data included in the content data to R:L = 2:1. The user playing back the content data can thereby be expected to perceive the sound source of the avatar's speech as being on the left side. When the avatar's display position is the center, the information processing device 1 sets the left/right output level ratio of the audio data to R:L = 1:1, and when the display position is the right side, it sets the ratio to R:L = 1:2.
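 A minimal sketch of this level-based panning follows, assuming the two channels are held as NumPy arrays. The per-position ratios are taken as written in the passage above; the gain normalization is an added assumption.

import numpy as np

# Output-level ratios (R, L) per display position, as given in the passage above.
PAN_RATIOS = {"left": (2.0, 1.0), "center": (1.0, 1.0), "right": (1.0, 2.0)}  # (R, L)

def pan_stereo(left: np.ndarray, right: np.ndarray, position: str):
    """Scale the stereo channels so the avatar's voice is localized according to its display position."""
    r_gain, l_gain = PAN_RATIOS[position]
    total = r_gain + l_gain
    return left * (l_gain / total), right * (r_gain / total)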
 In the present embodiment, the information processing device 1 sets the sound image by adjusting the left and right output levels of the stereo audio, but the method of setting the sound image is not limited to this, and any method may be adopted.
 For example, the information processing device 1 can set the sound image according to the avatar's display position by means of a sound image localization technique that uses a head-related transfer function (HRTF). FIG. 14 is a schematic diagram explaining the outline of this technique. The technique uses, for example, FIR (Finite Impulse Response) filters 121 to 124. The right channel (R) of the stereo audio is input to the two FIR filters 121 and 122, and the left channel (L) is input to the two FIR filters 123 and 124. The sum of the right-channel (R) audio processed by the FIR filter 121 and the left-channel (L) audio processed by the FIR filter 123 is output as the new right channel (R'). Likewise, the sum of the right-channel (R) audio processed by the FIR filter 122 and the left-channel (L) audio processed by the FIR filter 124 is output as the new left channel (L').
 By appropriately adjusting the parameters of the FIR filters 121 to 124 according to, for example, the avatar's display position, the information processing device 1 can adjust the position of the sound image of the avatar's speech. The information processing device 1 may also create and store in advance multiple sets of parameters for the FIR filters 121 to 124, each associated with one of the positions at which the avatar can be displayed, and read out and use the parameter set corresponding to the set display position. The parameters of the FIR filters 121 to 124 can be determined using, for example, a head-related transfer function. Since sound image localization using head-related transfer functions is an existing technique, a detailed description is omitted here.
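 A minimal sketch of the four-filter mixing structure of FIG. 14 follows, assuming the FIR coefficients have already been derived from a head-related transfer function for the chosen avatar position. SciPy is used here only as one possible filtering tool; the patent does not name a library.

import numpy as np
from scipy.signal import lfilter

def localize(right: np.ndarray, left: np.ndarray,
             h121: np.ndarray, h122: np.ndarray,
             h123: np.ndarray, h124: np.ndarray):
    """Mix the stereo channels through the FIR filters 121-124 to form the new R' and L' channels."""
    r_new = lfilter(h121, [1.0], right) + lfilter(h123, [1.0], left)   # filters 121 + 123 -> R'
    l_new = lfilter(h122, [1.0], right) + lfilter(h124, [1.0], left)   # filters 122 + 124 -> L'
    return r_new, l_new

# One parameter set (h121..h124) would be stored per avatar display position and
# looked up when the position is set, as described in the text.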
 The information processing device 1 generates content data that includes the two outputs above, that is, the audio of the new left channel (L') and the new right channel (R'). A user who plays back this content data can thus hear the avatar's speech with a sound image that matches the display position.
<Linking the avatar's facial expression, voice, and gestures>
 As shown in FIG. 5, for example, the information processing system according to the present embodiment allows the user to set the avatar's facial expression. The information processing device 1 according to the present embodiment may adjust the pitch, volume, and the like of the voice spoken by the avatar according to the facial expression set by the user.
 For example, when "smile" is set as the avatar's facial expression, the information processing device 1 raises both the pitch and the volume of the voice; when "angry face" is set, it lowers the pitch and raises the volume. The information processing device 1 stores in a database, for example, the correspondence between avatar facial expressions and the amounts by which pitch and volume are increased or decreased, and retrieves these amounts from the database according to the facial expression set by the user. By applying the retrieved amounts to default values defined for the pitch and volume of the avatar's speech, the information processing device 1 can adjust the pitch and volume according to the facial expression.
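 A minimal sketch of this lookup-and-apply step follows; the offset values and default levels are illustrative assumptions, not figures from the patent.

# Pitch/volume offsets per facial expression (illustrative values only).
EXPRESSION_OFFSETS = {"smile": (+2.0, +3.0), "angry": (-2.0, +3.0)}   # (pitch delta, volume delta)
DEFAULT_PITCH, DEFAULT_VOLUME = 0.0, 0.0

def voice_params(expression: str):
    """Apply the expression-specific offsets to the default pitch and volume of the avatar's speech."""
    dp, dv = EXPRESSION_OFFSETS.get(expression, (0.0, 0.0))
    return DEFAULT_PITCH + dp, DEFAULT_VOLUME + dv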
 The information processing device 1 may also determine the avatar's facial expression based on characteristics of the text information, for example when the user selects automatic setting for the avatar's facial expression. For example, the information processing device 1 determines whether the text information spoken by the avatar contains a specific word, keyword, or the like, and if so, sets the facial expression associated with that word or keyword as the avatar's facial expression. For example, if the text information contains a word such as "happy" or "delicious", the information processing device 1 can set the avatar's facial expression to "smile" and raise the pitch and volume of the avatar's speech. The information processing device 1 holds, for example, a database that associates specific words or keywords that may appear in text information with avatar facial expressions.
 When the text information contains a specific word, keyword, or the like, the information processing device 1 may also automatically set a gesture so that the avatar performs the associated gesture when uttering that word or keyword, as sketched below. For example, if the text information contains the word "Wow", the information processing device 1 can make the avatar open its mouth and eyes wide and move its hands; if the text information contains the word "No", it can make the avatar shake its head. The information processing device 1 holds, for example, a database that associates specific words or keywords that may appear in text information with avatar gestures.
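 The keyword-driven assignment of expressions and gestures described in the last two paragraphs can be sketched as a simple dictionary lookup; the keyword lists and motion labels below are assumptions chosen for illustration.

# Illustrative keyword tables; the actual database contents are not specified in the patent.
KEYWORD_EXPRESSIONS = {"うれしい": "smile", "美味しい": "smile"}
KEYWORD_GESTURES = {"Wow": "eyes_wide_hands_up", "No": "shake_head"}

def annotate_line(text: str):
    """Return (expression, gesture) hints for one line of narration text, or None where no keyword matches."""
    expression = next((e for k, e in KEYWORD_EXPRESSIONS.items() if k in text), None)
    gesture = next((g for k, g in KEYWORD_GESTURES.items() if k in text), None)
    return expression, gesture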
 The information processing device 1 may also determine the avatar's facial expression from the text information by using a learning model that has been trained in advance by machine learning to estimate an emotion from input text. For example, by performing supervised learning on the model with training data (teacher data) that associates text with emotions, a learning model that estimates an emotion for input text can be generated. The information processing device 1 inputs the text information corresponding to the avatar's utterance into this learning model, obtains the emotion estimation result output by the model, and sets the facial expression associated with that emotion as the avatar's facial expression. In this case, the information processing device 1 may input into the learning model, for example, all of the text information prepared for generating one piece of content data, each sentence contained in the text information, each group of sentences up to an interval, or text information in any other unit or amount. The machine learning process that generates the learning model may also be performed by a device other than the information processing device 1.
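 As one way to realize the supervised emotion estimation described above, the following sketch trains a small text classifier; the model family, the placeholder training pairs, and the emotion-to-expression mapping are all assumptions for illustration, not details from the patent.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder (text, emotion) training pairs; real teacher data would be far larger.
train_texts = ["とてもうれしい結果になりました", "残念ながら目標に届きませんでした"]
train_labels = ["joy", "sadness"]

emotion_model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(1, 2)),   # character n-grams avoid the need for tokenization
    LogisticRegression(),
)
emotion_model.fit(train_texts, train_labels)

EXPRESSION_FOR_EMOTION = {"joy": "smile", "sadness": "sad"}

def expression_for(text: str) -> str:
    """Estimate the emotion of a narration sentence and return the associated avatar expression."""
    return EXPRESSION_FOR_EMOTION[emotion_model.predict([text])[0]]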
 As described above, the information processing device 1 according to the present embodiment generates content data in which the pitch or volume of the avatar's speech is adjusted according to the settings for the avatar's facial expression. The information processing device 1 also estimates, from the text information corresponding to the avatar's utterance, the emotion of a person who would speak that text, and sets the avatar's facial expression according to the estimated emotion. The information processing device 1 further determines characteristics of the sentences contained in the text information, such as whether they contain a specific word or keyword, and generates content data in which the avatar speaks at a pitch or volume corresponding to the determined characteristics. The information processing device 1 also stores in a database the correspondence between words that may appear in the text information and avatar facial expressions or gestures, and generates content data in which the avatar makes the corresponding facial expression or gesture when uttering such a word. In these ways, the information processing device 1 according to the present embodiment can be expected to link the avatar's facial expression displayed in the content data, the pitch and volume of its speech, its gestures, and the content of its utterances.
<Content switching>
 The content data generated by the information processing device 1 described above is so-called moving-image content that the viewer simply watches. However, the information processing device 1 according to the present embodiment may also accept information input from the viewer partway through the content, for example, and generate interactive content whose continuation switches according to the input information.
 In interactive content, for example, a test for checking proficiency is given partway through the output of a moving image such as a lecture, and the moving image output next is switched according to the test score or the like. For this purpose, the information processing device 1 accepts from the user, for example, operations for creating the test questions and their answers, and creates content that presents the questions to the viewer and accepts the viewer's answers. The information processing device 1 also accepts from the user settings such as the scoring method based on the viewer's answers and the switching conditions for switching between multiple pieces of content according to the scoring result.
 FIG. 15 is a schematic diagram explaining an example of how content switching is set up. In the example shown in FIG. 15, content switching is configured on the speech editing screen. On the speech editing screen shown, the line spoken by the avatar, "Hello. Now let's take a proficiency test.", is set as the first item of the content, and "<Proficiency check test>" is set as the second item. In the generated content data, the proficiency check test created in advance is therefore carried out after the avatar utters the first line.
 The proficiency check test carried out at this point is created in advance by the user, for example on a test content creation screen displayed separately by the information processing device 1. Although not illustrated, on the test content creation screen, when a four-choice question is set as a test, for example, the device accepts from the user the question text, the text of the four choices, and the answer indicating which choice is correct, and the information processing device 1 generates the test content based on the accepted information. The user can give the test content any name; in this example it is named "Proficiency check test". When the test contains multiple questions, the information processing device 1 may also accept from the user settings such as the points allotted to each question and the formula for calculating the total score, and generate the test content accordingly.
 On the speech editing screen shown, a content-switching condition such as "Branch if score<80 goto No.11" is set as the third item following the "<Proficiency check test>" item. In this example, the score of the proficiency check test is stored in the variable score, and the content is set to branch to its 11th item when the score is less than 80. The way the branch condition is written in FIG. 15 is only an example, and content switching may be specified in any format.
 In this example, if the score of the proficiency check test set as the third item is 80 points or higher, the content corresponding to the following fourth item is output, in which the avatar says "Then let's move on.". If the score is below 80 points, the fourth through tenth items are not output, and the content corresponding to the 11th item is output instead, in which the avatar says "We will now start the remedial course.".
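 A minimal sketch of evaluating such a branch item during playback follows. The item syntax mirrors the example above, but the parser itself is an assumption about one possible implementation, not the format actually used by the system.

import re

BRANCH_RE = re.compile(r"Branch if (\w+)\s*(<=|>=|<|>)\s*(\d+)\s+goto No\.(\d+)")

def next_item(branch_item: str, variables: dict, fallthrough: int) -> int:
    """Evaluate a branch item and return the index of the next content item to play."""
    m = BRANCH_RE.match(branch_item)
    if not m:
        return fallthrough
    name, op, threshold, target = m.group(1), m.group(2), int(m.group(3)), int(m.group(4))
    value = variables.get(name, 0)
    taken = {"<": value < threshold, ">": value > threshold,
             "<=": value <= threshold, ">=": value >= threshold}[op]
    return target if taken else fallthrough

# With a test score of 65 the playback jumps to item 11; with 90 it falls through to item 4.
print(next_item("Branch if score<80 goto No.11", {"score": 65}, fallthrough=4))   # -> 11
print(next_item("Branch if score<80 goto No.11", {"score": 90}, fallthrough=4))   # -> 4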
<Materials>
 FIGS. 16 to 44 are materials related to the information processing system according to the present embodiment.
<Summary>
 The information processing device 1 according to the present embodiment configured as above acquires presentation materials created in advance, acquires text information related to the presentation speech, accepts settings for the presenter's avatar, and generates content data in which the avatar is displayed together with the presentation materials and speaks the voice corresponding to the text information. The information processing device 1 can thereby be expected to support the user in generating content data for a presentation.
 The information processing device 1 according to the present embodiment also acquires voice information of the presenter's presentation and obtains the text information by converting the acquired voice information. The information processing device 1 can thereby be expected to reduce the user's burden of creating the text information.
 The information processing device 1 according to the present embodiment also accepts settings for the pronunciation of words contained in the text information and generates content data in which the avatar utters those words with the accepted pronunciation. In doing so, the information processing device 1 displays the text information, accepts the user's selection of a word, and displays the phonetic notation of the selected word. The information processing device 1 also accepts corrections to the displayed phonetic notation and has the avatar utter the word with the corrected phonetic notation. The information processing device 1 can thereby be expected to make it easier for the user to set the pronunciation of words.
 The information processing device 1 according to the present embodiment also accepts, for the multiple texts contained in the acquired text information, settings for the interval at which the voice corresponding to each text is output or for which avatar speaks each text, and generates content data in which the avatar speaks the texts according to the accepted settings. The information processing device 1 also accepts settings for the avatar's facial expression or gesture when speaking a text and generates content data in which the avatar speaks with the facial expression or gesture according to the accepted settings. The information processing device 1 can thereby be expected to make the user's avatar setting operations easier.
 The information processing device 1 according to the present embodiment also displays a content editing screen that includes a content editing area 112 (first area) in which the avatar is displayed superimposed on a background image based on the presentation materials, an avatar setting area 113 (second area) that displays setting items for the avatar shown in the content editing area 112, and a background image selection area 111 (third area) that displays the multiple images contained in the presentation materials. The avatar setting area 113 provides, for example, setting items for the gestures the avatar performs, the avatar's position, the avatar's orientation, and the avatar's size. The avatar setting area 113 may also provide setting items such as the pitch, speed, depth, or volume of the voice spoken by the avatar. The information processing device 1 can thereby be expected to make it easier for the user to configure the background image based on the presentation materials and the avatar.
 The information processing device 1 according to the present embodiment also accepts input of a caption character string to be displayed together with the background image based on the presentation materials, and generates content data in which the accepted character string is displayed together with that background image. The information processing device 1 can thereby generate content data to which various information is added in addition to the presentation materials and the avatar's speech.
 The screen configurations and the like shown in FIGS. 5 to 9 in the present embodiment are merely examples and are not limiting. The texts, images, and the like shown on these screens are likewise examples and are not limiting.
 The embodiments disclosed herein are illustrative in all respects and should not be considered restrictive. The scope of the present invention is indicated not by the meaning described above but by the claims, and is intended to include all modifications within the meaning and scope equivalent to the claims.
 The computer program may be executed on a single computer, or may be deployed to be executed on multiple computers located at one site or distributed across multiple sites and interconnected by a communication network.
 The matters described in the embodiments can be combined with one another. The independent and dependent claims recited in the claims can also be combined with one another in any and all combinations, regardless of how they are referenced. Furthermore, although the claims use a format in which a claim refers to two or more other claims (multi-claim format), this is not limiting; a format in which a multi-claim refers to at least one other multi-claim (multi-multi-claim) may also be used.
 1 information processing device
 11 processing unit
 11a information acquisition unit
 11b avatar data generation unit
 11c audio data generation unit
 11d background data generation unit
 11e content data generation unit
 11f display processing unit
 12 storage unit
 12a program
 12b content data storage unit
 13 communication unit
 14 display unit
 15 operation unit
 99 recording medium
 101 setting table
 111 background image selection area
 112 content editing area
 113 avatar setting area
 114 caption setting area

Claims (23)

  1.  A computer program causing a computer to execute a process of:
     acquiring a presentation material;
     acquiring text information related to presentation speech;
     accepting a setting related to an avatar of a presenter; and
     generating content data in which the avatar is displayed together with the presentation material and the avatar utters a voice corresponding to the text information.
  2.  The computer program according to claim 1, wherein voice information related to the presenter's presentation is acquired, and the text information is acquired by converting the voice information.
  3.  The computer program according to claim 1, wherein a setting related to the pronunciation of a word included in the text information is accepted, and the content data is generated in which the avatar utters the word with a pronunciation according to the accepted setting.
  4.  The computer program according to claim 3, wherein the text information is output, a selection of a word included in the output text information is accepted, and a phonetic notation of the selected word is output.
  5.  The computer program according to claim 4, wherein a correction to the phonetic notation is accepted, and the word is output as speech with the corrected phonetic notation.
  6.  The computer program according to claim 1, wherein the text information includes a plurality of utterance texts, a setting of an interval for outputting the voice corresponding to each utterance text or a setting of an avatar that utters each utterance text is accepted, and the content data is generated according to the accepted setting.
  7.  The computer program according to claim 6, wherein a setting of a facial expression or gesture of the avatar when uttering the utterance text is accepted, and the content data is generated in which the avatar utters the utterance text with a facial expression or gesture according to the accepted setting.
  8.  The computer program according to claim 1, wherein a screen is displayed that includes a first area in which the presentation material and the avatar are displayed, a second area in which setting items related to the avatar displayed in the first area are displayed, and a third area in which a plurality of images included in the presentation material are displayed as candidates to be displayed in the first area.
  9.  The computer program according to claim 8, wherein the second area displays a setting item for a gesture performed by the avatar, a setting item related to the position of the avatar, a setting item related to the orientation of the avatar, or a setting item related to the size of the avatar.
  10.  The computer program according to claim 1, wherein a setting of the pitch, speed, depth, or volume of the voice uttered by the avatar is accepted, and the content data is generated in which the avatar utters a voice according to the accepted setting.
  11.  The computer program according to claim 1, wherein input of character information to be displayed together with the presentation material is accepted, and the content data is generated in which the accepted character information is displayed together with the presentation material.
  12.  The computer program according to claim 1, wherein a text information editing screen is displayed that shows a plurality of pieces of text information uttered by the avatar arranged in chronological order in the vertical direction of the screen, and shows the pieces of text information uttered by two avatars separated to the left and right sides of the screen.
  13.  The computer program according to claim 1, wherein a setting related to a shooting position of a virtual camera with respect to an avatar arranged in a three-dimensional virtual space is accepted, and the content data is generated including an image of the avatar captured by the virtual camera according to the accepted setting.
  14.  The computer program according to claim 1, wherein a setting related to a region or age group is accepted, and the content data is generated in which the avatar performs a gesture according to the region or age group.
  15.  The computer program according to claim 1, wherein a setting related to a region or age group is accepted, and the content data is generated in which the avatar speaks with words, an accent, or a speed according to the region or age group.
  16.  The computer program according to claim 1, wherein the content data is generated with the left and right output levels of stereo sound adjusted according to the display position of the avatar.
  17.  The computer program according to claim 1, wherein a sound image related to the avatar's speech is set according to the display position of the avatar.
  18.  The computer program according to claim 1, wherein a setting related to the facial expression of the avatar is accepted, and the content data is generated in which the avatar speaks with a voice pitch or volume according to the facial expression.
  19.  The computer program according to claim 18, wherein an emotion is estimated based on the text information, and the facial expression of the avatar is set according to the estimated emotion.
  20.  The computer program according to claim 1, wherein a feature of a sentence included in the text information is determined, and the content data is generated in which the avatar speaks with a voice pitch or volume according to the feature.
  21.  The computer program according to claim 1, wherein a correspondence between a predetermined word and a facial expression or gesture of the avatar is stored, and the content data is generated in which the avatar makes the corresponding facial expression or gesture when uttering the predetermined word.
  22.  An information processing method in which an information processing device:
     acquires a presentation material;
     acquires text information related to presentation speech;
     accepts a setting related to an avatar of a presenter; and
     generates content data in which the avatar is displayed together with the presentation material and the avatar utters a voice corresponding to the text information.
  23.  An information processing device comprising a processing unit, wherein the processing unit:
     acquires a presentation material;
     acquires text information related to presentation speech;
     accepts a setting related to an avatar of a presenter; and
     generates content data in which the avatar is displayed together with the presentation material and the avatar utters a voice corresponding to the text information.
PCT/JP2023/007458 2022-03-01 2023-03-01 Computer program, information processing method, and information processing device WO2023167212A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2022030928 2022-03-01
JP2022-030928 2022-03-01

Publications (1)

Publication Number Publication Date
WO2023167212A1 true WO2023167212A1 (en) 2023-09-07

Family

ID=87883865

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2023/007458 WO2023167212A1 (en) 2022-03-01 2023-03-01 Computer program, information processing method, and information processing device

Country Status (1)

Country Link
WO (1) WO2023167212A1 (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH09325787A (en) * 1996-05-30 1997-12-16 Internatl Business Mach Corp <Ibm> Voice synthesizing method, voice synthesizing device, method and device for incorporating voice command in sentence
JP2008180942A (en) * 2007-01-25 2008-08-07 Xing Inc Karaoke system
WO2012147274A1 (en) * 2011-04-26 2012-11-01 Necカシオモバイルコミュニケーションズ株式会社 Input assistance device, input asssistance method, and program
JP2019179064A (en) * 2018-03-30 2019-10-17 日本放送協会 Voice synthesizing device, voice model learning device, and program therefor
JP2019208138A (en) * 2018-05-29 2019-12-05 住友電気工業株式会社 Utterance recognition device and computer program
US20200034025A1 (en) * 2018-07-26 2020-01-30 Lois Jean Brady Systems and methods for multisensory semiotic communications
WO2020095784A1 (en) * 2018-11-06 2020-05-14 日本電気株式会社 Display control device, display control method, and nontemporary computer-readable medium in which program is stored
JP2020076912A (en) * 2018-11-09 2020-05-21 稔高 小田原 Arithmetic calculation practice support device and arithmetic calculation practice support program
JP2020112895A (en) * 2019-01-08 2020-07-27 ソフトバンク株式会社 Control program of information processing apparatus, control method of information processing apparatus, and information processing apparatus
JP2021018472A (en) * 2019-07-17 2021-02-15 株式会社デンソー Information processing system


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23763469

Country of ref document: EP

Kind code of ref document: A1