WO2023017732A1 - Storytelling information creation device, storytelling robot, storytelling information creation method, and program - Google Patents

Storytelling information creation device, storytelling robot, storytelling information creation method, and program

Info

Publication number
WO2023017732A1
Authority
WO
WIPO (PCT)
Prior art keywords: robot, story, telling, image, listener
Prior art date
Application number
PCT/JP2022/028936
Other languages
French (fr)
Japanese (ja)
Inventor
Randy Gomez
Original Assignee
Honda Motor Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Honda Motor Co., Ltd.
Publication of WO2023017732A1

Classifications

    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63H - TOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
    • A63H11/00 - Self-movable toy figures
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63H - TOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
    • A63H5/00 - Musical or noise-producing devices for additional toy effects other than acoustical
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J13/00 - Controls for manipulators
    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 - Electrically-operated educational appliances
    • G09B5/06 - Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present invention relates to a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program.
  • This application claims priority based on Japanese Patent Application No. 2021-130727 filed on August 10, 2021, the content of which is incorporated herein.
  • Storytelling is an inherently social activity that allows us humans to connect with others, find meaning in life, construct socially shared meanings, transmit information from person to person, and pass knowledge on to later generations.
  • Cognitive psychologist Jerome Bruner points out that storytelling serves the dual purpose of creating meaning (making the strange familiar) and defining the individual and collective self.
  • For reading aloud, there are, for example, audiobooks read by narrators. Audiobooks are provided on media such as cassette tapes and CDs (compact discs), and are also provided through distribution services over the Internet. A document reading support device has been proposed that supports such reading by estimating an utterance style in consideration of the context (see, for example, Patent Document 1).
  • Aspects of the present invention have been made in view of the above problems, and an object thereof is to provide a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program that can adapt the story-telling to the user.
  • A story-telling information creation device includes a generation unit that uses an image of a listener, an audio signal obtained by collecting the voice of the listener, a result of detecting the posture of the listener, a robot expression routine, an image medium, and an acoustic signal to generate an image to be displayed on an image display device when story-telling is performed, an acoustic signal to be output to the image display device when the story-telling is performed, and a voice signal, facial expression data, and action data of the story-telling robot when the story-telling is performed.
  • The generation unit includes a narrator unit for advancing the narration, an agency unit that takes the expressive power of the story-telling robot into consideration, activates expressive routines, and supports multimedia, a linking unit that creates immediate actions in response to a person, such as keeping the story-telling robot's line of sight on a specific person, and an education unit that sets questions and answers.
  • the generator may be configured with a behavior tree structure.
  • The generation unit may recognize the state of the listener based on at least one of the image of the listener, the audio signal obtained by collecting the listener's voice, and the result of detecting the listener's posture, and, based on the recognition result, may change at least one of the image to be displayed on the image display device when the story-telling is performed, the acoustic signal to be output to the image display device when the story-telling is performed, and the voice signal, facial expression data, and action data of the story-telling robot when the story-telling is performed.
  • The facial expression data of the story-telling robot may be an image corresponding to the eyes and an image corresponding to the mouth, and the action data of the story-telling robot may be data relating to the motion of the portion corresponding to the neck.
  • the image display device and the story-telling robot may be arranged close to each other.
  • A story-telling robot includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display portion corresponding to the eyes, a second display portion corresponding to the mouth, a movable portion corresponding to the neck, and an action generating means that causes the story-telling robot to perform the story-telling using the image to be displayed on the image display device, the acoustic signal to be output to the image display device, and the voice signal, facial expression data, and action data of the story-telling robot.
  • A story-telling robot includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display portion corresponding to the eyes, a second display portion corresponding to the mouth, a movable portion corresponding to the neck, the story-telling information creation device according to any one of the above aspects (1) to (6), and an action generating means that causes the story-telling robot to perform the story-telling using the image to be displayed on an image display device, the acoustic signal to be output to the image display device, and the voice signal, facial expression data, and action data of the story-telling robot.
  • The story may be read aloud to the listener, or the listener may be encouraged to read aloud.
  • The story-telling robot may include a communication device that acquires human information about a person, extracts feature information about the person from the acquired human information, and communicates with the person, the communication device comprising cognitive means for recognizing actions that occur between the robot and the person or between multiple people, and learning means for multimodal learning of people's emotional interactions using the recognized actions and the extracted feature information about the person, wherein the action generating means may generate actions based on the learned emotional interaction information of the person.
  • In a story-telling information creation method, a generation unit uses an image of a listener, an audio signal obtained by collecting the listener's voice, a result of detecting the listener's posture, a robot expression routine, image media, and an acoustic signal to generate an image to be displayed on an image display device when story-telling is performed, an acoustic signal to be output to the image display device when the story-telling is performed, and a voice signal, facial expression data, and motion data of the story-telling robot when the story-telling is performed.
  • A program causes a computer to generate, using an image of a listener, an audio signal obtained by collecting the voice of the listener, a result of detecting the posture of the listener, a robot expression routine, an image medium, and an acoustic signal, an image to be displayed on an image display device when story-telling is performed, an acoustic signal to be output to the image display device when the story-telling is performed, and a voice signal, facial expression data, and action data of a story-telling robot when the story-telling is performed.
  • FIG. 1 is a diagram showing a communication example of the story-telling robot according to the embodiment.
  • FIG. 2 is a diagram showing an example of the outline of the story-telling robot according to the embodiment.
  • FIG. 3 is a diagram showing an example of story-telling by the story-telling robot while projecting a video of the story on an image display device.
  • FIG. 4 is a diagram showing a first example of an image displayed on the image display device during story-telling and of the actions and facial expressions of the story-telling robot.
  • FIG. 5 is a diagram showing a second example of an image displayed on the image display device during story-telling and of the actions and facial expressions of the story-telling robot.
  • FIG. 6 is a diagram showing an example of perceptual elements.
  • FIG. 7 is a diagram showing an implementation example of story-telling according to the embodiment.
  • FIG. 8 is a diagram showing an example of a narrator node.
  • FIG. 9 is a diagram showing an example of an agency node.
  • FIG. 10 is a diagram showing an example of a connection node.
  • FIG. 11 is a diagram showing an example of an action tree configuration for story-telling.
  • FIG. 12 is a block diagram showing a configuration example of the story-telling robot according to the embodiment.
  • FIG. 13 is a diagram showing a configuration example of the recognition generation unit according to the embodiment.
  • FIG. 14 is a flowchart showing an example of the procedure of the story-telling process of the story-telling robot according to the embodiment.
  • FIG. 15 is a diagram showing an example of a first evaluation result.
  • FIG. 16 is a diagram showing an example of a second evaluation result.
  • FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device of the embodiment.
  • FIG. 18 is a diagram showing an example of data recognized by the recognition unit according to the embodiment.
  • the story-telling robot 1 performs story-telling according to, for example, the user's state or situation (g17).
  • In the following description, the "story-telling robot" is also simply referred to as the "robot".
  • the storytelling robot 1 communicates with an individual or a plurality of people 2 .
  • Communication is mainly dialogue g11 and gesture g12 (movement). Actions are represented by images displayed on the display in addition to actual actions.
  • When the story-telling robot 1 receives an e-mail, it informs the user that the e-mail has been received and of its content (g13). Further, when a reply to the e-mail is required, the story-telling robot 1 makes a proposal g14 by asking the user, for example, whether advice is needed, and then sends a reply (g15). In addition, the story-telling robot 1 presents a weather forecast g16 for a location according to the scheduled date, time, and place in the user's schedule.
  • The story-telling robot 1 of the present embodiment generates the social ability of the robot so that an emotional connection can be formed between the robot and the person, for example according to the person's reactions or actions, and can thereby communicate with people.
  • The story-telling robot 1 can communicate with a person by sympathizing with the person on an emotional level. The story-telling robot 1 then adapts its story-telling to the situation of the person to whom the story is read aloud.
  • FIG. 2 is a diagram showing an example of the outline of the storytelling robot 1 according to the present embodiment.
  • the storytelling robot 1 has three display units 111 (111a, 111b, 111c).
  • the imaging unit 102a is attached above the display unit 111a
  • the imaging unit 102b is attached above the display unit 111b.
  • the display units 111a and 111b correspond to the human eye and present images and image information corresponding to the human eye.
  • the screen size of the display units 111a and 111b is, for example, 3 inches.
  • the speaker 112 is attached to the housing 120 in the vicinity of the display section 111c that displays an image corresponding to the human mouth.
  • the display unit 111c is composed of, for example, a plurality of LEDs (light emitting diodes), each of which can be addressed and can be individually turned on and off.
  • the sound pickup unit 103 is attached to the housing 120 .
  • the storytelling robot 1 also includes a boom 121 .
  • Boom 121 is movably attached to housing 120 via movable portion 131 .
  • a horizontal bar 122 is rotatably attached to the boom 121 via a movable portion 132 .
  • A display portion 111a is rotatably attached to the horizontal bar 122 via a movable portion 133, and a display portion 111b is rotatably attached to the horizontal bar 122 via another movable portion.
  • The outer shape of the story-telling robot 1 shown in FIG. 2 is an example, and is not limited to this.
  • The story-telling robot 1 has, for example, five degrees of freedom of motion (base rotation, neck leaning, eye stroke, eye rotation, and eye tilt), which enable expressive movements.
  • The story-telling robot 1 can then communicate via text-to-speech (TTS), animation routines (open-loop combinations of movements, sounds, and eye expressions conveying specific emotions), a projected screen, and the like.
  • FIG. 3 is a diagram showing an example of story-telling by the story-telling robot 1 while projecting the image of the story on the image display device 7 .
  • the storytelling robot 1 is placed, for example, on the table Tab and in front of the image display device 7 .
  • the story-telling robot 1 may be placed beside the image display device 7 or the like. That is, the image display device 7 and the story-telling robot 1 are arranged close to each other.
  • the image displayed on the image display device 7 may be a still image or a moving image.
  • The video also includes music that matches the images.
  • the image display device 7 has a speaker, and reproduces music included in the video from the speaker.
  • FIGS. 4 and 5 are diagrams showing images displayed on the image display device 7 during story-telling according to the present embodiment, and examples of actions and facial expressions of the story-telling robot 1.
  • In FIG. 4, an image g201 is a first example of an image displayed on the image display device 7, and images g203 and g205 are examples of the motion and facial expression of the story-telling robot 1 during story-telling.
  • The image g203 shows the action and facial expression of being surprised, and the image g205 shows the action and facial expression of being disappointed.
  • In FIG. 5, an image g211 is a second example of an image displayed on the image display device 7, and images g213 and g215 are examples of the actions and facial expressions of the story-telling robot 1 during story-telling.
  • The image g213 shows the action and facial expression when happy, and the image g215 shows the action and facial expression when amused.
  • During story-telling, the story-telling robot 1 moves the parts corresponding to the neck and the eyes according to the content of the story and the state of the user Hu, and changes its action and facial expression by changing the images on the display parts corresponding to the eyes and the mouth.
  • the images shown in FIGS. 4 and 5 and the facial expressions and actions of the storytelling robot 1 are examples, and are not limited to these.
  • The story-telling robot 1 may also emit exclamations such as "Oh!".
  • facial expressions and actions of the storytelling robot 1 are designed to enhance persuasiveness and interactivity, and reflect a realistic repertoire of gestures that a human narrator would make.
  • facial expressions include, for example, eye movements, facial expressions of eyes, mouth facial expressions due to the shape of the mouth, and the like.
  • Actions include, for example, neck movements, eye movements, etc., as described above.
  • the story-telling robot 1 recognizes the environment using sight and hearing in addition to the five degrees of freedom described above.
  • the story-telling robot 1 acquires visual information using the data captured by the imaging unit 102 included in the story-telling robot 1 .
  • the imaging unit 102 includes, for example, an RGB camera and a depth sensor or a distance sensor.
  • the storytelling robot 1 acquires auditory information using an acoustic signal picked up by the sound pickup unit 103 of the storytelling robot 1 .
  • the sound pickup unit 103 is a microphone array including a plurality of microphones.
  • FIG. 6 is a diagram illustrating an example of perceptual elements.
  • the storytelling robot 1 perceives the voice direction, voice recognition, the listener (user), the direction of the listener's (user's) head, the direction of the listener's body, the position of the hands, etc., as in the image g301.
  • An image g304 is an image captured by the capturing unit 102 provided in the storytelling robot 1, and is a viewpoint seen from the robot.
  • Image g302 is an image obtained by extracting an image including the listener's face from image g304.
  • An image g303 is an image resulting from performing face authentication on the image g302 by a well-known technique.
  • the listener's name is Fred, and 2 is assigned as identification information (ID).
  • a face recognition unit 141 performs face recognition
  • a gesture recognition unit 142 performs gesture recognition
  • a voice recognition unit 143 performs voice recognition, estimation of a sound source direction, and the like.
  • the story-telling robot 1 acquires position information using the acquired information captured by the imaging unit 102 .
  • the storytelling robot 1 then obtains the poses of the different human parts (limbs, body, head), for example, using well-known skeleton detection (see, for example, Reference 1).
  • the storytelling robot 1 detects various gestures such as a hand waving gesture and a pointing gesture by a well-known gesture detection method (for example, Japanese Patent Application Laid-Open No. 2021-031630), Estimate the pointing direction of a person's hand.
  • the storytelling robot 1 can identify the face by comparing with the information registered in the third database 150, and can also identify facial features such as a smile when the person is nearby.
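  • As one illustration (not taken from the patent itself), such a comparison against the entries registered in the third database 150 could be implemented by matching face embeddings; the embedding extractor, vector size, and acceptance threshold below are assumptions made for the sketch.

```python
# Hypothetical sketch: identifying a face by comparing an embedding vector
# against entries registered in a database of names, IDs, and embeddings.
# The 128-dimensional embedding and the 0.6 threshold are illustrative
# assumptions, not details taken from the patent.
import numpy as np

def identify_face(query_embedding, database):
    """Return (id, name) of the closest registered face, or (None, None) if no match."""
    best_id, best_name, best_score = None, None, -1.0
    for person_id, entry in database.items():
        ref = entry["embedding"]
        # Cosine similarity between the query face and a registered face.
        score = float(np.dot(query_embedding, ref) /
                      (np.linalg.norm(query_embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_id, best_name, best_score = person_id, entry["name"], score
    if best_score < 0.6:          # assumed acceptance threshold
        return None, None
    return best_id, best_name

# Example: the listener "Fred" registered with ID 2, as in FIG. 6.
db = {2: {"name": "Fred", "embedding": np.random.rand(128)}}
print(identify_face(db[2]["embedding"], db))   # -> (2, 'Fred')
```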
  • the storytelling robot 1 identifies the sound source direction and separates the sound sources from the sound signal collected by the sound collection unit 103 using a technique such as beamforming. Note that the storytelling robot 1 may perform processing such as noise suppression and utterance section estimation on the collected sound signal. Further, the story-telling robot 1 converts the separated speech whose speaker is specified into text and understands the language. Thereby, the story-telling robot 1 obtains auditory information.
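  • For illustration, the following is a minimal delay-and-sum beamforming sketch showing one way a sound source direction could be estimated from a multi-channel microphone-array signal; the linear array geometry, microphone spacing, and sampling rate are assumed values, not details from the patent.

```python
# Illustrative delay-and-sum beamformer for a linear microphone array.
# The array geometry and sampling rate are assumptions for the sketch.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
FS = 16000               # sampling rate [Hz], assumed
MIC_SPACING = 0.04       # spacing between adjacent microphones [m], assumed

def steer_delays(angle_deg, num_mics):
    """Per-microphone delays (in samples) for a plane wave arriving from angle_deg."""
    angle = np.deg2rad(angle_deg)
    positions = np.arange(num_mics) * MIC_SPACING
    return positions * np.sin(angle) / SPEED_OF_SOUND * FS

def delay_and_sum_power(frames, angle_deg):
    """Output power of the beamformer steered to angle_deg.
    frames: array of shape (num_mics, num_samples)."""
    num_mics, _ = frames.shape
    delays = steer_delays(angle_deg, num_mics)
    shifted = [np.roll(ch, -int(round(d))) for ch, d in zip(frames, delays)]
    beam = np.mean(shifted, axis=0)
    return float(np.mean(beam ** 2))

def estimate_direction(frames, candidates=range(-90, 91, 5)):
    """Scan candidate angles and return the one with maximum beamformer power."""
    return max(candidates, key=lambda a: delay_and_sum_power(frames, a))

# Example with synthetic 4-channel frames.
frames = np.random.randn(4, 1024)
print(estimate_direction(frames))
```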
  • The role of the narrator: the narrator tells the story and sets the point of view.
  • In storytelling, the narrator speaks live to tell the story to the listener, and reading aloud can improve story comprehension and attention.
  • When telling a story by voice, especially in order to attract children, it is necessary to aim for rich expressiveness and to speak according to the situation. For example, when parents tell stories to their children, they often "act out" the characters and the story in their own words, exaggerating, varying, and expressively conveying the words.
  • The expressiveness of the narrator's voice contains social and emotional cues that call for attention and help draw the audience into the characters and the world of the story. In a live performance, the narrator uses direct gaze and gestures to highlight key points in the story, directing attention and interest.
  • An example of the content 500 (FIG. 7) used for story-telling will be described.
  • the content 500 used in the embodiments was designed by the author to, for example, support audience agency and engagement, facilitate learning of the educational value of stories, and support the role of storyteller. These components consist of narrative, animated projections, sound, and robot performance and interaction with the audience.
  • the stories were created in collaboration with writers and educators.
  • the creative director collaborated with designers, illustrators, animators, voice artists, and sound designers to create visuals, character performances, and sounds that projected story elements.
  • The content enhances the robot's persuasiveness and its empathy with the characters in the story.
  • a gesture is an action that embodies the meaning associated with speech. For example, when speaking a positive opinion, the story-telling robot 1 realizes a gesture by “nodding”.
  • An emblem is an idiomatic sign with an agreed upon meaning, such as a smile to express happiness. Therefore, the storytelling robot 1 realizes an emblem by making a smiling expression to express happiness.
  • a pantomime is a gesture or series of gestures that tells a story, usually without words.
  • The robot can mimic and reflect the sadness of a character in the story by exaggeratedly bending forward or lowering its eyes.
  • the pantomime was designed with particular attention to eye movements.
  • To maintain a sense of active listening and engagement with the robot, a library of eye movements that are deployed during emotional routines was designed and stored in the third database 150 of the story-telling robot 1. These consist of subtly distributed saccadic movements.
  • eye-tracking is used to change the line-of-sight direction to support interaction and engagement with the listener.
  • Because the creative story-telling content is designed to be experienced through interaction between the robot and the listener, narrated reading, and a live experience using the image display device 7, it is important to define the role of the robot in the story-telling. Therefore, in the embodiment, creative content such as the image display device 7 and robot gesture performance is used to overcome the limitations and problems of speech transmission by a speech synthesis system.
  • the first role uses the human narrator's voice as the storytelling element of the project.
  • The robot then takes on the role of facilitator of the story, combining gestures, sympathetic reactions, and simple questions and answers according to the story to interact with the listener.
  • the audio signal for this role is a recording of a human narrator's vocal performance when telling a story expressively.
  • the robot's speech synthesis system then combined it with appropriate gestural movements for question-answer interactions.
  • the second role is that of the robot as a narrator. It combines a robot speech synthesis system for both storytelling and question-answering, as well as robot gestures.
  • The framework needs to be able to define the robot's behavior and access the perceptual results by combining different actuation components and modalities. It is preferable that such a "choreographer of storytelling", in other words the programming performed on the robot to achieve this purpose, can be programmed easily even by a person who does not specialize in robotics engineering.
  • the framework should be versatile enough to consider different narratives, and preferably reuse elements to create new narratives and actions.
  • Behavior trees were adopted as the basis for this "choreographer of storytelling".
  • An action tree is a method for structuring the flow and control of multiple tasks in a decision-making application.
  • An action tree models actions as a hierarchical tree composed of nodes. The tree is traversed from top to bottom at a constant rate according to well-defined rules, executing tasks and commands encountered along the way. Note that the status of tasks and commands is reported up the chain, and the flow changes accordingly.
  • Nodes are classified according to their function, for example, as follows. I. Composite: controls the flow of the tree itself; these nodes resemble control structures in traditional programming languages. II. Decorator: processes or modifies the status received from its child. III. Leaf: where the actual tasks are performed; these contain the atomic tasks or other functionality that the robot can perform, and therefore cannot have children.
  • The action tree naturally separates the logic from the actual tasks. When developing a tree, only the leaf nodes need to be implemented; the flow can be defined later and continually rearranged to create new story-telling actions or to extend what is already being done.
  • An important advantage of action trees is composability and reusability, owing to the tree's hierarchical nature; a minimal sketch follows.
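  • As an illustration only (not the implementation used in the embodiment), the following Python sketch shows the Composite/Decorator/Leaf structure and the top-down tick with status reporting described above.

```python
# Minimal behavior-tree sketch: composite nodes control the flow, a decorator
# modifies a child's status, leaf nodes perform atomic tasks, and the tree is
# ticked from the top, with each node reporting a status back up the chain.
from enum import Enum

class Status(Enum):
    SUCCESS = 0
    FAILURE = 1
    RUNNING = 2

class Leaf:
    """Leaf node: wraps an atomic task the robot can perform."""
    def __init__(self, name, task):
        self.name, self.task = name, task
    def tick(self):
        return self.task()

class Sequence:
    """Composite node: runs children in order, stops at the first non-success."""
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Inverter:
    """Decorator node: swaps the SUCCESS/FAILURE status received from its child."""
    def __init__(self, child):
        self.child = child
    def tick(self):
        status = self.child.tick()
        if status == Status.SUCCESS:
            return Status.FAILURE
        if status == Status.FAILURE:
            return Status.SUCCESS
        return status

# Example tick: a leaf that always succeeds, inside a sequence.
tree = Sequence(Leaf("say_hello", lambda: Status.SUCCESS))
print(tree.tick())   # Status.SUCCESS
```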
  • FIG. 7 is a diagram showing an implementation example of storytelling according to the present embodiment.
  • The content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (recorded audio, prompts for speech synthesis, etc.) 503.
  • The generation unit 144 (FIG. 12; story-telling information creation device) includes, for example, a narrator/agency unit 1441, a linking unit 1442, and an education unit 1443.
  • The generation unit 144 uses the input from the sensors (the imaging unit 102, the sound pickup unit 103, and the sensor 104 (FIG. 12)) and the content 500 to generate information to be output to the story-telling robot 1 and information to be output to the image display device 7.
  • The generation unit 144 may be provided in the story-telling robot 1, or another device may include the generation unit 144.
  • These nodes can be used, for example, in a GUI (Graphical User Interface) so that, when creating new stories, developers can "drag and drop" them and combine them with Composite and Decorator nodes to create new story-telling applications.
  • Examples of predefined nodes used in the action tree for story-telling are shown in FIGS. 8 to 11.
  • FIG. 8 is a diagram showing an example of a narrator node. The narrator node advances the story.
  • FIG. 9 is a diagram showing an example of an agency node. Agency nodes provide multimedia and presentation support.
  • FIG. 10 is a diagram showing an example of a connection node. The connection node provides a closed-loop immediate reaction of the robot.
  • FIG. 11 is a diagram showing an example of an action tree configuration for storytelling.
  • The recognition generation unit 140 can recognize the reaction of the listener based on at least one of an image of the listener, an audio signal obtained by picking up the listener's voice, and a result of detecting the listener's posture. Note that the structures, connection relationships, nodes, and the like shown in FIGS. 8 to 11 are examples, and are not limited to these.
  • Narrator unit (1441): various blocks are created to advance the narration.
  • AudioPlay: allows the narrator to play pre-recorded audio and controls various aspects of the playback.
  • SpeakTTS (speech synthesis): using a text-to-speech engine, the robot utters a programmed sentence. This enables the story-telling robot 1 to react according to the situation during story-telling. In addition, the prosody of the speech can be controlled, for example, using tags.
  • Agency: contains multiple blocks (e.g., ExecuteRoutine, ProjectorImage, and ProjectorVideo) that take into account the expressiveness of the robot, launch expressive routines, and support multimedia.
  • ExecuteRoutine allows execution of a predefined robot expression routine. These routines are open-loop combinations of all the robot's behavioral modalities, including movements, sounds, eyes and mouths, to express various emotions such as joy and romance.
  • ProjectorImage and ProjectorVideo use the image display device 7 to display still images and animations, respectively.
  • Linking unit (1442): comprises blocks (e.g., TrackPerson) that create immediate actions in response to people, such as keeping the robot's line of sight on a particular person.
  • TrackPerson tracks detected persons using perceptual functions (implemented as an additional block, e.g., GetPeople, which returns information about all persons in the field of view). It can be combined with other blocks (e.g., GetClosestPerson) to find and track the closest person or to consider other proximity conditions.
  • Education unit (1443): comprises several blocks (e.g., AskQuestion, ListensPerson, GetASR) related to setting questions and answers.
  • AskQuestion asks a question using speech synthesis and sets the expected answer in the action tree memory.
  • ListensPerson indicates which person to listen to when there is a group of people.
  • GetASR obtains recognized speech from a specified person. This can be used to compare the expected answer to a yes-or-no question, or to have the robot respond appropriately based on the result. A sketch that composes these blocks into a story-telling tree is shown below.
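  • The following sketch is one possible composition of the blocks named above into a story-telling scene, reusing the Status/Leaf/Sequence classes from the earlier behavior-tree sketch. The placeholder functions and constructor arguments are assumptions made for illustration; the patent defines the blocks only functionally.

```python
# Illustrative composition of the predefined blocks (AudioPlay, ExecuteRoutine,
# ProjectorImage, TrackPerson, AskQuestion, GetASR) into one scene of a story.
# Reuses Status, Leaf, and Sequence from the previous sketch.  The placeholder
# functions stand in for the robot's actual actuation and perception calls.

def _placeholder(*_args, **_kwargs):
    # Stand-in for playing audio, running a routine, projecting an image,
    # tracking a person, asking a question, or reading the ASR result.
    return Status.SUCCESS

play_audio = run_routine = project_image = track_person = _placeholder
ask_question = get_asr = _placeholder

def block(name, fn, *args, **kwargs):
    """Wrap one predefined block as a leaf node of the tree."""
    return Leaf(name, lambda: fn(*args, **kwargs))

# One scene: show the picture, keep eye contact, narrate, react with an
# expression routine, then quiz the listener and check the answer.
scene = Sequence(
    block("ProjectorImage", project_image, "scene_01.png"),
    block("TrackPerson", track_person),
    block("AudioPlay", play_audio, "scene_01_narration.wav"),
    block("ExecuteRoutine", run_routine, "surprised"),
    block("AskQuestion", ask_question, "Did the fox find the grapes?", expected="no"),
    block("GetASR", get_asr, expected="no"),
)
print(scene.tick())   # Status.SUCCESS with the placeholder implementations
```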
  • FIG. 12 is a block diagram showing a configuration example of the storytelling robot 1 according to this embodiment.
  • The story-telling robot 1 includes a receiving unit 101, an imaging unit 102, a sound pickup unit 103, a sensor 104, a communication device 100, a storage unit 106, a first database 107, a second database 109, a display unit 111, a speaker 112, an actuator 113, a transmitter 114, a third database 150, and content 500.
  • the communication device 100 includes a recognition section 105 (cognition means), a learning section 108 (learning means), an action generation section 110 (action generation means), and a recognition generation section 140 .
  • the motion generating section 110 includes an image generating section 1101 , a sound generating section 1102 , a driving section 1103 and a transmission information generating section 1104 .
  • the receiving unit 101 acquires information (e.g., e-mail, blog information, news, weather forecast, etc.) from, for example, the Internet via a network, and outputs the acquired information to the recognition unit 105 and the action generation unit 110.
  • the receiving unit 101 acquires information from the first database 107 on the cloud and outputs the acquired information to the recognition unit 105 .
  • the imaging unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) imaging device, a CCD (Charge Coupled Device) imaging device, or the like.
  • the imaging unit 102 also includes a depth sensor.
  • the photographing unit 102 outputs photographed images (human information, which is information about a person; still images, continuous still images, moving images) and depth information to the recognition unit 105 and the action generation unit 110 .
  • the storytelling robot 1 may include a plurality of imaging units 102 . In this case, the imaging unit 102 may be attached to the front and rear of the housing of the storytelling robot 1, for example. Note that the imaging unit 102 may be a distance sensor.
  • the sound pickup unit 103 is, for example, a microphone array composed of a plurality of microphones.
  • the sound pickup unit 103 outputs acoustic signals (human information) picked up by a plurality of microphones to the recognition unit 105 and the action generation unit 110 .
  • the sound pickup unit 103 may sample each sound signal picked up by the microphone using the same sampling signal, convert the analog signal into a digital signal, and then output the signal to the recognition unit 105 .
  • The sensor 104 includes, for example, a temperature sensor that detects the ambient temperature, an illuminance sensor that detects the ambient illuminance, a gyro sensor that detects the tilt of the housing of the story-telling robot 1, an acceleration sensor that detects the movement of the housing of the story-telling robot 1, an atmospheric pressure sensor that detects the atmospheric pressure, and the like.
  • the sensor 104 outputs the detected value to the recognition unit 105 and the motion generation unit 110 . Note that the depth sensor may be included in the sensor 104 .
  • the storage unit 106 stores, for example, items to be recognized by the recognition unit 105, various values (threshold values, constants) used for recognition, algorithms for recognition, and the like.
  • The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used for speech recognition, as well as a comparison image database and image features used for image recognition. Each kind of data and feature amount will be described later. Note that the first database 107 may be placed on the cloud or may be connected via a network.
  • the second database 109 stores data related to relationships between people, such as social components, social norms, social customs, psychology, and humanities, which are used during learning. Note that the second database 109 may be placed on the cloud or may be connected via a network.
  • The communication device 100 recognizes an action that occurs between the story-telling robot 1 and a person, or an action that occurs between multiple people, and learns human emotional interactions based on the recognized content and the data stored in the second database 109. The communication device 100 then generates the social ability of the story-telling robot 1 from the learned content.
  • the social ability is, for example, the ability to interact between people, such as dialogue, behavior, understanding, empathy, etc. between people.
  • the recognition unit 105 recognizes an action that occurs between the storytelling robot 1 and a person, or an action that occurs between multiple people.
  • the recognizing unit 105 acquires the image captured by the capturing unit 102 , the acoustic signal collected by the sound collecting unit 103 , and the detection value detected by the sensor 104 .
  • the recognition unit 105 may acquire information received by the reception unit 101 .
  • Based on the acquired information and the data stored in the first database 107, the recognition unit 105 recognizes an action that occurs between the story-telling robot 1 and a person, or an action that occurs between multiple persons. The recognition method will be described later.
  • the recognizing unit 105 outputs the recognized recognition result (feature amount related to sound, feature information related to human behavior) to the learning unit 108 .
  • the recognition unit 105 performs well-known image processing (for example, binarization processing, edge detection processing, clustering processing, image feature amount extraction processing, etc.) on the image captured by the imaging unit 102 .
  • the recognition unit 105 performs well-known speech recognition processing (sound source identification processing, sound source localization processing, noise suppression processing, speech section detection processing, sound source extraction processing, acoustic feature amount calculation processing, etc.) on the acquired acoustic signal.
  • The recognition unit 105 extracts the acoustic signal (or voice signal) of a target person, animal, or object from the acquired acoustic signal based on the recognition result, and outputs the extracted signal to the motion generation unit 110 as a recognition result.
  • the recognition unit 105 extracts an image of a target person or object from the acquired image based on the recognition result, and outputs the extracted image to the action generation unit 110 as a recognition result.
  • the learning unit 108 uses the recognition results output by the recognition unit 105 and the data stored in the second database 109 to learn human emotional interactions.
  • a learning unit 108 stores a model generated by learning. The learning method will be described later.
  • The recognition generation unit 140 recognizes the reaction of the listener based on at least one of an image of the listener, an audio signal obtained by picking up the voice of the listener, and a result of detecting the posture of the listener, and outputs information based on the recognition result to the motion generator 110 and the image display device 7.
  • a configuration example and an operation example of the recognition generation unit 140 will be described later with reference to FIG. 13 .
  • the motion generation unit 110 generates facial expression information and motion information of the robot based on the information generated by the generation unit 144 (FIG. 13) included in the recognition generation unit 140 when reading aloud.
  • the motion generation unit 110 acquires information received from the reception unit 101 , images captured by the imaging unit 102 , acoustic signals collected by the sound collection unit 103 , and recognition results from the recognition unit 105 .
  • the action generation unit 110 generates actions (utterances, gestures, images) for the user based on the learned result and the acquired information.
  • the image generation unit 1101 generates an output image (still image, continuous still images, or moving image) to be displayed on the display unit 111 based on the learned result and the acquired information, and outputs the generated output image. Displayed on the display unit 111 .
  • the action generating unit 110 causes the display unit 111 to display an animation such as a facial expression, present an image to be presented to the user, and communicate with the user.
  • The displayed images include images corresponding to the movements of a person's eyes, images corresponding to the movements of a person's mouth, information related to the user's destination (maps, weather maps, weather forecasts, information on shops and resorts, and so on), and an image of a person calling the user via the Internet line.
  • the audio generation unit 1102 generates an output audio signal to be output to the speaker 112 based on the learned result and the acquired information, and causes the speaker 112 to output the generated output audio signal.
  • the action generator 110 causes the speaker 112 to output an audio signal to communicate with the user.
  • the voice signals to be output are voice signals assigned to the storytelling robot 1, voice signals of a person calling the user via the Internet line, and the like.
  • the drive unit 1103 generates a drive signal for driving the actuator 113 based on the learned result and the acquired information, and drives the actuator 113 with the generated drive signal.
  • the motion generating unit 110 controls the motion of the storytelling robot 1 to express emotions and communicate with the user.
  • Based on the learned result and the acquired information, the transmission information generation unit 1104 generates transmission information (an audio signal, an image) and transmits the generated transmission information from the transmission unit 114.
  • the display unit 111 is a liquid crystal image display device, an organic EL (Electro Luminescence) image display device, or the like. Display unit 111 displays an output image output by image generation unit 1101 of communication device 100 .
  • the speaker 112 outputs the output audio signal output by the audio generation unit 1102 .
  • the actuator 113 drives the action section according to the drive signal output by the drive section 1103 .
  • the transmission unit 114 transmits the transmission information output by the transmission information generation unit 1104 to the transmission destination via the network.
  • The third database 150 associates names and identification information with listener facial images and stores them, and also stores the library of eye movements deployed during emotional routines.
  • the third database 150 also stores data used for gesture recognition, language models used for speech recognition, and the like. Note that the third database 150 may be placed on the cloud or may be connected via a network.
  • The content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (recorded audio, prompts for speech synthesis, etc.) 503.
  • the content 500 may be placed on the cloud or may be connected via a network.
  • the story-telling robot 1 communicates with an individual or a plurality of people 2 using, for example, the method described in Japanese Patent Application No. 2020-108946.
  • FIG. 13 is a diagram showing a configuration example of the recognition generation unit 140 according to this embodiment.
  • the recognition generation unit 140 includes, for example, a face recognition unit 141, a gesture recognition unit 142, a voice recognition unit 143, and a generation unit 144, as shown in FIG.
  • the generation unit 144 also includes, for example, a narrator/agency unit 1441, a connection unit 1442, and an education unit 1443, as described above.
  • the recognition generation unit 140 performs face recognition, gesture recognition, and voice recognition.
  • the recognition generation unit 140 generates information to be output to the action generation unit 110 and information to be output to the image display device 7 using inputs from the sensors (image capturing unit 102, sound collection unit 103, sensor 104) and the content 500. do.
  • the face recognition unit 141 refers to the data stored in the third database 150 and uses a well-known method to recognize the face of the person included in the captured image. When the data is not stored in the third database 150, the face recognition unit 141 adds a name and identification information to the recognized face image and stores it in the third database.
  • The gesture recognition unit 142 refers to the data stored in the third database 150 and, using a known technique, detects and tracks the position and tilt of the head, the body orientation, and the hand positions of the person included in the captured image.
  • the speech recognition unit 143 refers to the data stored in the third database 150 and uses well-known techniques to perform processing such as sound source localization, sound source separation, noise suppression, speech segment detection, and speaker identification.
  • The narrator/agency unit 1441 is a trained model that was trained using, as teacher data, information detected by the sensors during reading, such as the human narrator's body and facial expressions, proximity movements, line of sight, eye saccades, way of speaking, tone of voice, volume, and the like. At the time of story-telling, the narrator/agency unit 1441 receives the information detected by the sensors and outputs, for example, how to move the neck, the line of sight, the movements of the eyes and mouth, the way of speaking, the tone of voice, the volume, and the like.
  • The linking unit 1442 is a trained model that was trained using, as teacher data, the information detected by the sensors during reading and the information output by the narrator/agency unit 1441, for example, actions with immediacy that respond to the actions of the listener. At the time of story-telling, the linking unit 1442 receives at least one of the information detected by the sensors and the information output by the narrator/agency unit 1441, and outputs, for example, immediate actions according to the actions of the listener.
  • The education unit 1443 is a trained model that was trained using, as teacher data, the information detected by the sensors during reading and the information output by the linking unit 1442. At the time of story-telling, the education unit 1443 receives at least one of the information detected by the sensors and the information output by the linking unit 1442, and outputs, for example, information that encourages positive values and two-way conversation between the narrator and the listener.
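  • As a rough illustration of how these three learned models could be cascaded at run time, the following sketch wires the sensor information through the narrator/agency, linking, and education stages. The interfaces and field names are assumptions; the patent describes the units only in terms of their inputs and outputs.

```python
# Hedged sketch of one possible cascade of the generation unit 144's learned
# models (narrator/agency unit 1441, linking unit 1442, education unit 1443).
# The dataclass fields and model interfaces are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class SensorInfo:
    image: object = None      # frame from the imaging unit 102
    audio: object = None      # signal from the sound pickup unit 103
    posture: object = None    # detection result from the sensor 104

@dataclass
class RobotOutputs:
    gaze: str = "center"
    neck_motion: str = "idle"
    eye_expression: str = "neutral"
    speech_style: dict = field(default_factory=dict)
    immediate_action: str = ""
    prompt: str = ""

def generate_step(sensors, narrator_agency, linking, education):
    out = RobotOutputs()
    # 1441: expressive delivery (neck, gaze, eyes/mouth, way of speaking, tone, volume).
    delivery = narrator_agency(sensors)
    out.neck_motion = delivery.get("neck", out.neck_motion)
    out.gaze = delivery.get("gaze", out.gaze)
    out.eye_expression = delivery.get("eyes", out.eye_expression)
    out.speech_style = delivery.get("speech", out.speech_style)
    # 1442: immediate reactions to the listener, given the sensors and 1441's output.
    out.immediate_action = linking(sensors, delivery)
    # 1443: prompts that encourage two-way conversation, given 1442's output.
    out.prompt = education(sensors, out.immediate_action)
    return out
```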
  • the first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and a comparison image database.
  • the language model database stores language models.
  • a language model is a probabilistic model that gives a probability that an arbitrary character string is a Japanese sentence or the like.
  • the language model is, for example, an N-gram model, a hidden Markov model, a maximum entropy model, or the like.
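  • For illustration, the following toy bigram (N = 2) model shows how such a probability can be assigned to a token sequence; the miniature corpus and the add-one smoothing are illustrative assumptions, not details from the patent.

```python
# Tiny bigram language-model sketch: estimates the probability that a token
# sequence forms a sentence from counts, with add-one smoothing.
from collections import Counter

corpus = [["the", "robot", "tells", "a", "story"],
          ["the", "robot", "reads", "a", "story"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab = len(unigrams)

def sentence_probability(tokens):
    """P(sentence) = product of smoothed bigram probabilities."""
    tokens = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return p

print(sentence_probability(["the", "robot", "tells", "a", "story"]))
```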
  • the acoustic model database stores sound source models.
  • a sound source model is a model used to identify a sound source of a collected acoustic signal.
  • An acoustic feature amount is a feature amount calculated after transforming a collected sound signal into a signal in the frequency domain by performing a Fast Fourier Transform.
  • Acoustic features, for example a static Mel-Scale Log Spectrum (MSLS), a delta MSLS, and one delta power, are calculated every predetermined time (e.g., 10 ms).
  • MSLS is obtained by using a spectral feature amount as a feature amount for acoustic recognition and performing inverse discrete cosine transform on MFCC (Mel Frequency Cepstrum Coefficient).
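  • The following sketch computes comparable features (a static mel-scale log spectrum, its delta, and a delta power) every 10 ms; the use of librosa and the chosen parameter values are assumptions for illustration, not part of the patent.

```python
# Hedged sketch of MSLS-like acoustic features computed every 10 ms with
# librosa: static mel-scale log spectrum, delta MSLS, and one delta power.
import librosa
import numpy as np

def msls_features(path, n_mels=24, frame_ms=10):
    y, sr = librosa.load(path, sr=16000)
    hop = int(sr * frame_ms / 1000)                      # 10 ms frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=hop, n_mels=n_mels)
    static = librosa.power_to_db(mel)                    # static mel-scale log spectrum
    delta = librosa.feature.delta(static)                # delta MSLS
    power = librosa.power_to_db(np.sum(mel, axis=0, keepdims=True))
    delta_power = librosa.feature.delta(power)           # one delta power per frame
    return np.vstack([static, delta, delta_power])       # (2*n_mels + 1, num_frames)
```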
  • the dialogue corpus database stores the dialogue corpus.
  • the dialogue corpus is a corpus that is used when the storytelling robot 1 and the user have a dialogue, and is, for example, a scenario corresponding to the contents of the dialogue.
  • the comparison image database stores images used for pattern matching, for example.
  • Images used for pattern matching include, for example, images of the user, images of the user's family, images of the user's pets, and images of the user's friends and acquaintances.
  • the image feature amount is, for example, a feature amount extracted from an image of a person or an object by well-known image processing. Note that the example described above is just an example, and the first database 107 may store other data.
  • the second database 109 stores, for example, social constituent elements, social norms, data on psychology, and data on humanities.
  • the social components are, for example, age, sex, occupation, and relationships between multiple people (parents and children, couples, lovers, friends, acquaintances, co-workers, neighbors, teachers and students, etc.).
  • Social norms are rules and manners between individuals and multiple people, and are associated with speech, gestures, etc. according to age, gender, occupation, and relationships between multiple people.
  • Data related to psychology are, for example, data on findings obtained from past experiments and verifications (for example, attachment relationships between mothers and infants, complexes such as the Oedipus complex, conditioned reflexes, fetishism, etc.).
  • Data related to the humanities are, for example, data on religious rules, customs, national characteristics, regional characteristics, and characteristic acts, actions, and utterances of a country or region.
  • For example, the data may indicate that, in some regions, people express consent by nodding rather than saying it in words.
  • the humanities-related data is, for example, data on what is considered important and what is prioritized depending on the country or region.
  • FIG. 14 is a flow chart showing an example of the procedure of the story-telling process of the story-telling robot 1 according to this embodiment.
  • Step S1: The generation unit 144 generates the content to be used for story-telling.
  • the communication device 100 may acquire content generated by an external device via the receiving unit 101, for example.
  • Step S2: The recognition generation unit 140 acquires sensor information (an image captured by the imaging unit 102, an acoustic signal picked up by the sound pickup unit 103, and detection values detected by the sensor 104).
  • Step S3: The face recognition unit 141 of the recognition generation unit 140 detects an image including the listener's face from the captured image and recognizes the listener's face. If the listener's face image is not registered in the third database 150, the recognition generation unit 140 acquires the listener's name by, for example, talking to the listener and having the listener say his or her name.
  • Step S4: The generation unit 144 outputs an image to be displayed on the image display device 7 and an acoustic signal to be output by the image display device 7 based on the acquired content. Subsequently, the action generator 110 starts the story-telling using the acquired content and the listener's name.
  • Step S5: The recognition generation unit 140 again acquires sensor information (an image captured by the imaging unit 102, an acoustic signal picked up by the sound pickup unit 103, and detection values detected by the sensor 104).
  • Step S6: The face recognition unit 141 recognizes the orientation and facial expression of the listener from the captured image and the detection values detected by the sensor 104.
  • the gesture recognition unit 142 detects the listener's motion from the captured image and the detection value detected by the sensor 104 .
  • the speech recognition unit 143 performs speech recognition on the acoustic signal picked up by the sound pickup unit 103 .
  • Step S7: The generation unit 144 generates facial expression information and motion information of the story-telling robot 1 based on the acquired content and the acquired sensor information. Subsequently, based on the generated facial expression and motion information, the motion generation unit 110 generates images to be displayed on the display units 111a and 111b corresponding to the eyes of the story-telling robot 1, an image to be displayed on the display unit 111c corresponding to the mouth, an audio signal to be output from the speaker 112, and a drive signal to drive the actuator 113.
  • Step S8: The recognition generation unit 140 determines whether or not the story-telling has ended.
  • the end of the story-telling is not limited to the end of the content, and may be ended even in the middle of the content depending on the reaction of the listener. Moreover, if the content is long, the recognition generation unit 140 may end the content even in the middle of the content, depending on the reading time or the reaction of the listener. If the recognition generation unit 140 determines that the story-telling has ended (step S8; YES), it ends the process. When the recognition generation unit 140 determines that the story-telling has not ended (step S8; NO), the process returns to step S5.
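  • As a rough illustration only, the procedure of FIG. 14 can be summarized as the following control loop; the component interfaces are placeholders, since the embodiment defines the units only functionally.

```python
# Minimal sketch of the FIG. 14 procedure (steps S1 to S8) as a control loop.
# The component interfaces are assumed placeholders, not the actual API.
import time

def storytelling_loop(generation_unit, recognizers, action_generator, sensors, content):
    face, gesture, speech = recognizers

    # S1: create (or acquire) the content used for story-telling.
    story = generation_unit.prepare(content)

    # S2-S3: read the sensors and recognize the listener's face (ask the name if unknown).
    frame, audio, values = sensors.read()
    listener = face.recognize(frame) or face.register_by_asking(audio)

    # S4: start story-telling with the content and the listener's name.
    action_generator.start(story, listener)

    while True:
        # S5: acquire the sensor information again.
        frame, audio, values = sensors.read()
        # S6: recognize orientation/expression, motion, and speech of the listener.
        state = {
            "face": face.recognize_state(frame, values),
            "gesture": gesture.detect(frame, values),
            "speech": speech.recognize(audio),
        }
        # S7: generate expression/motion information and drive displays, speaker, actuator.
        expression, motion = generation_unit.step(story, state)
        action_generator.render(expression, motion)
        # S8: finish when the content ends or the listener's reaction calls for it.
        if generation_unit.finished(story, state):
            break
        time.sleep(0.1)   # tick at a fixed rate
```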
  • The processing procedure shown in FIG. 14 is an example and is not limited to this; some processes may be performed in parallel, and the order may be changed. The listener's name may also be acquired after the story-telling has started.
  • the images and sounds of the content displayed on the image display device 7 may be changed in accordance with the reaction of the listener during the reading.
  • the content may be multiple branching stories, and two or more pieces of music may be provided along with the progress of the story.
  • Such content may be generated by a creator or the like operating the generation unit 144 having the configuration shown in FIGS. 7 to 11, for example.
  • In such cases, the generation unit 144 may change at least one of the image to be displayed on the image display device 7 during the story-telling, the acoustic signal to be output by the image display device 7 during the story-telling, and the voice signal, facial expression data, and action data of the story-telling robot 1 during the story-telling.
  • FIG. 15 is a diagram showing a first evaluation result example.
  • items are defined as follows.
  • "Content": the minimum requirement for evaluation is that the content of the story is conveyed in a simple manner.
  • "Persuasiveness": the point of evaluation is whether the telling is persuasive and credible as a whole.
  • "Realism": the point of evaluation is how essential the elements contained in the story feel, and the overall sense that the story is brought to life through the delivery.
  • "Interactivity": the point of evaluation is whether the subject feels able to participate in the story-telling process.
  • The story-telling robot 1 performs not only simple movements but also meaningful actions accompanied by socially meaningful multimodal behavior. Therefore, rather than examining the causal relationship between random movements and attention, the impact of socially relevant robot communication affordances was examined.
  • FIG. 16 is a diagram showing a second evaluation result example.
  • the horizontal axis is discrete time and the vertical axis is head orientation (1 or 0).
  • Graph g401 is an example of the temporal change in listener's attention allocation when the image display device 7 and storytelling robot 1 are used in "Agency ON”.
  • a graph g411 is an example of the change over time of the listener's attention distribution when the image display device 7 and the story-telling robot 1 are used in the "agency off" mode.
  • a line g402 indicates the temporal change in listener's attention distribution when the image display device 7 and storytelling robot 1 are used in "Agency ON".
  • a line g403 is for reference and shows the time change of listener's attention allocation when only the image display device 7 is used and the story-telling robot 1 is not placed in front of the image display device 7.
  • A line g412 shows the temporal change in the listener's attention distribution when the image display device 7 and the story-telling robot 1 are used in the "agency off" state.
  • As described above, the story-telling robot 1 can perform story-telling with rich communication by performing actions, backtracking, or inserting questions. According to this embodiment, meaningful interaction between a human and a robot is realized and is well accepted by users.
  • The approach of the present embodiment maintains long-term attention and greater acceptance, and can inspire future creative content designs for social robots. Also, according to the present embodiment, using an embodied agent such as a robot for story-telling with "agency on", leveraging its communication affordances (embodiment, expressiveness, and other modalities), can better maintain the listener's interest.
  • a materialized agent such as a robot
  • the story-telling robot 1 is not limited to story-telling, and similar effects can be obtained by playing the role of a "facilitator" that supports story-telling.
  • FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device 100 of the embodiment.
  • a recognition result 201 is an example of a result recognized by the recognition unit 105 .
  • the recognition result 201 is, for example, an interpersonal relationship, an interpersonal mutual relationship, or the like.
  • the multimodal learning and understanding 211 is an example of learning content performed by the learning unit 108 .
  • the learning method 212 is machine learning or the like.
  • the learning object 213 is social constituent elements, social models, psychology, humanities, and the like.
  • Social abilities 221 are social skills such as empathy, individualization, adaptability, and emotional affordance.
  • FIG. 18 is a diagram showing an example of data recognized by the recognition unit 105 according to the embodiment.
  • personal data 301 and interpersonal relationship data 351 are recognized as shown in FIG.
  • Personal data concerns behavior that occurs in a single person: data acquired by the imaging unit 102 and the sound pickup unit 103, and data obtained by performing voice recognition processing, image recognition processing, and the like on the acquired data.
  • Personal data includes, for example, voice data, semantic data resulting from voice processing, voice volume, voice inflection, uttered words, facial expression data, gesture data, head posture data, face direction data, line-of-sight data, co-occurrence expression data, physiological information (body temperature, heart rate, pulse rate, etc.), and the like.
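As a concrete picture of how such personal data might be held in software, the sketch below defines a hypothetical container for these features; the field names and types are illustrative assumptions and do not appear in the publication.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PersonalData:
    """Hypothetical per-person feature record built from audio and images."""
    voice_volume: Optional[float] = None         # e.g. RMS level
    voice_inflection: Optional[float] = None     # e.g. pitch variance
    uttered_words: List[str] = field(default_factory=list)
    utterance_meaning: Optional[str] = None      # result of speech understanding
    facial_expression: Optional[str] = None      # e.g. "smile"
    gesture: Optional[str] = None                # e.g. "wave", "point"
    head_posture: Optional[Tuple[float, float, float]] = None  # (yaw, pitch, roll)
    face_direction: Optional[Tuple[float, float]] = None
    gaze_direction: Optional[Tuple[float, float]] = None
    body_temperature: Optional[float] = None     # physiological information
    heart_rate: Optional[float] = None
    pulse_rate: Optional[float] = None
```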
  • The data to be used may be selected, for example, by the designer of the story-telling robot 1. In this case, the designer of the story-telling robot 1 may, for example, determine which features of the personal data are important in communication from actual communication or a demonstration between two persons.
  • the recognition unit 105 recognizes the user's emotion as personal data based on the information extracted from each of the acquired speech and images. In this case, the recognition unit 105 performs the recognition based on, for example, the loudness and intonation of the voice, the utterance duration, the facial expression, and the like. The story-telling robot 1 of the embodiment then acts so as to keep the user's emotions positive and to maintain a good relationship with the user.
  • the recognition unit 105 estimates the user's nationality, hometown, etc., based on the acquired speech and image, and the data stored in the first database 107 .
  • the recognizing unit 105 extracts the user's life schedule such as wake-up time, going-out time, returning home time, bedtime, etc., based on the acquired utterances and images and the data stored in the first database 107 .
  • the recognition unit 105 estimates the user's sex, age, occupation, hobbies, career, preferences, family structure, religion, degree of attachment to the story-telling robot 1, and the like, based on the acquired utterances, images, and life schedule and the data stored in the first database 107.
  • the story-telling robot 1 updates the information about the user's social background based on the conversation, the image, and the data stored in the first database 107 .
  • the social background and the degree of attachment to the story-telling robot 1 are not limited to items that can be entered directly, such as age, gender, and career; they are also recognized from the volume and intonation of the voice, the flow of the conversation, and its topics. In this way, the recognition unit 105 learns things that the user is not aware of, based on daily conversations and facial expressions during conversations.
  • Interpersonal relationship data is data related to the relationships between the user and other people.
  • the interpersonal relationship data includes, for example, the distance between people, whether or not the eyes of the people who are having a conversation meet each other, the inflection of the voice, the loudness of the voice, and the like.
  • the distance between people varies depending on the interpersonal relationship.
  • for example, the distance for couples or friends is L1, while the distance between business associates is L2, which is larger than L1.
  • Such personal data, interpersonal relationship data, and information about the user's social background are stored in the first database 107 or the storage unit 106.
  • the recognition unit 105 collects and learns personal data for each user, and estimates the social background for each person.
  • the social background may also be obtained, for example, via a network and the receiving unit 101. In that case, the user may input his or her social background or select items using, for example, a smartphone.
  • the recognition unit 105 estimates the distance (interval) between the people communicating with each other based on the acquired utterances, images, and data stored in the first database 107 .
  • the recognition unit 105 detects whether or not the lines of sight of the people communicating with each other meet, based on the acquired utterances, images, and data stored in the first database 107.
  • based on the acquired utterances and the data stored in the first database 107, the recognition unit 105 estimates relationships such as friends, co-workers, relatives, and parent and child from the content of the utterances, the loudness of the voice, the inflection of the voice, received and transmitted e-mails, and the other parties of those e-mails.
  • the recognition unit 105 may, for example, randomly select one of several combinations of initial values of social backgrounds and personal data stored in the first database 107 and start communication with it. Then, if it is difficult to continue communication with the user because of the behavior generated from the randomly selected combination, the recognition unit 105 may reselect another combination.
  • the learning unit 108 performs learning using the personal data 301 and interpersonal relationship data 351 recognized by the recognition unit 105 and data stored in the second database 109 .
  • a relationship with a person at a distance of 0 to 50 cm is an intimate relationship
  • a relationship with a person at a distance of 50 to 1 m is a personal relationship
  • a relationship with a distance of 1 to 4 m from a person is a social relationship
  • a relationship with a distance of 4 m or more from a person is a public relationship.
  • During learning, such social norms are used as rewards (implicit rewards) based on whether gestures and utterances conform to them.
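A minimal sketch of how these proxemic zones could be turned into an implicit reward is shown below. The distance thresholds follow the list above, while the relationship-to-zone mapping and the reward values are illustrative assumptions.

```python
def proxemic_zone(distance_m: float) -> str:
    """Classify an interpersonal distance (in meters) into a proxemic zone."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 1.0:
        return "personal"
    if distance_m < 4.0:
        return "social"
    return "public"

def implicit_reward(distance_m: float, relationship: str) -> float:
    """Reward +1 when the kept distance matches the known relationship
    (a social norm), -1 otherwise. The mapping and values are illustrative."""
    expected = {"couple": "intimate", "friend": "personal",
                "business": "social", "stranger": "public"}
    target = expected.get(relationship, "social")
    return 1.0 if proxemic_zone(distance_m) == target else -1.0

print(proxemic_zone(0.8))              # "personal"
print(implicit_reward(0.8, "friend"))  # 1.0
```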
  • the interpersonal relationship may be set according to the environment of use and the user by setting the feature values of the reward during learning. Specifically, multiple intimacy settings may be defined, such as a rule not to talk to people who are uncomfortable with robots and a rule to actively talk to people who like robots. Then, in a real environment, the recognition unit 105 may recognize which type the user is from the result of processing the user's utterances and images, and the learning unit 108 may select the corresponding rule.
  • the human trainer may evaluate the behavior of the story-telling robot 1 and provide a reward (implicit reward) according to the social structure and norms that he/she knows.
  • FIG. 19 is a diagram illustrating an example of an agent creation method used by the action generator 110 according to the embodiment.
  • the area indicated by reference numeral 300 is a diagram showing the flow from input to creation of an agent and output (agent).
  • the information 310 captured by the imaging unit 102 and picked up by the sound pickup unit 103 is information about people (the user, people related to the user, and other people) and environmental information around them.
  • Raw data 302 acquired by the imaging unit 102 and the sound collecting unit 103 is input to the recognition unit 105 .
  • the recognition unit 105 extracts and recognizes a plurality of pieces of information from the input raw data 302 (voice volume, voice intonation, utterance content, uttered words, the user's line of sight, the user's head posture, the user's face orientation, the user's physiological information, the distance between people, whether or not people's lines of sight intersect, and so on).
  • the recognition unit 105 uses a plurality of pieces of extracted and recognized information to perform multimodal understanding using, for example, a neural network.
  • the recognition unit 105 identifies an individual based on, for example, at least one of an audio signal and an image, and assigns identification information (ID) to the identified individual.
  • the recognition unit 105 recognizes the motion of each identified person based on at least one of the audio signal and the image.
  • the recognition unit 105 recognizes the line of sight of the identified person by, for example, performing well-known image processing and tracking processing on the image.
  • the recognition unit 105 performs, for example, speech recognition processing (sound source identification, sound source localization, sound source separation, speech segment detection, noise suppression, etc.) on a speech signal to recognize speech.
  • the recognition unit 105 recognizes the head posture of the identified person by, for example, performing well-known image processing on the image. For example, when two people are photographed in the photographed image, the recognition unit 105 recognizes the interpersonal relationship based on the speech content, the distance between the two persons in the photographed image, and the like.
  • the recognition unit 105 recognizes (estimates) the social distance between the story-telling robot 1 and the user, for example, according to the result of processing each of the captured image and the collected sound signal.
  • the learning unit 108 performs reinforcement learning 304 rather than deep learning. The reinforcement learning involves learning to select the most relevant features (including social constructs and social norms). In this case, the multiple pieces of information used in multimodal understanding are used as input features. Inputs to the learning unit 108 are, for example, the raw data itself, the name ID (identification information), facial information, recognized gestures, keywords from the voice, and the like.
  • the output of the learning unit 108 is the behavior of the story-telling robot 1.
  • the output behavior may be anything defined according to the purpose, such as voice responses, robot routines, or the angle by which the robot should rotate.
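To make this reinforcement-learning setup more concrete, the following is a minimal tabular Q-learning sketch in which the state is a small tuple of multimodal features and the actions are robot behaviors. The feature values and action names are assumptions for illustration; the publication does not specify the learning algorithm at this level of detail.

```python
import random
from collections import defaultdict

ACTIONS = ["voice_response", "nod_routine", "smile_routine", "rotate_30_deg"]

class BehaviorAgent:
    """Minimal tabular Q-learning over discretized multimodal features."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select_action(self, state):
        # Epsilon-greedy choice between exploring and the best known action.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

# Example state: (person ID, recognized gesture, keyword from speech).
agent = BehaviorAgent()
state = ("id2", "wave", "hello")
action = agent.select_action(state)
agent.update(state, action, reward=1.0, next_state=("id2", "none", "story"))
```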
  • a neural network or the like may be used for detection. In this case, different modalities of the body may be used to detect human activity.
  • which feature to use may be selected in advance by, for example, the designer of the storytelling robot 1 .
  • by using implicit and explicit rewards during learning, social models and social constructs can be incorporated.
  • the output of the reinforcement learning is the agent 305 .
  • the agent used by the action generator 110 is created.
  • the area indicated by reference numeral 350 is a diagram showing how the reward is used.
  • Implicit rewards 362 are used to learn implicit responses.
  • the raw data 302 includes the user's reactions, and the multimodal understanding 303 of this raw data 302 is described above.
  • the learning unit 108 generates an implicit reaction system 372 using the implicit reward 362 and the social model etc. stored in the second database 109 .
  • the implicit reward may be obtained by reinforcement learning or may be given by a human.
  • the implicit reaction system may also be a model acquired through learning.
  • a human trainer evaluates the behavior of the story-telling robot 1 and gives a reward 361 according to the social structure and social norms that he/she knows.
  • the agent adopts the action that maximizes the reward for the input.
  • the agent adopts behaviors (utterances and gestures) that maximize positive feelings toward the user.
  • Learning unit 108 uses this explicit reward 361 to generate explicit reaction system 371 .
  • the explicit response system may be a model acquired through learning.
  • the explicit reward may be given by the user by evaluating the behavior of the story-telling robot 1. For example, the reward may be estimated based on whether or not the action desired by the user is taken.
  • Learning unit 108 outputs agent 305 using these learning models during operation.
  • explicit rewards, which are user reactions, are prioritized over implicit rewards. This is because the user's reaction is more reliable in communication.
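One simple way to express this priority is to let the explicit reward dominate whenever a user reaction is available and to fall back on the implicit reward otherwise; the weighting used below is an illustrative assumption.

```python
from typing import Optional

def combined_reward(implicit: float, explicit: Optional[float]) -> float:
    """Prefer the explicit (user-reaction) reward when it exists; otherwise
    fall back on the implicit (social-norm) reward."""
    if explicit is not None:
        # The user's reaction is treated as more reliable, so it dominates.
        return 0.8 * explicit + 0.2 * implicit
    return implicit

print(combined_reward(implicit=0.5, explicit=None))   # 0.5
print(combined_reward(implicit=0.5, explicit=-1.0))   # -0.7
```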
  • the learning means performs learning using an implicit reward and an explicit reward, wherein the implicit reward is a reward learned multimodally using the feature information about the person, and the explicit reward is a reward based on the result of evaluating the behavior toward the person generated by the action generating means of the communication device.
  • the story-telling robot 1 of the embodiment is provided with a sound pickup unit that picks up an acoustic signal and an imaging unit that captures an image including the user, and the recognition means performs speech recognition processing on the picked-up acoustic signal to extract feature information about the voice, and performs image processing on the captured image to extract feature information about the human behavior included in the image. The feature information about the voice is at least one of the voice signal, voice volume information, voice inflection information, and the meaning of the utterance, and the feature information about the human behavior is at least one of the person's facial expression information, gesture information, head posture information, face direction information, line-of-sight information, and the distance between people.
  • a program for realizing all or part of the functions of the story-telling robot 1 and the recognition generation unit 140 in the present invention may be recorded on a computer-readable recording medium, and all or part of the processing of the story-telling robot 1 and the recognition generation unit 140 may be performed by having a computer system read and execute the program recorded on this recording medium.
  • the "computer system” referred to here includes hardware such as an OS and peripheral devices.
  • the "computer system” includes a WWW system provided with a home page providing environment (or display environment).
  • the term “computer-readable recording medium” refers to portable media such as flexible discs, magneto-optical discs, ROMs and CD-ROMs, and storage devices such as hard discs incorporated in computer systems.
  • the term "computer-readable recording medium" also includes a volatile memory (RAM) inside a computer system acting as a server or client, which holds the program for a certain period of time when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
  • the program may be transmitted from a computer system storing this program in a storage device or the like to another computer system via a transmission medium or by transmission waves in a transmission medium.
  • the "transmission medium” for transmitting the program refers to a medium having a function of transmitting information, such as a network (communication network) such as the Internet or a communication line (communication line) such as a telephone line.
  • the program may be for realizing part of the functions described above. Further, it may be a so-called difference file (difference program) that can realize the above-described functions in combination with a program already recorded in the computer system.
  • Reference Signs List: 1 story-telling robot, 100 communication device, 101 receiving unit, 102 imaging unit, 103 sound pickup unit, 104 sensor, 105 recognition unit, 106 storage unit, 107 first database, 108 learning unit, 109 second database, 110 action generation unit, 111 display unit, 112 speaker, 113 actuator, 114 transmission unit, 140 recognition generation unit, 141 face recognition unit, 142 gesture recognition unit, 143 voice recognition unit, 144 generation unit, 150 third database, 500 content, 1101 image generator, 1102 voice generator, 1103 drive unit, 1104 transmission information generator, 1441 narrator/agency unit, 1442 engagement unit, 1443 education unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Business, Economics & Management (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Toys (AREA)

Abstract

This storytelling information creation device comprises a generation unit that generates an image to be displayed on an image display device during storytelling, an acoustic signal to be outputted to the image display device during storytelling, and a voice signal, facial expression data, and motion data for a storytelling robot during storytelling, by using a photographed image of a listener, a picked-up voice signal of the listener's voice, the result of detecting the listener's posture, an expression routine of the robot, image media, and an acoustic signal.

Description

Story-telling information creation device, story-telling robot, story-telling information creation method, and program
The present invention relates to a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program.
This application claims priority based on Japanese Patent Application No. 2021-130727 filed on August 10, 2021, the content of which is incorporated herein.
Storytelling is an inherently social activity that allows us humans to connect with others, find meaning in life, construct socially shared meanings, transmit information from person to person, and pass knowledge on to the next generation. Cognitive psychologist Jerome Bruner points out that storytelling serves the dual purpose of making meaning: making the strange familiar and defining the individual and collective self.
As one form of story-telling, there are audiobooks in which, for example, a narrator reads a book aloud. Audiobooks are provided on media such as cassette tapes and CDs (compact discs), and are also offered through distribution services over the Internet. A document read-aloud support device has been proposed that supports such reading by estimating an utterance style in consideration of the context (see, for example, Patent Document 1).
Patent Document 1: JP 2015-215626 A
However, conventional audiobooks use the same audio data every time, so users become bored when they are played repeatedly. Even when a conventional read-aloud device estimates an utterance style that takes the context into consideration, users become bored after repeated use. Furthermore, some users become bored in the middle of a single story.
Aspects of the present invention have been made in view of the above problems, and an object thereof is to provide a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program that can adapt story-telling according to the user.
In order to solve the above problems, the present invention employs the following aspects.
(1) A story-telling information creation device according to an aspect of the present invention includes a generation unit that uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, an expression routine of a robot, image media, and an acoustic signal to generate an image to be displayed on an image display device during story-telling, an acoustic signal to be output by the image display device during the story-telling, and a voice signal, facial expression data, and action data of a story-telling robot during the story-telling.
(2) In the above aspect (1), the generation unit may include a narrator unit for advancing the narration; an agency unit that takes the expressiveness of the story-telling robot into account, activates expressive routines, and supports multimedia; an engagement unit that creates immediate, person-responsive actions such as keeping the story-telling robot's gaze on a specific person; and an education unit that sets questions and answers.
(3) In the above aspect (1) or (2), the generation unit may be configured as a behavior tree (Behaviour Trees) structure.
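Aspect (3) states that the generation unit may be organized as a behavior tree. As a rough sketch of the idea (not the actual implementation of the publication), a behavior tree with sequence and fallback nodes over narrator, agency, engagement, and education behaviors could look like the following.

```python
class Node:
    def tick(self) -> bool:
        raise NotImplementedError

class Sequence(Node):
    """Succeeds only if all children succeed, ticked in order."""
    def __init__(self, *children): self.children = children
    def tick(self): return all(c.tick() for c in self.children)

class Fallback(Node):
    """Succeeds as soon as one child succeeds."""
    def __init__(self, *children): self.children = children
    def tick(self): return any(c.tick() for c in self.children)

class Action(Node):
    def __init__(self, name, fn): self.name, self.fn = name, fn
    def tick(self):
        ok = self.fn()
        print(f"{self.name}: {'success' if ok else 'failure'}")
        return ok

# Illustrative leaf behaviors (placeholders for the four roles).
narrate    = Action("narrator: speak next passage", lambda: True)
project    = Action("agency: show scene and play music", lambda: True)
keep_gaze  = Action("engagement: keep gaze on listener", lambda: True)
ask_quiz   = Action("education: ask a question", lambda: False)
react_only = Action("engagement: react with a gesture", lambda: True)

story_step = Sequence(project, narrate, keep_gaze, Fallback(ask_quiz, react_only))
story_step.tick()
```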
(4) In any one of the above aspects (1) to (3), the generation unit may recognize the state of the listener based on at least one of the image of the listener, the audio signal obtained by picking up the listener's voice, and the result of detecting the listener's posture, and, based on the recognition result, change at least one of the image to be displayed on the image display device during the story-telling, the acoustic signal to be output by the image display device during the story-telling, and the voice signal, facial expression data, and action data of the story-telling robot during the story-telling.
(5) In any one of the above aspects (1) to (4), the facial expression data of the story-telling robot may be an image corresponding to the eyes and an image corresponding to the mouth, and the action data of the story-telling robot may be data relating to the motion of the portion corresponding to the neck.
(6) In any one of the above aspects (1) to (5), the image display device and the story-telling robot may be arranged close to each other.
(7) A story-telling robot according to an aspect of the present invention includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display unit corresponding to eyes, a second display unit corresponding to a mouth, a movable part corresponding to a neck, and action generating means for performing the story-telling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the voice signal, facial expression data, and action data of the story-telling robot generated by any one of the above aspects (1) to (6).
(8) A story-telling robot according to an aspect of the present invention includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display unit corresponding to eyes, a second display unit corresponding to a mouth, a movable part corresponding to a neck, the story-telling information creation device of any one of the above aspects (1) to (6), and action generating means for performing the story-telling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the voice signal, facial expression data, and action data of the story-telling robot generated by that aspect.
(9) In the above aspect (7) or (8), the story-telling robot may perform the story-telling for the listener, or may facilitate the story-telling for the listener.
(10) In any one of the above aspects (7) to (9), the robot may further include recognition means that acquires human information about a person, extracts feature information about the person from the acquired human information, recognizes interactions occurring between the communicating device and the person, and recognizes interactions occurring between people; and learning means that multimodally learns the person's emotional interactions using the extracted feature information about the person; and the action generating means may generate behavior based on the learned emotional interaction information of the person.
(11) In a story-telling information creation method according to an aspect of the present invention, a generation unit uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, an expression routine of a robot, image media, and an acoustic signal to generate an image to be displayed on an image display device during story-telling, an acoustic signal to be output by the image display device during the story-telling, and a voice signal, facial expression data, and action data of a story-telling robot during the story-telling.
(12) A program according to an aspect of the present invention causes a computer to use an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, an expression routine of a robot, image media, and an acoustic signal to generate an image to be displayed on an image display device during story-telling, an acoustic signal to be output by the image display device during the story-telling, and a voice signal, facial expression data, and action data of a story-telling robot during the story-telling.
According to the above aspects (1) to (12), the story-telling can be adapted to the user. Further, according to the above aspects (1) to (12), the user's attention can be maintained.
FIG. 1 is a diagram showing an example of communication by the story-telling robot according to the embodiment.
FIG. 2 is a diagram showing an example of the outer shape of the story-telling robot according to the embodiment.
FIG. 3 is a diagram showing an example of story-telling by the story-telling robot while the video of the story is projected on the image display device.
FIG. 4 is a diagram showing examples of images displayed on the image display device during story-telling according to the embodiment and of the actions and facial expressions of the story-telling robot.
FIG. 5 is a diagram showing examples of images displayed on the image display device during story-telling according to the embodiment and of the actions and facial expressions of the story-telling robot.
FIG. 6 is a diagram showing an example of perceptual elements.
FIG. 7 is a diagram showing an implementation example of story-telling according to the embodiment.
FIG. 8 is a diagram showing an example of a narrator node.
FIG. 9 is a diagram showing an example of an agency node.
FIG. 10 is a diagram showing an example of an engagement node.
FIG. 11 is a diagram showing an example of a behavior tree configuration for story-telling.
FIG. 12 is a block diagram showing a configuration example of the story-telling robot according to the embodiment.
FIG. 13 is a diagram showing a configuration example of the recognition generation unit according to the embodiment.
FIG. 14 is a flowchart showing an example of the procedure of the story-telling process of the story-telling robot according to the embodiment.
FIG. 15 is a diagram showing a first evaluation result example.
FIG. 16 is a diagram showing a second evaluation result example.
FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device of the embodiment.
FIG. 18 is a diagram showing an example of data recognized by the recognition unit according to the embodiment.
FIG. 19 is a diagram showing an example of the agent creation method used by the action generation unit according to the embodiment.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings used for the following description, the scale of each member is appropriately changed so that each member has a recognizable size.
<Overview>
FIG. 1 is a diagram showing a communication example of the story-telling robot 1 according to the present embodiment. As shown in FIG. 1, the story-telling robot 1 performs story-telling according to, for example, the user's state or situation (g17). In the following description, the "story-telling robot" is also referred to simply as the "robot". The story-telling robot 1 communicates with an individual or with a plurality of people 2. Communication mainly consists of dialogue g11 and gestures g12 (movements). Movements are expressed by images displayed on the display units in addition to actual movements. When an e-mail is sent to the user via the Internet or the like, the story-telling robot 1 receives the e-mail and informs the user that it has arrived and of its content (g13). When, for example, a reply to the e-mail is needed, the story-telling robot 1 communicates with the user about whether advice is required and makes a proposal g14. The story-telling robot 1 then sends the reply (g15). In addition, the story-telling robot 1 presents a weather forecast g16 for a location according to the user's schedule, for example for the scheduled date, time, and place.
In this way, the story-telling robot 1 of the present embodiment generates social abilities so that an emotional connection can be formed between the robot and a person, and can communicate with people in accordance with, for example, their reactions and actions. The story-telling robot 1 also allows the person and the robot to empathize and communicate at an emotional level. The story-telling robot 1 then adapts its story-telling to the situation of the person being read to.
<Outer shape example of the story-telling robot 1>
Next, an example of the outer shape of the story-telling robot 1 will be described.
FIG. 2 is a diagram showing an example of the outer shape of the story-telling robot 1 according to the present embodiment. In the example of the front view g101 and the side view g102 of FIG. 2, the story-telling robot 1 has three display units 111 (111a, 111b, 111c). In the example of FIG. 2, the imaging unit 102a is attached above the display unit 111a, and the imaging unit 102b is attached above the display unit 111b. The display units 111a and 111b correspond to human eyes and present eye images and image information. The screen size of the display units 111a and 111b is, for example, 3 inches. The speaker 112 is attached to the housing 120 in the vicinity of the display unit 111c, which displays an image corresponding to a human mouth. The display unit 111c is composed of, for example, a plurality of LEDs (light emitting diodes), each of which is addressable and can be individually turned on and off. The sound pickup unit 103 is attached to the housing 120.
The story-telling robot 1 also includes a boom 121. The boom 121 is movably attached to the housing 120 via a movable part 131. A horizontal bar 122 is rotatably attached to the boom 121 via a movable part 132.
The display unit 111a is rotatably attached to the horizontal bar 122 via a movable part 133, and the display unit 111b is rotatably attached via a movable part 134. The outer shape of the story-telling robot 1 shown in FIG. 2 is an example and is not limited to this.
With such a configuration, the story-telling robot 1 has, for example, five degrees of freedom of motion (base rotation, neck leaning, eye stroke, eye rotation, and eye tilt), enabling expressive movements. The story-telling robot 1 can communicate via speech synthesis (TTS), animation routines (open-loop combinations of movements, sounds, and eye expressions that convey specific emotions), a projected screen, and so on.
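To make the five degrees of freedom and the open-loop animation routines more concrete, the following sketch represents a routine as a timed sequence of joint targets. The joint names, values, and the actuator interface are hypothetical and are not the robot's actual API.

```python
import time

# Hypothetical joint names for the five degrees of freedom.
JOINTS = ("base_rotation", "neck_lean", "eye_stroke", "eye_rotation", "eye_tilt")

def send_pose(pose: dict) -> None:
    """Placeholder for the real actuator interface."""
    print("pose:", pose)

def play_routine(routine) -> None:
    """Play an open-loop routine: a list of (pose, duration_s) steps."""
    for pose, duration in routine:
        send_pose(pose)
        time.sleep(duration)

# A "surprised" routine: lean back, widen the eye tilt, then settle.
surprised = [
    ({"neck_lean": -10, "eye_tilt": 15}, 0.3),
    ({"neck_lean": -5, "eye_tilt": 5}, 0.4),
    ({"neck_lean": 0, "eye_tilt": 0}, 0.3),
]
play_routine(surprised)
```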
<Example of story-telling by the story-telling robot 1>
Next, an example of story-telling by the story-telling robot 1 will be described with reference to FIGS. 3 to 5. FIG. 3 is a diagram showing an example of story-telling by the story-telling robot 1 while the video of the story is projected on the image display device 7. As in the example of FIG. 3, the story-telling robot 1 is placed, for example, on the table Tab and in front of the image display device 7. The story-telling robot 1 may also be placed beside the image display device 7 or the like. That is, the image display device 7 and the story-telling robot 1 are arranged close to each other. The image displayed on the image display device 7 may be a still image or a moving image. The image also includes music matched to the image. The image display device 7 has a speaker and reproduces the music included in the video from the speaker.
FIGS. 4 and 5 are diagrams showing examples of images displayed on the image display device 7 during story-telling according to the present embodiment and of the actions and facial expressions of the story-telling robot 1. In FIG. 4, image g201 is a first example of an image displayed on the image display device 7, and images g203 and g205 are examples of the actions and facial expressions of the story-telling robot 1 during story-telling. For example, image g203 shows the action and facial expression of being surprised, and image g205 shows the action and facial expression of being disappointed. In FIG. 5, image g211 is a second example of an image displayed on the image display device 7, and images g213 and g215 are examples of the actions and facial expressions of the story-telling robot 1 during story-telling. For example, image g213 shows the action and facial expression of being happy, and image g215 shows the action and facial expression of being interested.
As shown in FIGS. 4 and 5, during story-telling the story-telling robot 1 moves the parts corresponding to the neck and eyes according to the content of the story and the state of the user Hu, and changes the images displayed on the display units corresponding to the eyes and the mouth, thereby changing its actions and facial expressions. The images shown in FIGS. 4 and 5 and the facial expressions and actions of the story-telling robot 1 are examples and are not limited to these.
The story-telling robot 1 may also utter exclamations such as "Oh!" or "My!" in accordance with its facial expressions and actions. These facial expressions and actions of the story-telling robot 1 are designed to enhance persuasiveness and interactivity, and reflect a realistic repertoire of gestures that a human narrator would make. Facial expressions include, for example, eye movements, eye expressions, and mouth expressions produced by the shape of the mouth. Actions include, as described above, neck movements, movements of the appearance of the eyes, and so on.
<Perceptual functions of the story-telling robot 1>
Next, an example of the perceptual functions of the story-telling robot 1 will be described. In addition to the five degrees of freedom of motion described above, the story-telling robot 1 recognizes its environment using vision and hearing. The story-telling robot 1 acquires visual information using data captured by its imaging unit 102. The imaging unit 102 includes, for example, an RGB camera and a depth sensor or a distance sensor. The story-telling robot 1 also acquires auditory information using the acoustic signals picked up by its sound pickup unit 103. The sound pickup unit 103 is a microphone array including a plurality of microphones.
FIG. 6 is a diagram showing an example of perceptual elements.
As shown in image g301, the story-telling robot 1 perceives the voice direction, speech recognition results, the listener (user), the direction of the listener's head, the direction of the listener's body, the positions of the hands, and so on.
Image g304 is an image captured by the imaging unit 102 of the story-telling robot 1, that is, the scene viewed from the robot.
Image g302 is an image in which the region including the listener's face has been extracted from image g304.
Image g303 is the result of performing face recognition on image g302 by a well-known method. In this example, the listener's name is Fred, and 2 is assigned as the identification information (ID). If the listener's face image and name are not registered in the third database 150 (FIG. 12) of the story-telling robot 1, the story-telling robot 1, for example, speaks to the listener to ask for the name, acquires the name from the audio signal of the reply, and registers it in the third database 150.
The face recognition unit 141 (FIG. 13) performs face recognition, the gesture recognition unit 142 (FIG. 13) performs gesture recognition, and the voice recognition unit 143 (FIG. 13) performs speech recognition, estimation of the sound source direction, and the like.
The story-telling robot 1 acquires position information using the data captured by the imaging unit 102. The story-telling robot 1 then obtains the poses of the different parts of a person (limbs, body, head), for example, using well-known skeleton detection (see, for example, Reference 1). Furthermore, in addition to skeleton detection, the story-telling robot 1 detects various gestures such as hand-waving and pointing gestures by a well-known gesture detection method (for example, Japanese Patent Application Laid-Open No. 2021-031630), and estimates the direction in which a person present is pointing with the hand. The story-telling robot 1 can then identify faces by comparison with the information registered in the third database 150, and can also identify facial features such as a smile when the person is nearby.
The story-telling robot 1 identifies the sound source direction and separates the sound sources from the acoustic signal picked up by the sound pickup unit 103 using a technique such as beamforming. The story-telling robot 1 may also perform processing such as noise suppression and utterance-section estimation on the picked-up acoustic signal. Furthermore, the story-telling robot 1 converts the separated, speaker-identified speech into text and performs language understanding. In this way, the story-telling robot 1 obtains auditory information.
Reference 1: J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, R. Moore, and T. Sharp, "Real-time human pose recognition in parts from a single depth image," in CVPR, IEEE, June 2011.
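The sound-source direction estimation mentioned above can be illustrated, in greatly simplified form, by a two-microphone time-difference-of-arrival estimate based on cross-correlation. This is only a sketch of the principle and not the beamforming implementation referred to in the text; the sampling rate and microphone spacing are illustrative.

```python
import numpy as np

def estimate_direction(sig_left, sig_right, fs, mic_distance, c=343.0):
    """Estimate a direction of arrival in degrees from two microphone signals
    using the time difference found by cross-correlation.
    Positive angles point toward the right microphone (sound arrives there
    first), negative angles toward the left microphone; 0 degrees is straight
    ahead (far-field assumption)."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)   # delay in samples
    tdoa = lag / fs
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Illustrative test: a short pulse arriving 2 samples later at the right mic,
# so the source lies toward the left microphone.
fs = 16000
pulse = np.zeros(256); pulse[100] = 1.0
left = pulse
right = np.roll(pulse, 2)
print(round(estimate_direction(left, right, fs, mic_distance=0.1), 1))
# prints about -25.4 (degrees, toward the left microphone)
```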
<Overview of the control of the story-telling robot 1 during story-telling>
Next, an overview of the control of the story-telling robot 1 during story-telling will be described.
Previous studies have shown that reading to children helps with self-expression, empathic understanding, and two-way communication, and supports children's speaking and listening practice.
In addition, because the robot is embodied, it can make use of gestures and gaze. Previous studies have shown that the frequency and direction of a robot's gaze and simple head movements have social effects that increase engagement with learning materials in educational settings.
Furthermore, the combination of gaze and gestures increases the persuasiveness of a talking robot. Previous studies have shown that gestures are more persuasive when combined with gaze directed at the listener.
However, these research results revealed the problem that children lose interest in the robot as a storyteller; this is thought to be because the robot lacks the emotions and gestures specific to the story.
For this reason, in the present embodiment, the following four points (Narrator, Agency, Engagement, and Education) are taken into account in creating the story and in generating the facial expressions, actions, and speech of the story-telling robot 1.
I. The role of the narrator
The narrator tells the story and sets the point of view. In story-telling, the narrator speaks live to convey the story to the listener, and reading aloud can improve story comprehension and attention. When telling a story by voice, especially in order to attract children, it is necessary to aim for rich expressiveness and to speak according to the situation. For example, when parents tell stories to their children, they often "act out" the characters and the story in their own words, exaggerating, varying, and conveying the words expressively. The expressiveness of the narrator's voice contains social and emotional cues that call for attention and help draw the audience into the characters and the world of the story. In a live telling, the narrator uses direct gaze and gestures to emphasize key points of the story and to guide attention and interest.
II. Agency
Story-telling is directly related to meaning-making. A multimodal approach to story-telling makes it possible to engage the listener in the ordering and arrangement of semiotic resources in the meaning-making process. Body and facial expressions, proximal movements, gaze, eye saccades, speaking style, tone of voice, and volume are examples of modal resources for emotionally rich story-telling. With the addition of media technology, modal resources expand to include image colors, text, motion, music, sound, and more. Combining live story-telling with communication media such as projection and animation can provide a wealth of modal resources that support the audience's agency in the meaning-making process.
III. Engagement
Since immediate actions that respond to the listener's behavior are known to increase engagement, it is important to choose stories that support maintaining short, novel interactions between the robot and the audience. These are either high-level immediacy responses (closed loop), in which the robot responds directly by learning about and reacting to the listener's behavior, or low-level immediacy responses, such as scripted, programmed responses (open loop) to the limited interactions anticipated by limited prompts. Closed-loop responses are effective in increasing engagement. Simple open-loop responses are also effective in increasing engagement. Elements such as the aesthetic appeal of the characters and setting are designed to enhance engagement, and it is important to design immediate actions, such as simple open-loop questions answered by the listener, that make the listener feel part of the story.
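The contrast between open-loop and closed-loop immediacy can be sketched as follows; the prompts, listener-state fields, and routine names are illustrative assumptions rather than content from the publication.

```python
import random

def open_loop_prompt(step: int) -> str:
    """Scripted (open-loop) prompts played at fixed points in the story."""
    prompts = {3: "What do you think the fox will do next?",
               7: "Can you make a happy face like the bear?"}
    return prompts.get(step, "")

def closed_loop_response(listener_attending: bool, listener_smiling: bool) -> str:
    """Immediate (closed-loop) reaction chosen from the listener's current state."""
    if not listener_attending:
        return "lean_forward_and_call_name"
    if listener_smiling:
        return "smile_back_routine"
    return "keep_gaze_on_listener"

# One story step: play any scripted prompt, then react to the sensed state.
step = 3
print(open_loop_prompt(step))
print(closed_loop_response(listener_attending=random.random() > 0.2,
                           listener_smiling=True))
```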
IV. Education
Reading to children promotes language and emotional literacy. Stories should be designed to encourage positive values and two-way dialogue between the children and the narrator. A simple question-and-answer exchange can encourage children to use their imagination and think about the consequences of actions. Educational elements can also be added over time to teach children the larger concepts contained in the story.
The above four points are only examples, and the elements to be considered are not limited to these; other elements may also be included.
<Content example>
Here, an example of the content 500 (FIG. 7) used for story-telling will be described.
The content 500 used in the embodiment was designed by its creators to, for example, support the audience's agency and engagement, facilitate learning of the educational value of the story, and support the role of the narrator. Its components consist of the story, animated projections, sound, and the robot's performance and interaction with the audience. The story was created together with writers and educators. A creative director collaborated with designers, illustrators, animators, voice artists, and sound designers to produce the projected visuals, character performances, and sounds for the story elements. Based on the finding that story-specific eye and body gestures promote interaction between the story-telling robot 1 and the listener, new animation movements for the robot's body and eyes dedicated to story interaction were designed for the content in order to enhance the robot's persuasiveness and to convey that the robot empathizes with the characters in the story.
<Example of the story-telling robot 1's actions matched to the content>
Here, an example of the story-telling robot 1's actions matched to the content will be described.
The facial expressions and actions of the story-telling robot 1 shown in FIGS. 4 and 5 are designed to increase persuasiveness and interactivity, and reflect a realistic repertoire of gestures that a human narrator would make.
These include animated gestural movements of the body and eye saccades, and fall into three of the four categories of the gesture classification scheme known as "Kendon's Continuum" (see Reference 2). These gestures consist of gesticulations, emblems, and pantomimes; hand gestures, the fourth category of Kendon's continuum, are difficult because the robot has no hands.
Reference 2: D. McNeill, "Hand and Mind: What Gestures Reveal about Thought," University of Chicago Press, 1992.
A gesticulation is a movement that embodies a meaning related to the accompanying speech. For example, when expressing a positive opinion, the story-telling robot 1 realizes a gesticulation by "nodding".
An emblem is an idiomatic sign with an agreed-upon meaning, such as smiling to express happiness. The story-telling robot 1 therefore realizes an emblem by making a smiling expression to express happiness.
A pantomime is a gesture or series of gestures that tells a story, usually performed without words. For example, the robot can mimic and reflect the sadness of a character in the story by exaggeratedly bending forward or lowering its eyes.
 The ability to use the eyes to guide and interpret social behavior is also a central aspect of social interaction. For this reason, in the embodiment, the pantomimes were designed with particular attention to eye movements. In addition to gestural movements designed in relation to the emotions in the story, a library of eye movements deployed during emotional routines was designed and stored in the third database 150 of the storytelling robot 1 in order to maintain the impression that the robot is actively listening to and engaged with the story. These consist of subtly distributed saccadic movements. The storytelling robot 1 also uses eye tracking to change its gaze direction and thereby support interaction and engagement with the listener.
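 As a concrete illustration only, the following is a minimal sketch (not the patent's implementation) of how such an emotion-keyed library of body gestures and saccadic eye movements might be stored and queried. The routine names and the small-jitter saccade model are assumptions introduced for illustration.

```python
# A hypothetical emotion-keyed library of body gestures and saccadic eye
# movements, loosely following the gesture categories described above.
import random
from dataclasses import dataclass, field

@dataclass
class ExpressionRoutine:
    body_gesture: str                                  # e.g. "nod", "lean_forward"
    eye_targets: list = field(default_factory=list)    # gaze offsets in degrees

def make_saccades(center=(0.0, 0.0), jitter_deg=2.0, n=5):
    """Generate subtly distributed saccade targets around a gaze center."""
    cx, cy = center
    return [(cx + random.uniform(-jitter_deg, jitter_deg),
             cy + random.uniform(-jitter_deg, jitter_deg)) for _ in range(n)]

# Library indexed by the emotion appearing in the story.
EXPRESSION_LIBRARY = {
    "joy":     ExpressionRoutine("nod", make_saccades(jitter_deg=3.0)),
    "sadness": ExpressionRoutine("lean_forward_eyes_down", make_saccades(jitter_deg=1.0)),
}

def routine_for(emotion: str) -> ExpressionRoutine:
    # Fall back to a neutral listening posture when the emotion is unknown.
    return EXPRESSION_LIBRARY.get(emotion, ExpressionRoutine("idle", make_saccades()))

print(routine_for("joy"))
```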
 Furthermore, since the creative storytelling content is designed to be experienced live through interaction between the robot and the listener, spoken narration, and the image display device 7, it is important to define the robot's role in the storytelling. For this reason, in the embodiment, creative content such as the image display device 7 and the robot's gestural performance is used to overcome the limitations and challenges of speech delivery by a speech synthesis system.
 The expressiveness of the voice is an important element for attracting attention and emphasizing information in a story. In the embodiment, because speech synthesis systems have a limited range of expression, modulation, and nuance, the following two roles were defined for speech delivery in storytelling.
 In the first role, a human narrator's voice is used as the storytelling element. The robot takes on the role of facilitator of the story, interacting with the story by combining story-specific gestures, sympathetic reactions, and simple questions and answers. The audio signal for this role is a recording of a human narrator's expressive vocal performance of the story. The robot's speech synthesis system is combined with appropriate gestural movements for the question-and-answer interactions.
 The second role is that of the robot as narrator. Here, the robot's speech synthesis system is used for both storytelling and question answering, combined with the robot's gestural movements.
 In the embodiment, all potential interactions provided by combining these elements were mapped for programming into the robot.
<Processing program of the storytelling robot 1>
 Next, an example of the processing program of the storytelling robot 1 will be described.
 First, a framework is needed to transfer the designed storytelling concept to the actual robot. The robot does not simply function as an open loop that reproduces the story. The robot needs to respond to sensory input and trigger behaviors related to bonding, agency, and education. Furthermore, the robot as a narrator needs to function both as a facilitator and as the storyteller itself.
 Therefore, the framework needs to make it possible to define the robot's behavior and to access perceptual results by combining different actuation components and modalities. The programming performed on the robot to achieve this, in other words the "storytelling choreographer", should be easy to use even for people who do not specialize in robotics. The framework should be versatile enough to accommodate a variety of stories, and it is preferable that elements can be reused to create new stories and behaviors.
 In the embodiment, Behavior Trees (BT) were adopted as the basis for this "storytelling choreographer". A behavior tree is a technique for structuring the flow and control of multiple tasks in a decision-making application. Using behavior trees as the framework provides simple semantics that allow non-robotics experts to implement new behaviors.
 The main elements of the behavior tree in the embodiment are described below.
 A behavior tree models behavior as a hierarchical tree composed of nodes.
 The tree is traversed from top to bottom at a constant rate according to well-defined rules, executing the tasks and commands encountered along the way. The status of tasks and commands is reported back up the chain, and the flow changes accordingly.
 Nodes are classified according to their function, for example, as follows.
 I. Composite: controls the flow of the tree itself. These nodes resemble the control structures of traditional programming languages.
 II. Decorator: processes or modifies the status received from its child.
 III. Leaf: where the actual tasks are performed; these are atomic tasks or other functionality that the robot can execute. Therefore, these nodes cannot have children.
 As this classification shows, the behavior tree naturally separates the logic from the actual tasks. When developing a tree, only the leaf nodes need to be considered. The flow can be defined later and constantly rearranged to create new storytelling behaviors or to extend what is already in place. Important advantages of behavior trees are composability and reusability (due to the hierarchical nature of the tree).
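 To make the node taxonomy concrete, the following is a minimal behavior tree sketch, assuming a simple "ticked at a constant rate" execution model. The class and status names are illustrative assumptions, not the patent's actual implementation.

```python
# Minimal behavior tree: Composite (Sequence), Decorator (Inverter), and Leaf nodes.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Leaf:
    """Atomic task the robot can execute; has no children."""
    def __init__(self, action):
        self.action = action              # callable returning a Status
    def tick(self):
        return self.action()

class Sequence:
    """Composite: ticks children left to right; stops on the first non-SUCCESS child."""
    def __init__(self, children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status             # status is reported back up the chain
        return Status.SUCCESS

class Inverter:
    """Decorator: modifies the status received from its single child."""
    def __init__(self, child):
        self.child = child
    def tick(self):
        status = self.child.tick()
        if status == Status.SUCCESS:
            return Status.FAILURE
        if status == Status.FAILURE:
            return Status.SUCCESS
        return status

# Example: a tiny tree that says a line and then plays an expression routine.
if __name__ == "__main__":
    tree = Sequence([
        Leaf(lambda: (print("SpeakTTS: 'Once upon a time...'"), Status.SUCCESS)[1]),
        Leaf(lambda: (print("ExecuteRoutine: joy"), Status.SUCCESS)[1]),
    ])
    print(tree.tick())                    # traversed top-down at a constant rate
```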
<Implementation of storytelling>
 As shown in FIG. 7, the storytelling implementation uses a "palette" of leaf nodes for the behavior tree engine.
 FIG. 7 is a diagram showing an implementation example of storytelling according to the present embodiment. As shown in FIG. 7, the content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (prompts and audio for voice and speech synthesis, etc.) 503.
 As shown in FIG. 7, in the behavior-tree-based storytelling configuration, the generation unit 144 (FIG. 12; storytelling information creation device) includes, for example, a narrator/agency unit 1441, an engagement unit 1442, and an education unit 1443. Using the input from the sensors (imaging unit 102, sound collection unit 103, and sensor 104 (FIG. 12)) and the content 500, the generation unit 144 generates information to be output to the storytelling robot 1 and information to be output to the image display device 7. The generation unit 144 may be provided in the storytelling robot 1 or in another device (for example, a personal computer or a tablet terminal).
 Here, an example in which another device includes the generation unit 144 will be described.
 These nodes are provided, for example, in a GUI (Graphical User Interface) so that, when creating a new story, developers can "drag and drop" them and combine them with Composite and Decorator nodes to create new storytelling applications.
 Here, examples of the predefined nodes used in the behavior tree for storytelling are shown in FIGS. 8 to 11.
 FIG. 8 is a diagram showing an example of the narrator node. The narrator node advances the story. FIG. 9 is a diagram showing an example of the agency node. The agency node provides multimedia and expression support. FIG. 10 is a diagram showing an example of the engagement node. The engagement node provides closed-loop, immediate reactions of the robot. FIG. 11 is a diagram showing an example of a behavior tree configuration for storytelling. The recognition generation unit 140 may recognize the listener's reaction based on at least one of an image of the listener, an audio signal of the listener's voice, and the result of detecting the listener's posture.
 Note that the structures, connection relationships, nodes, and the like shown in FIGS. 8 to 11 are examples, and the present invention is not limited to these.
 An outline of each node in FIGS. 8 to 11 is described below.
- Narrator (1441) (narrator unit): various blocks are created to advance the narration.
- AudioPlay: allows the narrator to play pre-recorded audio and controls various aspects of the audio.
- SpeakTTS: using a text-to-speech (TTS) engine, the robot utters sentences programmatically. This enables the storytelling robot 1 to react according to the situation during storytelling. The prosody of the utterance can also be controlled, for example, by using tags.
- Agency (1441): comprises multiple blocks (for example, ExecuteRoutine, ProjectorImage, and ProjectorVideo) that handle the robot's expressiveness, launch expressive routines, and support multimedia. For example, ExecuteRoutine allows the execution of predefined robot expression routines. These routines are open-loop combinations of all of the robot's behavioral modalities, such as movement, sound, eyes, and mouth, and express various emotions such as joy and sadness. ProjectorImage and ProjectorVideo use the image display device 7 to display still images and animations.
- Engagement (1442): this block can create immediate, person-responsive behaviors, such as keeping the robot's gaze on a particular person, and comprises multiple blocks (for example, TrackPerson). TrackPerson tracks a person detected using the perception functions (implemented as an additional block, for example GetPeople, which returns information about all persons in the field of view). Combined with other blocks (for example, GetClosestPerson), it can find and track the closest person or take other proximity conditions into account.
- Education (1443): this block comprises several blocks related to setting up questions and answers (for example, AskQuestion, ListensPerson, GetASR). AskQuestion asks a question using speech synthesis and sets the expected answer in the behavior tree memory. ListensPerson indicates which person to listen to when there is a group of people. GetASR obtains the recognized speech from a specified person. It can be used to compare the answer with the expected answer in the case of a yes/no question, and to have the robot react appropriately based on the result. (A sketch that combines these blocks is shown after this list.)
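 Under the assumption of a behavior tree engine like the sketch above, the palette might be composed into a question-and-answer fragment of a story roughly as follows. The leaf classes here are simplified stand-ins for the AudioPlay, ExecuteRoutine, AskQuestion, and GetASR blocks described in the text, not their actual implementation.

```python
# Hypothetical composition of the leaf-node palette into a story fragment.
class Leaf:
    def tick(self):
        raise NotImplementedError

class Sequence:
    """Simplified Composite: succeeds only if every child succeeds, left to right."""
    def __init__(self, *children): self.children = children
    def tick(self):
        return all(child.tick() for child in self.children)

class AudioPlay(Leaf):
    def __init__(self, clip): self.clip = clip
    def tick(self):
        print(f"[AudioPlay] playing pre-recorded narration: {self.clip}")
        return True

class ExecuteRoutine(Leaf):
    def __init__(self, routine): self.routine = routine
    def tick(self):
        print(f"[ExecuteRoutine] expressing: {self.routine}")
        return True

class AskQuestion(Leaf):
    def __init__(self, question, expected):
        self.question, self.expected = question, expected
    def tick(self):
        print(f"[AskQuestion/SpeakTTS] {self.question} (expecting '{self.expected}')")
        return True

class GetASR(Leaf):
    def __init__(self, expected): self.expected = expected
    def tick(self):
        answer = "yes"  # placeholder for the recognized speech of the tracked person
        print(f"[GetASR] heard '{answer}'")
        return answer == self.expected

story_fragment = Sequence(
    AudioPlay("chapter1.wav"),
    ExecuteRoutine("joy"),
    AskQuestion("Did the rabbit do the right thing?", expected="yes"),
    GetASR(expected="yes"),
    ExecuteRoutine("nod"),
)
story_fragment.tick()
```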
<Configuration example of the storytelling robot 1>
 Next, a configuration example of the storytelling robot 1 will be described.
 FIG. 12 is a block diagram showing a configuration example of the storytelling robot 1 according to this embodiment. As shown in FIG. 12, the storytelling robot 1 includes a receiving unit 101, an imaging unit 102, a sound collection unit 103, a sensor 104, a communication device 100, a storage unit 106, a first database 107, a second database 109, a display unit 111, a speaker 112, an actuator 113, a transmission unit 114, a third database 150, and content 500.
 The communication device 100 includes a recognition unit 105 (recognition means), a learning unit 108 (learning means), an action generation unit 110 (action generation means), and a recognition generation unit 140.
 The action generation unit 110 includes an image generation unit 1101, an audio generation unit 1102, a drive unit 1103, and a transmission information generation unit 1104.
<Functions and operations of the storytelling robot 1>
 Next, the function and operation of each functional unit of the storytelling robot 1 will be described with reference to FIG. 1.
 The receiving unit 101 acquires information (for example, e-mail, blog information, news, weather forecasts, etc.) from the Internet via a network, and outputs the acquired information to the recognition unit 105 and the action generation unit 110. Alternatively, when the first database 107 is on the cloud, for example, the receiving unit 101 acquires information from the first database 107 on the cloud and outputs the acquired information to the recognition unit 105.
 The imaging unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. The imaging unit 102 also includes a depth sensor. The imaging unit 102 outputs captured images (human information, that is, information about a person; still images, continuous still images, moving images) and depth information to the recognition unit 105 and the action generation unit 110. The storytelling robot 1 may include a plurality of imaging units 102. In this case, the imaging units 102 may be attached, for example, to the front and rear of the housing of the storytelling robot 1. The imaging unit 102 may also be a distance sensor.
 The sound collection unit 103 is, for example, a microphone array composed of a plurality of microphones. The sound collection unit 103 outputs the acoustic signals (human information) picked up by the microphones to the recognition unit 105 and the action generation unit 110. The sound collection unit 103 may sample each acoustic signal picked up by the microphones with the same sampling signal, convert it from an analog signal to a digital signal, and then output it to the recognition unit 105.
 The sensor 104 is, for example, a temperature sensor that detects the environmental temperature, an illuminance sensor that detects the environmental illuminance, a gyro sensor that detects the tilt of the housing of the storytelling robot 1, an acceleration sensor that detects the movement of the housing of the storytelling robot 1, an atmospheric pressure sensor that detects atmospheric pressure, and the like. The sensor 104 outputs the detected values to the recognition unit 105 and the action generation unit 110. The depth sensor may be included in the sensor 104.
 The storage unit 106 stores, for example, the items to be recognized by the recognition unit 105, various values used for recognition (threshold values, constants), algorithms for recognition, and the like.
 The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used for speech recognition, as well as a comparison image database and image features used for image recognition. Each type of data and feature will be described later. The first database 107 may be placed on the cloud and connected via a network.
 The second database 109 stores data on relationships between people used during learning, such as social components, social norms, social customs, psychology, and the humanities. The second database 109 may be placed on the cloud and connected via a network.
 The communication device 100 recognizes interactions that occur between the storytelling robot 1 and a person, or interactions that occur between multiple people, and learns human emotional interaction based on the recognized content and the data stored in the second database 109. The communication device 100 then generates the social abilities of the storytelling robot 1 from the learned content. Social abilities are, for example, abilities for person-to-person interaction, such as dialogue, behavior, understanding, and empathy.
 The recognition unit 105 recognizes interactions that occur between the storytelling robot 1 and a person, or between multiple people. The recognition unit 105 acquires the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the values detected by the sensor 104. The recognition unit 105 may also acquire the information received by the receiving unit 101. Based on the acquired information and the data stored in the first database 107, the recognition unit 105 recognizes the interaction occurring between the storytelling robot 1 and a person or between multiple people. The recognition method will be described later. The recognition unit 105 outputs the recognition results (features related to sound, feature information related to human behavior) to the learning unit 108. The recognition unit 105 performs well-known image processing (for example, binarization, edge detection, clustering, image feature extraction, etc.) on the image captured by the imaging unit 102. The recognition unit 105 performs well-known speech recognition processing (sound source identification, sound source localization, noise suppression, speech section detection, sound source extraction, acoustic feature calculation, etc.) on the acquired acoustic signal. Based on the recognition results, the recognition unit 105 extracts the voice signal (or acoustic signal) of the target person, animal, or object from the acquired acoustic signal and outputs the extracted signal to the action generation unit 110 as a recognition result. Based on the recognition results, the recognition unit 105 also extracts the image of the target person or object from the acquired image and outputs the extracted image to the action generation unit 110 as a recognition result.
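 As an illustration only, the "well-known image processing" steps mentioned above (binarization, edge detection, feature extraction) could be sketched with OpenCV roughly as follows. This is a generic preprocessing example, not the recognition unit 105 itself.

```python
# Generic image preprocessing of the kind referred to above:
# binarization, edge detection, and local feature extraction.
import cv2

def preprocess(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Binarization with Otsu's threshold.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Edge detection.
    edges = cv2.Canny(gray, 100, 200)
    # Image feature extraction (ORB keypoints and descriptors).
    orb = cv2.ORB_create()
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return binary, edges, keypoints, descriptors
```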
 The learning unit 108 learns human emotional interaction using the recognition results output by the recognition unit 105 and the data stored in the second database 109. The learning unit 108 stores the model generated by learning. The learning method will be described later.
 The recognition generation unit 140 recognizes the listener's reaction based on at least one of an image of the listener, an audio signal of the listener's voice, and the result of detecting the listener's posture, and outputs information based on the recognition result to the action generation unit 110 and the image display device 7. A configuration example and an operation example of the recognition generation unit 140 will be described later with reference to FIG. 13.
 During storytelling, the action generation unit 110 generates facial expression information and motion information for the robot based on the information generated by the generation unit 144 (FIG. 13) included in the recognition generation unit 140. The action generation unit 110 acquires the information received by the receiving unit 101, the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the recognition results from the recognition unit 105. The action generation unit 110 generates actions toward the user (utterances, gestures, images) based on the learned results and the acquired information.
 The image generation unit 1101 generates an output image (a still image, continuous still images, or a moving image) to be displayed on the display unit 111 based on the learned results and the acquired information, and causes the display unit 111 to display the generated output image. In this way, the action generation unit 110 causes the display unit 111 to display animations such as facial expressions and to present images to the user, thereby communicating with the user. The displayed images include images corresponding to the movement of human eyes, images corresponding to the movement of a human mouth, information such as the user's destination (maps, weather maps, weather forecasts, information on shops and resorts, etc.), and images of a person making a video call to the user via the Internet.
 The audio generation unit 1102 generates an output audio signal to be output from the speaker 112 based on the learned results and the acquired information, and causes the speaker 112 to output the generated signal. In this way, the action generation unit 110 communicates with the user by outputting audio signals from the speaker 112. The output audio signals include the voice assigned to the storytelling robot 1, the voice of a person making a video call to the user via the Internet, and the like.
 The drive unit 1103 generates a drive signal for driving the actuator 113 based on the learned results and the acquired information, and drives the actuator 113 with the generated signal. In this way, the action generation unit 110 controls the motion of the storytelling robot 1 to express emotions and the like, and communicates with the user.
 The transmission information generation unit 1104 generates, based on the learned results and the acquired information, transmission information (audio signals, images) that the user wants to send, for example, to another user with whom the user is conversing over the network, and causes the transmission unit 114 to transmit the generated information.
 The display unit 111 is a liquid crystal display device, an organic EL (Electro Luminescence) display device, or the like. The display unit 111 displays the output image output by the image generation unit 1101 of the communication device 100.
 The speaker 112 outputs the output audio signal generated by the audio generation unit 1102.
 The actuator 113 drives the moving parts in accordance with the drive signal output by the drive unit 1103.
 The transmission unit 114 transmits the transmission information output by the transmission information generation unit 1104 to the destination via the network.
 The third database 150 stores listeners' facial images associated with names and identification information, and stores the library of eye movements deployed during emotional routines. The third database 150 also stores data used for gesture recognition, language models used for speech recognition, and the like. The third database 150 may be placed on the cloud and connected via a network.
 As described above, the content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (prompts and audio for voice and speech synthesis, etc.) 503. The content 500 may be placed on the cloud and connected via a network.
 The storytelling robot 1 communicates with an individual or with multiple people 2 using, for example, the method described in Japanese Patent Application No. 2020-108946.
<Configuration and operation example of the recognition generation unit 140>
 Next, the configuration and an operation example of the recognition generation unit 140 will be described.
 FIG. 13 is a diagram showing a configuration example of the recognition generation unit 140 according to this embodiment. As shown in FIG. 13, the recognition generation unit 140 includes, for example, a face recognition unit 141, a gesture recognition unit 142, a speech recognition unit 143, and a generation unit 144. As described above, the generation unit 144 includes, for example, a narrator/agency unit 1441, an engagement unit 1442, and an education unit 1443.
 The recognition generation unit 140 performs face recognition, gesture recognition, and speech recognition. Using the input from the sensors (imaging unit 102, sound collection unit 103, sensor 104) and the content 500, the recognition generation unit 140 generates information to be output to the action generation unit 110 and information to be output to the image display device 7.
 The face recognition unit 141 refers to the data stored in the third database 150 and recognizes the face of a person in the captured image using a well-known method. If no corresponding data is stored in the third database 150, the face recognition unit 141 assigns a name and identification information to the recognized face image and stores it in the third database 150.
 The gesture recognition unit 142 refers to the data stored in the third database 150 and, using well-known methods, detects and tracks the position and tilt of the person's head, the orientation of the body, and the position of the hands in the captured image.
 The speech recognition unit 143 refers to the data stored in the third database 150 and, using well-known methods, performs processing such as sound source localization, sound source separation, noise suppression, speech section detection, and speaker identification.
 The narrator/agency unit 1441 is a trained model that was trained using, as teacher data, information detected by the sensors during storytelling, such as a human narrator's body and facial expressions, proximal movements, gaze, eye saccades, manner of speaking, tone of voice, and volume. During storytelling, the narrator/agency unit 1441 receives the information detected by the sensors and outputs, for example, how to move the neck, the gaze, eye and mouth movements, the manner of speaking, the tone of voice, the volume, and the like.
 The engagement unit 1442 is a trained model that was trained using, as teacher data, the information detected by the sensors during storytelling and the information output by the narrator/agency unit 1441, for example, immediate behaviors responding to the listener's actions. During storytelling, the engagement unit 1442 receives at least one of the information detected by the sensors and the information output by the narrator/agency unit 1441, and outputs, for example, immediate behaviors responding to the listener's actions.
 The education unit 1443 is a trained model that was trained using, as teacher data, the information detected by the sensors during storytelling and the information output by the engagement unit 1442, for example, information that encouraged positive values and two-way conversation between the narrator and the listener. During storytelling, the education unit 1443 receives at least one of the information detected by the sensors and the information output by the engagement unit 1442, and outputs, for example, information that encourages positive values and two-way conversation between the narrator and the listener.
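 The chain of trained models described above (narrator/agency, then engagement, then education) might be wired together roughly as in the following sketch, assuming each model exposes a simple predict interface. The interfaces and field names are assumptions introduced for illustration, not the patent's API.

```python
# Hypothetical chaining of the three trained models in the generation unit 144.
from typing import Any, Dict

class TrainedModel:
    """Stand-in for a learned model with a predict() interface."""
    def __init__(self, name): self.name = name
    def predict(self, features: Dict[str, Any]) -> Dict[str, Any]:
        # A real model would map sensor features to behavior parameters here.
        return {f"{self.name}_output": features}

narrator_agency = TrainedModel("narrator_agency")   # 1441
engagement      = TrainedModel("engagement")        # 1442
education       = TrainedModel("education")         # 1443

def generate(sensor_info: Dict[str, Any]) -> Dict[str, Any]:
    na_out  = narrator_agency.predict(sensor_info)              # gaze, prosody, head motion
    eng_out = engagement.predict({**sensor_info, **na_out})     # immediate reactions
    edu_out = education.predict({**sensor_info, **eng_out})     # dialogue prompts
    return {**na_out, **eng_out, **edu_out}

print(generate({"face": "smiling", "speech": "yes", "posture": "leaning_forward"}))
```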
<Data stored in the first database>
 Next, examples of the data stored in the first database 107 will be described. The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and a comparison image database.
 The language model database stores language models. A language model is a probabilistic model that assigns, to an arbitrary character string, the probability that it is a Japanese sentence or the like. The language model is, for example, an N-gram model, a hidden Markov model, or a maximum entropy model.
 The acoustic model database stores sound source models. A sound source model is a model used to identify the sound source of a collected acoustic signal.
 An acoustic feature is a feature calculated after transforming the collected acoustic signal into a frequency-domain signal by a Fast Fourier Transform. As acoustic features, for example, the static Mel-Scale Log Spectrum (MSLS), delta MSLS, and one delta power are calculated at predetermined intervals (for example, every 10 ms). MSLS uses spectral features as features for acoustic recognition and is obtained by applying an inverse discrete cosine transform to MFCCs (Mel Frequency Cepstrum Coefficients).
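 As a rough illustration only, MSLS-like features can be computed by taking MFCCs and applying an inverse DCT, together with delta features, at a 10 ms hop. The following sketch uses librosa and scipy under assumed parameter choices (16 kHz audio, 13 coefficients) and is not the patent's implementation.

```python
# Minimal sketch: MSLS-like features via inverse DCT of MFCCs, plus deltas.
import numpy as np
import librosa
from scipy.fftpack import idct

def msls_features(wav_path, n_coeff=13, hop_ms=10):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * hop_ms / 1000)                        # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff, hop_length=hop)
    msls = idct(mfcc, axis=0, norm='ortho')              # static mel-scale log spectrum
    delta_msls = librosa.feature.delta(msls)             # delta MSLS
    power = librosa.feature.rms(y=y, hop_length=hop)     # per-frame power
    delta_power = librosa.feature.delta(power)           # one delta power per frame
    return np.vstack([msls, delta_msls, delta_power])    # (2 * n_coeff + 1, frames)
```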
 The dialogue corpus database stores a dialogue corpus. The dialogue corpus is a corpus used when the storytelling robot 1 and the user converse, for example, scenarios corresponding to the contents of the dialogue.
 The comparison image database stores images used, for example, for pattern matching. The images used for pattern matching include, for example, images of the user, images of the user's family, images of the user's pets, and images of the user's friends and acquaintances.
 An image feature is, for example, a feature extracted from an image of a person or object by well-known image processing.
 Note that the above is only an example, and the first database 107 may store other data.
<Data stored in the second database>
 Next, examples of the data stored in the second database 109 will be described. The second database 109 stores, for example, social components, social norms, data on psychology, and data on the humanities.
 Social components are, for example, age, gender, occupation, and relationships between people (parent and child, married couple, lovers, friends, acquaintances, coworkers, neighbors, teacher and student, etc.).
 Social norms are rules and manners for individuals and between people, and are associated with utterances, gestures, and the like corresponding to age, gender, occupation, and the relationships between people.
 Data on psychology is, for example, data on findings obtained from past experiments and studies (for example, the attachment relationship between mother and infant, complexes such as the Oedipus complex, conditioned reflexes, fetishism, etc.).
 Data on the humanities is, for example, data on religious rules, customs, national character, regional character, and acts, behaviors, and utterances characteristic of a country or region. For example, for Japanese people, there is data such as expressing agreement by nodding rather than saying it in words. Data on the humanities also includes, for example, data on what is considered important and what is prioritized in each country or region.
<Example of processing procedure>
 Next, an example of the processing procedure will be described. FIG. 14 is a flowchart showing an example of the procedure of the storytelling process of the storytelling robot 1 according to this embodiment.
 (Step S1) The generation unit 144 generates the content to be used for storytelling. The communication device 100 may instead acquire content generated by an external device, for example, via the receiving unit 101.
 (Step S2) The recognition generation unit 140 acquires sensor information (the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the values detected by the sensor 104).
 (Step S3) The face recognition unit 141 of the recognition generation unit 140 detects an image including the listener's face from the captured image and performs face recognition of the listener. If the listener's face image is not registered in the third database 150, the recognition generation unit 140 acquires the listener's name, for example, by speaking to the listener and having the listener say their name.
 (Step S4) Based on the acquired content, the generation unit 144 outputs to the image display device 7 the image to be displayed and the acoustic signal to be output by the image display device 7. The action generation unit 110 then starts reading the content aloud, using the acquired content and the listener's name.
 (Step S5) The recognition generation unit 140 acquires sensor information (the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the values detected by the sensor 104).
 (Step S6) The face recognition unit 141 recognizes the orientation and expression of the listener's face from the captured image and the values detected by the sensor 104. The gesture recognition unit 142 detects the listener's movements from the captured image and the values detected by the sensor 104. The speech recognition unit 143 performs speech recognition on the acoustic signal picked up by the sound collection unit 103.
 (Step S7) The generation unit 144 generates facial expression information and motion information for the storytelling robot 1 based on the acquired content and the acquired sensor information. Based on the generated facial expression and motion information, the action generation unit 110 then generates the images to be displayed on the display units 111a and 111b corresponding to the eyes of the storytelling robot 1 and on the display unit 111c corresponding to its mouth, the audio signal to be output from the speaker 112, and the drive signal for driving the actuator 113.
 (Step S8) The recognition generation unit 140 determines whether the storytelling has ended. The end of the storytelling is not limited to the end of the content; the storytelling may be ended partway through the content depending on the listener's reaction. If the content is long, the recognition generation unit 140 may also end it partway through, depending on how long the storytelling has taken and the listener's reaction. If the recognition generation unit 140 determines that the storytelling has ended (step S8; YES), the process ends. If it determines that the storytelling has not ended (step S8; NO), the process returns to step S5.
 The processing procedure shown in FIG. 14 is an example and is not limited to this. Some processes may be performed in parallel, and the order may be changed. The listener's name may also be acquired during the storytelling, after the storytelling has started.
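 A minimal sketch of this loop (steps S1 to S8), with trivial stand-in components so that it runs end to end, might look as follows. The class and function names are illustrative assumptions, not the patent's API.

```python
# Illustrative main loop for the storytelling process of FIG. 14 (steps S1-S8).
class Sensors:
    def __init__(self): self.t = 0
    def read(self):                                   # S2/S5: image, audio, detected values
        self.t += 1
        return {"frame": self.t}

class Recognizer:
    def recognize_face(self, info):                   # S3: identify the listener
        return {"name": "Hana"}
    def recognize(self, info):                        # S6: face, gesture, speech recognition
        return {"attentive": True}

class Generator:
    def media_for(self, content): return content["images"]
    def react(self, content, reaction):               # S7: expression and motion information
        return ("smile" if reaction["attentive"] else "neutral", "nod")
    def is_finished(self, content, info):             # S8: end of content or listener reaction
        return info["frame"] >= len(content["images"])

def storytelling_loop(content):                       # S1: content generated or acquired beforehand
    sensors, recognizer, generator = Sensors(), Recognizer(), Generator()
    info = sensors.read()                             # S2
    listener = recognizer.recognize_face(info)        # S3
    print("S4: projecting", generator.media_for(content), "for", listener["name"])
    while True:
        info = sensors.read()                         # S5
        reaction = recognizer.recognize(info)         # S6
        expression, motion = generator.react(content, reaction)
        print("S7: robot renders", expression, motion)
        if generator.is_finished(content, info):      # S8
            break

storytelling_loop({"images": ["page1.png", "page2.png", "page3.png"]})
```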
 The images and audio of the content displayed on the image display device 7 may be changed partway through the storytelling according to the listener's reactions and the like. For example, the content may be a story with multiple branches, and two or more pieces of music may be prepared to be played as the story progresses. Such content may be generated, for example, by a creator or the like operating the generation unit 144 configured as shown in FIGS. 7 to 11. The generation unit 144 may also change at least one of the image to be displayed on the image display device 7 during storytelling, the acoustic signal to be output by the image display device 7 during storytelling, and the audio signal, facial expression data, and motion data of the storytelling robot 1 during storytelling.
<Evaluation>
 In order to evaluate the storytelling robot 1 described above, the following two evaluations were performed.
 (First evaluation)
 In the first evaluation, storytelling was performed in the following three different conditions.
 I. The robot is used as the narrator. In this case, the robot tells the story directly. In the evaluation below, this condition is called "direct". The content is displayed on a TV screen behind the robot.
 II. The story is told by a pre-recorded voice of someone else. The robot's role is to facilitate the storytelling process, for example by engaging people in dialogue. In the evaluation below, this condition is called "facilitator". The content is displayed on a TV screen behind the robot.
 III. Only the content of the story is displayed on a tablet and read aloud using traditional storytelling techniques. In this case, the same content, including the multimedia content and other assets, is used. In the evaluation below, this condition is called "tablet". In this case, no robot is used.
 In the evaluation, the robot's sense of agency (capabilities such as animation and expression routines, gaze, and saccades) was controlled in order to further examine the effect of using the communication channels specific to an embodied agent.
 In the first evaluation, disabling these capabilities is called "AGENCY OFF" and enabling them is called "AGENCY ON". With "AGENCY OFF", the robot is merely a prop with no behavior.
 For the first evaluation, 30 subjects (15 adult women and 15 adult men) were recruited and evaluated the three conditions described above (direct, facilitator, tablet).
 First, the subjects (listeners) were presented with a storytelling performance in each of the three conditions and were asked whether they "liked" or "disliked" it, with the following instruction: a "like" is given only when the storyteller's performance satisfies the category description. Subjects were allowed to give multiple "likes" if they were satisfied.
 FIG. 15 is a diagram showing an example of the first evaluation results.
 In FIG. 15, the items are defined as follows.
- "Content": simply conveying the content of the story is the minimum requirement for evaluation.
- "Persuasiveness": the point of evaluation is whether the performance is persuasive and credible as a whole.
- "Realism": the points of evaluation are how essential the elements included in the story are, and the overall feeling that the delivery brings the story to life.
- "Interactivity": the point of evaluation is whether the subject feels able to participate in the storytelling process.
 As shown in FIG. 15, the use of the robot was rated highly in the "agency on" case, in which the robot's agency was active. This leads to the conclusion that, with "agency on", using the robot raises the user's expectations, and the use of the robot only becomes meaningful when the robot's communication modalities are tapped. This indicates that the robot's expressiveness and behavior promote engagement and give meaning to the interaction.
 (Second evaluation)
 The second evaluation assessed attention allocation.
 A central function of human attention is to select the objects of current interest and ignore the remaining surrounding objects. Neuroscientists classify attention into two distinct functions: bottom-up attention and top-down attention. Bottom-up attention is thought to operate on visual features and unconsciously emphasize the regions containing the most salient information. Many studies on attention record human eye movements to represent changes in the locus of attention. These studies show that bottom-up attention allocation is generated from feature contrasts in color, motion, faces, and other visual and auditory dimensions. The second evaluation examined whether the robot's agency could capture human attention and shift it away from the TV screen (content) to the robot.
 As described above, the storytelling robot 1 performs not merely movements but meaningful behaviors accompanied by socially meaningful multimodal actions. Therefore, rather than examining the causal relationship between random movements and attention, the evaluation examined the effect of the socially relevant communication affordances of the robot.
 In the second evaluation, only the image display device 7 displaying the content images and the storytelling robot 1 placed in front of it were presented, and the subjects were asked to perform a task. In the second evaluation, it was assumed that the orientation of the subject's head indicates the direction of attention. To visualize how the subject's attention moves between the objects (image display device 7, storytelling robot 1) while performing the task, the "head orientation" was set to 1 when the subject was facing the robot and to 0 when facing the image display device 7. In the second evaluation, the content of the story displayed on the image display device 7 was kept constant in order to compare the effect of the robot on attention allocation.
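 As an aside, the binary head-orientation signal described here can be summarized as an attention-allocation statistic. The following is a small sketch (with made-up sample data, not measured results) of how the fraction of time spent attending to the robot and the number of attention shifts toward it could be computed.

```python
# Summarizing a binary head-orientation time series
# (1 = facing the storytelling robot 1, 0 = facing the image display device 7).
def attention_stats(head_orientation):
    n = len(head_orientation)
    time_on_robot = sum(head_orientation) / n                  # fraction of samples on robot
    shifts_to_robot = sum(
        1 for prev, cur in zip(head_orientation, head_orientation[1:])
        if prev == 0 and cur == 1                              # 0 -> 1 transitions
    )
    return time_on_robot, shifts_to_robot

# Made-up example sequence for illustration only.
sample = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
print(attention_stats(sample))   # -> (0.5, 3)
```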
 FIG. 16 is a diagram showing an example of the second evaluation results. In FIG. 16, the horizontal axis is discrete time and the vertical axis is head orientation (1 or 0). Graph g401 is an example of the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency on". Graph g411 is an example of the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency off". Line g402 shows the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency on". Line g403 is for reference and shows the temporal change in the listener's attention allocation when only the image display device 7 was used and the storytelling robot 1 was not placed in front of it. Line g412 shows the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency off".
 FIG. 16 shows that the socially relevant behaviors and actions of the robot attracted the subjects' attention: the change in head orientation from 0 (image display device 7) to 1 (storytelling robot 1) confirms that the robot captured the subjects' attention.
 As shown by line g403 in graphs g401 and g411, when the robot was not used, the subjects fixated only on the story images shown on the screen of the image display device 7.
 As shown by lines g402 and g412, "agency on" tends to make the subjects more aware of the robot than "agency off": the head orientation changes toward the robot more frequently and for longer periods.
 In FIG. 16, only the results of two subjects are shown in order to make the attention allocation easier to understand, but a consistent tendency was confirmed across all 30 subjects.
 As described above, in the present embodiment, elements of storytelling such as narration, agency, connection, and education were identified and synthesized into the robot. That is, in the present embodiment, the storytelling function of the robot was designed by exploiting the robot's communication channels, such as the expressiveness of its gestures and eye movements and the use of projected content and an emotional voice (i.e., acting as a facilitator).
 Thus, according to the present embodiment, a wide variety of storytelling can be created by collecting various pieces of content, and during the reading of the generated content the storytelling robot 1 can enrich the communication by adding motions, giving back-channel responses, and asking questions at the beginning, so that storytelling comparable to that performed by a parent or a storytelling professional, together with its educational effect, can be obtained. According to the present embodiment, meaningful interaction between humans and the robot is realized and acceptance by users is maximized.
 Because the approach of the present embodiment maintains long-term attention and greater acceptance, it can inspire the design of future creative content for social robots.
 Furthermore, according to the present embodiment, using an embodied agent such as a robot for storytelling can better maintain the listener's interest when its communication affordances (embodiment, expressiveness, and other modalities) are exploited with "agency on".
 As described above, the storytelling robot 1 is not limited to storytelling itself; a similar effect can be obtained when it plays the role of a "facilitator" that supports the storytelling.
 (Flow of cognition, learning, and social ability)
 Next, the flow of cognition and learning performed by the communication device 100 of the embodiment will be described. FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device 100 of the embodiment.
 A recognition result 201 is an example of a result recognized by the recognition unit 105. The recognition result 201 is, for example, an interpersonal relationship, an interpersonal interaction, or the like.
 Multimodal learning and understanding 211 is an example of the learning performed by the learning unit 108. The learning method 212 is, for example, machine learning. The learning targets 213 are, for example, social constructs, social models, psychology, and the humanities.
 Social abilities 221 are social skills such as empathy, individualization, adaptability, and emotional affordance.
 (Data to be recognized)
 Next, examples of the data recognized by the recognition unit 105 will be described.
 FIG. 18 is a diagram showing examples of data recognized by the recognition unit 105 according to the embodiment. In the embodiment, personal data 301 and interpersonal relationship data 351 are recognized as shown in FIG. 18.
 Personal data concern behavior that occurs within a single person, and consist of data acquired by the imaging unit 102 and the sound pickup unit 103 and data obtained by applying voice recognition processing, image recognition processing, and the like to the acquired data. Personal data include, for example, voice data, semantic data obtained by processing the voice, voice volume, voice intonation, uttered words, facial expression data, gesture data, head posture data, face direction data, gaze data, co-occurrence expression data, and physiological information (body temperature, heart rate, pulse rate, and the like). Which data are used may be selected by, for example, the designer of the storytelling robot 1. In that case, based on an actual conversation or demonstration between two people, the designer of the storytelling robot 1 may specify which features of the personal data are important for communication. The recognition unit 105 also recognizes the user's emotion as personal data based on information extracted from the acquired speech and images; in this case it relies on, for example, voice volume and intonation, utterance duration, and facial expressions. The storytelling robot 1 of the embodiment then acts so as to keep the user's emotions positive and to maintain a good relationship with the user.
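 As a minimal illustration of how such per-person features could be grouped in software (the field names and types below are assumptions made for illustration, not the patent's data format), one could use a simple record type:

# Illustrative sketch only: a container for the per-person ("personal data")
# features listed above. Field names are assumptions, not the patent's schema.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PersonalData:
    person_id: str
    transcript: str = ""                 # recognized utterance text
    voice_volume: float = 0.0            # e.g. RMS level
    voice_pitch_variation: float = 0.0   # intonation proxy
    words: List[str] = field(default_factory=list)
    facial_expression: Optional[str] = None
    gesture: Optional[str] = None
    head_pose: Optional[Tuple[float, float, float]] = None  # (yaw, pitch, roll)
    gaze_target: Optional[str] = None    # e.g. "robot", "display"
    heart_rate: Optional[float] = None   # physiological information

# Example usage with hypothetical values
sample = PersonalData(person_id="user01", transcript="read it again!",
                      voice_volume=0.6, facial_expression="smile",
                      gaze_target="robot")
print(sample)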
 Here, an example of a method of recognizing the user's social background will be described.
 The recognition unit 105 estimates the user's nationality, hometown, and the like based on the acquired speech and images and on the data stored in the first database 107. The recognition unit 105 extracts the user's daily schedule, such as wake-up time, time of leaving home, time of returning home, and bedtime, based on the acquired speech and images and the data stored in the first database 107. Based on the acquired speech and images, the daily schedule, and the data stored in the first database 107, the recognition unit 105 estimates the user's gender, age, occupation, hobbies, career, preferences, family structure, religion, degree of attachment to the storytelling robot 1, and the like. Since the social background may change, the storytelling robot 1 updates the information about the user's social background based on conversations, images, and the data stored in the first database 107. To enable emotional sharing, the social background and the degree of attachment to the storytelling robot 1 are not limited to attributes that can be entered directly, such as age, gender, and career; they are also recognized from, for example, emotional ups and downs depending on the time of day and the volume and intonation of the voice for particular topics. In this way, the recognition unit 105 also learns things that the user is not aware of, based on daily conversations and the facial expressions shown during them.
 Interpersonal relationship data are data about the relationships between the user and other people. Using interpersonal relationship data in this way makes it possible to use social data. Interpersonal relationship data include, for example, the distance between people, whether the gazes of the people in conversation meet, voice intonation, and voice volume. As described later, the distance between people differs depending on the interpersonal relationship; for example, for a married couple or friends the distance is L1, whereas the distance between business associates is L2, which is larger than L1.
 For example, based on an actual conversation or demonstration between two people, the designer of the storytelling robot 1 may specify which features of the interpersonal data are important for communication. Such personal data, interpersonal relationship data, and information about the user's social background are stored in the first database 107 or the storage unit 106.
 When there are multiple users, for example the user and his or her family, the recognition unit 105 collects and learns personal data for each user and estimates the social background of each person. Such a social background may also be acquired via, for example, a network and the receiving unit 101; in that case the user may enter his or her social background or select items using, for example, a smartphone.
 Here, an example of a method of recognizing interpersonal relationship data will be described.
 The recognition unit 105 estimates the distance (spacing) between the people who are communicating, based on the acquired speech and images and the data stored in the first database 107. The recognition unit 105 detects whether the gazes of the people who are communicating meet, based on the acquired speech and images and the data stored in the first database 107. Based on the acquired speech and the data stored in the first database 107, the recognition unit 105 estimates relationships such as friendship, work colleagues, and relatives or parent and child from the utterance content, voice volume, voice intonation, received e-mails, sent e-mails, and the correspondents of those e-mails.
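 A highly simplified sketch of the kind of geometric checks described here (illustrative only; the patent does not specify these formulas, and the positions, gaze vectors, and tolerance are assumptions) might estimate the distance between two detected people and whether their gaze directions roughly point at each other:

# Illustrative sketch: estimate person-to-person distance and a crude
# "mutual gaze" flag from 2D positions and gaze direction vectors.
# The thresholds and representation are assumptions, not from the patent.
import math

def distance(p1, p2):
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def facing(pos_a, gaze_a, pos_b, tol_deg=15.0):
    """True if person A's gaze vector points roughly toward person B."""
    to_b = (pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    ang = math.degrees(math.atan2(to_b[1], to_b[0]) - math.atan2(gaze_a[1], gaze_a[0]))
    ang = (ang + 180.0) % 360.0 - 180.0
    return abs(ang) < tol_deg

def mutual_gaze(pos_a, gaze_a, pos_b, gaze_b):
    return facing(pos_a, gaze_a, pos_b) and facing(pos_b, gaze_b, pos_a)

# Hypothetical positions (metres) and unit gaze vectors
a, b = (0.0, 0.0), (1.2, 0.0)
print(distance(a, b))                       # 1.2
print(mutual_gaze(a, (1, 0), b, (-1, 0)))   # True: they face each other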
 In the initial state of use, the recognition unit 105 may, for example, randomly select one of several combinations of initial values of social background and personal data stored in the first database 107 and start communication with it. If it is difficult to keep the communication with the user going using the behavior generated from the randomly selected combination, the recognition unit 105 may select a different combination.
 (Learning procedure)
 In the embodiment, the learning unit 108 performs learning using the personal data 301 and the interpersonal relationship data 351 recognized by the recognition unit 105 and the data stored in the second database 109.
 Here, social constructs and social norms will be described. In spaces where people take part in social interaction, interpersonal relationships differ depending on, for example, the distance between people. For example, a relationship in which people are 0 to 50 cm apart is an intimate relationship, and a relationship in which people are 50 cm to 1 m apart is a personal relationship. A relationship in which people are 1 to 4 m apart is a social relationship, and a relationship in which people are 4 m or more apart is a public relationship. During learning, such social norms are used as a reward (an implicit reward) indicating whether gestures and utterances conform to them.
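 The distance bands above map naturally onto a small classification rule; the sketch below is only an illustration of that mapping (how the exact boundary values are handled is an assumption):

# Illustrative sketch: map an interpersonal distance (metres) onto the
# proxemic zones described above. Boundary handling is an assumption.
def proxemic_zone(distance_m: float) -> str:
    if distance_m < 0.5:
        return "intimate"   # 0 - 50 cm
    if distance_m < 1.0:
        return "personal"   # 50 cm - 1 m
    if distance_m < 4.0:
        return "social"     # 1 m - 4 m
    return "public"         # 4 m or more

for d in (0.3, 0.8, 2.5, 6.0):
    print(d, proxemic_zone(d))

 A zone label of this kind could then be compared against the robot's gestures and utterances when computing the implicit reward mentioned above.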
 The interpersonal relationship may also be adapted to the environment and the user by how the reward features are set during learning. Specifically, multiple intimacy settings may be provided, such as a rule of rarely speaking to people who are uncomfortable with robots and a rule of actively speaking to people who like robots. Then, in the real environment, the recognition unit 105 may determine which type the user is from the results of processing the user's speech and images, and the learning unit 108 may select the corresponding rule.
 A human trainer may also evaluate the behavior of the storytelling robot 1 and provide a reward (an implicit reward) according to the social constructs and norms the trainer knows.
 FIG. 19 is a diagram showing an example of the agent creation method used by the action generation unit 110 according to the embodiment.
 The area indicated by reference numeral 300 shows the flow from the input through agent creation to the output (the agent).
 The images captured by the imaging unit 102 and the information 310 picked up by the sound pickup unit 103 are information about people (the user, people related to the user, and others) and information about the environment around them. The raw data 302 acquired by the imaging unit 102 and the sound pickup unit 103 are input to the recognition unit 105.
 The recognition unit 105 extracts and recognizes multiple pieces of information from the input raw data 302 (voice volume, voice intonation, utterance content, uttered words, the user's gaze, the user's head posture, the user's face direction, the user's physiological information, the distance between people, whether people's gazes meet, and so on). Using the extracted and recognized information, the recognition unit 105 performs multimodal understanding using, for example, a neural network.
 The recognition unit 105 identifies individuals based on, for example, at least one of the audio signal and the image, and assigns identification information (an ID) to each identified individual. The recognition unit 105 recognizes the actions of each identified person based on at least one of the audio signal and the image. The recognition unit 105 recognizes the gaze of an identified person by, for example, applying well-known image processing and tracking processing to the image. The recognition unit 105 recognizes speech by, for example, applying speech recognition processing (sound source identification, sound source localization, sound source separation, speech segment detection, noise suppression, and the like) to the audio signal. The recognition unit 105 recognizes the head posture of an identified person by, for example, applying well-known image processing to the image. When, for example, two people appear in a captured image, the recognition unit 105 recognizes their interpersonal relationship based on the utterance content, the distance between the two people in the image, and the like. The recognition unit 105 also recognizes (estimates) the social distance between the storytelling robot 1 and the user based on, for example, the results of processing the captured images and the collected audio signal.
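 As a sketch of how such per-modality recognizers might be combined into one multimodal observation (the function names and structure are assumptions; the stubs stand in for real speech and vision models, which are not shown), consider:

# Illustrative pipeline sketch: combine per-modality recognition results
# into a single observation for multimodal understanding. The recognizer
# functions here are stubs standing in for real speech/vision models.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Observation:
    person_id: str
    keywords: List[str]
    gesture: str
    head_pose: Tuple[float, float, float]
    gaze_on_robot: bool
    interpersonal_distance: float

def identify_person(image, audio):      # stub: face/voice identification
    return "user01"

def recognize_speech(audio):            # stub: ASR + keyword extraction
    return ["again", "story"]

def recognize_gesture(image):           # stub: gesture classifier
    return "pointing"

def estimate_head_pose(image):          # stub: head-pose estimator
    return (25.0, -5.0, 0.0)

def build_observation(image, audio) -> Observation:
    yaw, pitch, roll = estimate_head_pose(image)
    return Observation(
        person_id=identify_person(image, audio),
        keywords=recognize_speech(audio),
        gesture=recognize_gesture(image),
        head_pose=(yaw, pitch, roll),
        gaze_on_robot=yaw > 20.0,        # assumption: robot sits to one side
        interpersonal_distance=1.2,      # would come from person detection
    )

print(build_observation(image=None, audio=None))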
 The learning unit 108 performs reinforcement learning 304 rather than deep learning. In the reinforcement learning, learning is performed so as to select the most relevant features (including social constructs and social norms). In this case, the multiple pieces of information used in the multimodal understanding are used as input features. The inputs to the learning unit 108 are, for example, the raw data themselves, a name ID (identification information), the influence of the face, recognized gestures, keywords from the speech, and the like. The output of the learning unit 108 is a behavior of the storytelling robot 1. The output behavior may be anything defined according to the purpose, such as a voice response, a robot routine, or the angle through which the robot should turn. In the multimodal understanding, a neural network or the like may be used for detection; in that case, different bodily modalities may be used to detect human activity. Which features to use may be selected in advance by, for example, the designer of the storytelling robot 1. Furthermore, in the present embodiment, social models and social constructs can be incorporated by using implicit and explicit rewards during learning. The result of the reinforcement learning is the output, namely the agent 305. In this way, the agent used by the action generation unit 110 is created in the present embodiment.
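 To make the role of reinforcement learning concrete, the fragment below sketches a tiny tabular Q-learning agent whose states are discretized multimodal features and whose actions are robot behaviors. This is only an illustration under stated assumptions (the action set, state encoding, and hyperparameters are hypothetical), not the learning algorithm actually used in the embodiment.

# Illustrative sketch: a tiny tabular Q-learning agent. States are
# discretized multimodal features (e.g. proxemic zone + gaze flag) and
# actions are robot behaviours. Hyperparameters are arbitrary assumptions.
import random
from collections import defaultdict

ACTIONS = ["speak_response", "expressive_routine", "turn_towards_user", "stay_quiet"]

class StorytellingAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # epsilon-greedy selection over the discrete action set
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

# Example step with a hypothetical discretized state
agent = StorytellingAgent()
state = ("personal", "gaze_on_robot")
action = agent.act(state)
agent.update(state, action, reward=1.0, next_state=("personal", "gaze_on_display"))
print(action, agent.q[(state, action)])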
 The area indicated by reference numeral 350 shows how the rewards are used.
 The implicit reward 362 is used to learn implicit reactions. In this case, the raw data 302 include the user's reactions, and the multimodal understanding 303 described above is applied to the raw data 302. The learning unit 108 generates an implicit reaction system 372 using the implicit reward 362 and the social models and the like stored in the second database 109. The implicit reward may be obtained through reinforcement learning or may be given by a human. The implicit reaction system may also be a model acquired through learning.
 For learning explicit reactions, for example, a human trainer evaluates the behavior of the storytelling robot 1 and gives a reward 361 according to the social constructs and social norms the trainer knows. For a given input, the agent adopts the action that maximizes the reward. As a result, the agent adopts behaviors (utterances and gestures) that maximize the user's positive feelings.
 The learning unit 108 uses this explicit reward 361 to generate an explicit reaction system 371. The explicit reaction system may be a model acquired through learning. The explicit reward may be given by the user as an evaluation of the behavior of the storytelling robot 1, or the storytelling robot 1 may estimate the reward from the user's utterances and behavior (gestures, facial expressions, and the like), for example from whether the robot took the action the user wanted.
 During operation, the learning unit 108 outputs the agent 305 using these learned models.
 In the embodiment, the explicit reward, which reflects the user's reaction, is given priority over the implicit reward, because the user's reaction is more reliable in communication.
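 A minimal sketch of this prioritization follows (the fallback scheme is an assumption made for illustration; the patent only states that the explicit reward takes precedence):

# Illustrative sketch: combine rewards, letting an explicit (user-given)
# reward take priority over the implicit (norm-based) one when available.
from typing import Optional

def combined_reward(implicit: float, explicit: Optional[float]) -> float:
    """Explicit user feedback, when present, overrides the implicit reward."""
    if explicit is not None:
        return explicit
    return implicit

print(combined_reward(implicit=0.3, explicit=None))   # falls back to implicit: 0.3
print(combined_reward(implicit=0.3, explicit=1.0))    # explicit wins: 1.0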
 As described above, in the storytelling robot 1 of the embodiment, the learning means performs learning using an implicit reward and an explicit reward; the implicit reward is a reward learned multimodally using the feature information about the person, and the explicit reward is a reward based on the result of evaluating the behavior, generated by the action generation means, of the communication device toward the person.
 The storytelling robot 1 of the embodiment further includes a sound pickup unit that picks up acoustic signals and an imaging unit that captures images including the user. The recognition means applies voice recognition processing to the collected acoustic signal to extract feature information about the voice, and applies image processing to the captured images to extract feature information about the human behavior contained in the images. The feature information about the person includes the feature information about the voice and the feature information about the human behavior; the feature information about the voice is at least one of the audio signal, voice volume information, voice intonation information, and the meaning of the utterance, and the feature information about the human behavior is at least one of the person's facial expression information, gesture information, head posture information, face direction information, gaze information, and the distance between people.
 A program for realizing all or part of the functions of the storytelling robot 1 and the recognition generation unit 140 of the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the storytelling robot 1 and the recognition generation unit 140 may be carried out by loading the program recorded on the recording medium into a computer system and executing it. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a website-providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into the computer system. Furthermore, the "computer-readable recording medium" also includes media that hold the program for a certain period of time, such as the volatile memory (RAM) inside a computer system serving as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
 The program may also be transmitted from a computer system in which it is stored in a storage device or the like to another computer system via a transmission medium or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may also realize only part of the functions described above. Furthermore, it may be a so-called difference file (difference program) that realizes the functions described above in combination with a program already recorded in the computer system.
 Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.
 Reference Signs List: 1 storytelling robot; 101 receiving unit; 102 imaging unit; 103 sound pickup unit; 104 sensor; 100 communication device; 105 recognition unit; 106 storage unit; 107 first database; 108 learning unit; 109 second database; 110 action generation unit; 111 display unit; 112 speaker; 113 actuator; 114 transmitting unit; 140 recognition generation unit; 150 third database; 500 content; 1101 image generation unit; 1102 voice generation unit; 1103 drive unit; 1104 transmission information generation unit; 141 face recognition unit; 142 gesture recognition unit; 143 voice recognition unit; 144 generation unit; 1441 narrator/agent unit; 1442 connection unit; 1443 education unit

Claims (12)

  1.  A storytelling information creation device comprising:
      a generation unit that uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, robot expression routines, image media, and acoustic signals to generate an image to be displayed on an image display device when storytelling is performed, an acoustic signal to be output by the image display device when the storytelling is performed, and an audio signal, facial expression data, and motion data of a storytelling robot when the storytelling is performed.
  2.  The storytelling information creation device according to claim 1, wherein the generation unit comprises:
      a narrator unit for advancing the narration;
      an agency unit that, taking the expressiveness of the storytelling robot into account, activates expressive routines and handles multimedia;
      a connection unit that creates immediate, person-responsive actions that keep the storytelling robot's gaze on a specific person; and
      an education unit that sets questions and answers.
  3.  The storytelling information creation device according to claim 1 or 2, wherein the generation unit is configured as a behavior tree (Behaviour Trees) structure.
  4.  The storytelling information creation device according to claim 1 or 2, wherein the generation unit recognizes the state of the listener based on at least one of the image of the listener, the audio signal obtained by picking up the listener's voice, and the result of detecting the listener's posture, and, based on the recognition result, changes at least one of the image to be displayed on the image display device when the storytelling is performed, the acoustic signal to be output by the image display device when the storytelling is performed, and the audio signal, facial expression data, and motion data of the storytelling robot when the storytelling is performed.
  5.  The storytelling information creation device according to claim 1 or 2, wherein the facial expression data of the storytelling robot are an image corresponding to eyes and an image corresponding to a mouth, and the motion data of the storytelling robot are data relating to the motion of a part corresponding to a neck.
  6.  The storytelling information creation device according to claim 1 or 2, wherein the image display device and the storytelling robot are arranged in close proximity to each other.
  7.  A storytelling robot comprising:
      an imaging unit that captures images;
      a sound pickup unit that picks up acoustic signals;
      a sensor that detects the state of the listener;
      a first display unit corresponding to eyes;
      a second display unit corresponding to a mouth;
      a movable part corresponding to a neck; and
      motion generation means for performing the storytelling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the audio signal, facial expression data, and motion data of the storytelling robot, generated by the storytelling information creation device according to claim 1.
  8.  A storytelling robot comprising:
      an imaging unit that captures images;
      a sound pickup unit that picks up acoustic signals;
      a sensor that detects the state of the listener;
      a first display unit corresponding to eyes;
      a second display unit corresponding to a mouth;
      a movable part corresponding to a neck;
      the storytelling information creation device according to claim 1; and
      motion generation means for performing the storytelling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the audio signal, facial expression data, and motion data of the storytelling robot, generated by the storytelling information creation device.
  9.  The storytelling robot according to claim 7 or 8, which performs storytelling to the listener or facilitates the storytelling for the listener.
  10.  The storytelling robot according to claim 7 or 8, further comprising:
      cognition means for acquiring human information about a person, extracting feature information about the person from the acquired human information, recognizing interactions that occur between the person and a communication device that communicates with the person, and recognizing interactions that occur between people; and
      learning means for multimodally learning human emotional interaction using the extracted feature information about the person,
      wherein the motion generation means generates behavior based on the learned emotional interaction information of the person.
  11.  A storytelling information creation method in which a generation unit uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, robot expression routines, image media, and acoustic signals to generate an image to be displayed on an image display device when storytelling is performed, an acoustic signal to be output by the image display device when the storytelling is performed, and an audio signal, facial expression data, and motion data of a storytelling robot when the storytelling is performed.
  12.  A program that causes a computer to use an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, robot expression routines, image media, and acoustic signals to generate an image to be displayed on an image display device when storytelling is performed, an acoustic signal to be output by the image display device when the storytelling is performed, and an audio signal, facial expression data, and motion data of a storytelling robot when the storytelling is performed.
PCT/JP2022/028936 2021-08-10 2022-07-27 Storytelling information creation device, storytelling robot, storytelling information creation method, and program WO2023017732A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021130727 2021-08-10
JP2021-130727 2021-08-10

Publications (1)

Publication Number Publication Date
WO2023017732A1 true WO2023017732A1 (en) 2023-02-16

Family

ID=85200463

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/028936 WO2023017732A1 (en) 2021-08-10 2022-07-27 Storytelling information creation device, storytelling robot, storytelling information creation method, and program

Country Status (1)

Country Link
WO (1) WO2023017732A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004114285A (en) * 2002-09-02 2004-04-15 Sony Corp Robotic device and its behavior control method
JP2007072719A (en) * 2005-09-06 2007-03-22 Nec Corp Story output system, robot device and story output method
JP2011203859A (en) * 2010-03-24 2011-10-13 Fujitsu Frontech Ltd Device and method for outputting voice
JP2013099823A (en) * 2011-11-09 2013-05-23 Panasonic Corp Robot device, robot control method, robot control program and robot system
JP2017010516A (en) * 2015-06-24 2017-01-12 百度在線網絡技術(北京)有限公司 Method, apparatus, and terminal device for human-computer interaction based on artificial intelligence

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725547A (en) * 2023-11-17 2024-03-19 华南师范大学 Emotion and cognition evolution mode identification method based on cross-modal feature fusion network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855795

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE