WO2023017732A1 - Storytelling information creation device, storytelling robot, storytelling information creation method, and program - Google Patents

Storytelling information creation device, storytelling robot, storytelling information creation method, and program

Info

Publication number
WO2023017732A1
Authority
WO
WIPO (PCT)
Prior art keywords: robot, story, telling, image, listener
Prior art date
Application number
PCT/JP2022/028936
Other languages
French (fr)
Japanese (ja)
Inventor
Randy Gomez
Original Assignee
Honda Motor Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Honda Motor Co., Ltd.
Publication of WO2023017732A1

Classifications

    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63H - TOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
    • A63H11/00 - Self-movable toy figures
    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63H - TOYS, e.g. TOPS, DOLLS, HOOPS OR BUILDING BLOCKS
    • A63H5/00 - Musical or noise-producing devices for additional toy effects other than acoustical
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J13/00 - Controls for manipulators
    • G - PHYSICS
    • G09 - EDUCATION; CRYPTOGRAPHY; DISPLAY; ADVERTISING; SEALS
    • G09B - EDUCATIONAL OR DEMONSTRATION APPLIANCES; APPLIANCES FOR TEACHING, OR COMMUNICATING WITH, THE BLIND, DEAF OR MUTE; MODELS; PLANETARIA; GLOBES; MAPS; DIAGRAMS
    • G09B5/00 - Electrically-operated educational appliances
    • G09B5/06 - Electrically-operated educational appliances with both visual and audible presentation of the material to be studied
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the present invention relates to a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program.
  • This application claims priority based on Japanese Patent Application No. 2021-130727 filed on August 10, 2021, the content of which is incorporated herein.
  • Storytelling is an inherently social activity that allows us humans to connect with others, find meaning in life, construct socially shared meanings, transmit information from person to person, and pass knowledge on to later generations.
  • Cognitive psychologist Jerome Bruner points out that storytelling serves the dual purpose of creating meaning (making the strange familiar) and defining the individual and collective self.
  • For reading aloud, there are, for example, audiobooks read by narrators. Audiobooks are provided on media such as cassette tapes and CDs (compact discs), and are also provided through distribution services over the Internet. A document reading support device has been proposed that supports such reading by estimating an utterance style in consideration of the context (see, for example, Patent Document 1).
  • Aspects of the present invention have been made in view of the above problems, and an object thereof is to provide a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program that can adapt the story-telling to the user.
  • A story-telling information creation device includes a generation unit that uses an image of a listener, an audio signal obtained by collecting the voice of the listener, a result of detecting the posture of the listener, a robot expression routine, an image medium, and an acoustic signal to generate an image to be displayed on an image display device when story-telling is performed, an acoustic signal to be output to the image display device when the story-telling is performed, and a voice signal, facial expression data, and action data of the story-telling robot when the story-telling is performed.
  • The generation unit includes a narrator unit for advancing the narration, an agency unit that takes the expressive power of the story-telling robot into consideration, activates expressive routines, and supports multimedia, a linking unit that creates immediate actions in response to a person, such as keeping the story-telling robot's line of sight on a specific person, and an education unit that sets questions and answers.
  • the generator may be configured with a behavior tree structure.
  • The generation unit may recognize the state of the listener based on at least one of the image of the listener, the audio signal obtained by collecting the listener's voice, and the result of detecting the listener's posture, and, based on the recognition result, may change at least one of the image to be displayed on the image display device when the story-telling is performed, the acoustic signal to be output to the image display device when the story-telling is performed, and the voice signal, facial expression data, and action data of the story-telling robot when the story-telling is performed.
  • The facial expression data of the story-telling robot may be an image corresponding to the eyes and an image corresponding to the mouth, and the action data of the story-telling robot may be data relating to the motion of the portion corresponding to the neck.
  • the image display device and the story-telling robot may be arranged close to each other.
  • A story-telling robot includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display portion corresponding to the eyes, a second display portion corresponding to the mouth, a movable portion corresponding to the neck, and an action generating means that causes the story-telling robot to perform the story-telling using the image to be displayed on the image display device, the acoustic signal to be output to the image display device, and the voice signal, facial expression data, and action data of the story-telling robot.
  • A story-telling robot includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display portion corresponding to the eyes, a second display portion corresponding to the mouth, a movable portion corresponding to the neck, the story-telling information creation device according to any one of the above aspects (1) to (6), and an action generating means that causes the story-telling robot to perform the story-telling using the image to be displayed on an image display device, the acoustic signal to be output to the image display device, and the voice signal, facial expression data, and action data of the story-telling robot.
  • The story may be read aloud to the listener, or the listener may be encouraged to read aloud.
  • The story-telling robot may include a communication device that acquires human information about a person, extracts feature information about the person from the acquired human information, and communicates with the person, the communication device comprising cognitive means for recognizing actions that occur between the robot and the person or between multiple people, and learning means for multimodal learning of people's emotional interactions using the recognized actions and the extracted feature information about the person, wherein the action generating means may generate actions based on the learned emotional interaction information of the person.
  • In a story-telling information creation method, a generation unit uses an image of a listener, an audio signal obtained by collecting the listener's voice, a result of detecting the listener's posture, a robot expression routine, image media, and an acoustic signal to generate an image to be displayed on an image display device when story-telling is performed, an acoustic signal to be output to the image display device when the story-telling is performed, and a voice signal, facial expression data, and motion data of the story-telling robot when the story-telling is performed.
  • A program causes a computer to generate, using an image of a listener, an audio signal obtained by collecting the voice of the listener, a result of detecting the posture of the listener, a robot expression routine, an image medium, and an acoustic signal, an image to be displayed on an image display device when story-telling is performed, an acoustic signal to be output to the image display device when the story-telling is performed, and a voice signal, facial expression data, and action data of a story-telling robot when the story-telling is performed.
  • FIG. 1 is a diagram showing a communication example of the story-telling robot according to the embodiment.
  • FIG. 2 is a diagram showing an example of the outline of the story-telling robot according to the embodiment.
  • FIG. 3 is a diagram showing an example of story-telling by the story-telling robot while projecting a video of the story on an image display device.
  • FIG. 4 is a diagram showing a first example of an image displayed on the image display device during story-telling and of the actions and facial expressions of the story-telling robot.
  • FIG. 5 is a diagram showing a second example of an image displayed on the image display device during story-telling and of the actions and facial expressions of the story-telling robot.
  • FIG. 6 is a diagram showing an example of perceptual elements.
  • FIG. 7 is a diagram showing an implementation example of story-telling according to the embodiment.
  • FIG. 8 is a diagram showing an example of a narrator node.
  • FIG. 9 is a diagram showing an example of an agency node.
  • FIG. 10 is a diagram showing an example of a connection node.
  • FIG. 11 is a diagram showing an example of an action tree configuration for story-telling.
  • FIG. 12 is a block diagram showing a configuration example of the story-telling robot according to the embodiment.
  • FIG. 13 is a diagram showing a configuration example of the recognition generation unit according to the embodiment.
  • FIG. 14 is a flowchart showing an example of the procedure of the story-telling process of the story-telling robot according to the embodiment.
  • FIG. 15 is a diagram showing an example of a first evaluation result.
  • FIG. 16 is a diagram showing an example of a second evaluation result.
  • FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device of the embodiment.
  • FIG. 18 is a diagram showing an example of data recognized by the recognition unit according to the embodiment.
  • the story-telling robot 1 performs story-telling according to, for example, the user's state or situation (g17).
  • In the following description, the "story-telling robot" is also simply referred to as the "robot".
  • the storytelling robot 1 communicates with an individual or a plurality of people 2 .
  • Communication is mainly dialogue g11 and gesture g12 (movement). Actions are represented by images displayed on the display in addition to actual actions.
  • When the story-telling robot 1 receives an e-mail, it informs the user that the e-mail has been received and of its content (g13). Further, when a reply to the e-mail is required, the story-telling robot 1 makes a proposal g14 by asking the user, for example, whether advice is needed, and then sends a reply (g15). In addition, the story-telling robot 1 presents a weather forecast g16 for a location according to the scheduled date, time, and place in the user's schedule.
  • The story-telling robot 1 of the present embodiment generates the social ability of the robot so that an emotional connection can be formed between the robot and the person, for example according to the person's reactions or actions, and can thereby communicate with people.
  • The story-telling robot 1 can communicate with a person by sympathizing with the person on an emotional level. The story-telling robot 1 then adapts its story-telling to the situation of the person to whom the story is read aloud.
  • FIG. 2 is a diagram showing an example of the outline of the storytelling robot 1 according to the present embodiment.
  • the storytelling robot 1 has three display units 111 (111a, 111b, 111c).
  • the imaging unit 102a is attached above the display unit 111a
  • the imaging unit 102b is attached above the display unit 111b.
  • the display units 111a and 111b correspond to the human eye and present images and image information corresponding to the human eye.
  • the screen size of the display units 111a and 111b is, for example, 3 inches.
  • the speaker 112 is attached to the housing 120 in the vicinity of the display section 111c that displays an image corresponding to the human mouth.
  • the display unit 111c is composed of, for example, a plurality of LEDs (light emitting diodes), each of which can be addressed and can be individually turned on and off.
  • the sound pickup unit 103 is attached to the housing 120 .
  • the storytelling robot 1 also includes a boom 121 .
  • Boom 121 is movably attached to housing 120 via movable portion 131 .
  • a horizontal bar 122 is rotatably attached to the boom 121 via a movable portion 132 .
  • A display portion 111a is rotatably attached to the horizontal bar 122 via a movable portion 133, and a display portion 111b is rotatably attached to the horizontal bar 122 via another movable portion.
  • The outer shape of the story-telling robot 1 shown in FIG. 2 is an example, and is not limited to this.
  • The story-telling robot 1 has, for example, five degrees of freedom of motion (base rotation, neck leaning, eye stroke, eye rotation, and eye tilt), which enable expressive movements.
  • The story-telling robot 1 can then communicate via text-to-speech (TTS), animation routines (open-loop combinations of movements, sounds, and eye expressions conveying specific emotions), a projected screen, and the like.
  • FIG. 3 is a diagram showing an example of story-telling by the story-telling robot 1 while projecting the image of the story on the image display device 7 .
  • the storytelling robot 1 is placed, for example, on the table Tab and in front of the image display device 7 .
  • the story-telling robot 1 may be placed beside the image display device 7 or the like. That is, the image display device 7 and the story-telling robot 1 are arranged close to each other.
  • the image displayed on the image display device 7 may be a still image or a moving image.
  • The video also includes music that matches the images.
  • the image display device 7 has a speaker, and reproduces music included in the video from the speaker.
  • FIGS. 4 and 5 are diagrams showing images displayed on the image display device 7 during story-telling according to the present embodiment, and examples of actions and facial expressions of the story-telling robot 1.
  • In FIG. 4, an image g201 is a first example of an image displayed on the image display device 7, and images g203 and g205 are examples of the motion and facial expression of the story-telling robot 1 during story-telling.
  • The image g203 shows the action and facial expression of being surprised, and the image g205 shows the action and facial expression of being disappointed.
  • In FIG. 5, an image g211 is a second example of an image displayed on the image display device 7, and images g213 and g215 are examples of the actions and facial expressions of the story-telling robot 1 during story-telling.
  • The image g213 shows the action and facial expression when happy, and the image g215 shows the action and facial expression when amused.
  • During story-telling, the story-telling robot 1 moves the parts corresponding to the neck and the eyes according to the content of the story and the state of the user Hu, and changes its action and facial expression by changing the images on the display parts corresponding to the eyes and the mouth.
  • the images shown in FIGS. 4 and 5 and the facial expressions and actions of the storytelling robot 1 are examples, and are not limited to these.
  • The story-telling robot 1 may also emit exclamations such as "Oh!".
  • facial expressions and actions of the storytelling robot 1 are designed to enhance persuasiveness and interactivity, and reflect a realistic repertoire of gestures that a human narrator would make.
  • facial expressions include, for example, eye movements, facial expressions of eyes, mouth facial expressions due to the shape of the mouth, and the like.
  • Actions include, for example, neck movements, eye movements, etc., as described above.
  • the story-telling robot 1 recognizes the environment using sight and hearing in addition to the five degrees of freedom described above.
  • the story-telling robot 1 acquires visual information using the data captured by the imaging unit 102 included in the story-telling robot 1 .
  • the imaging unit 102 includes, for example, an RGB camera and a depth sensor or a distance sensor.
  • the storytelling robot 1 acquires auditory information using an acoustic signal picked up by the sound pickup unit 103 of the storytelling robot 1 .
  • the sound pickup unit 103 is a microphone array including a plurality of microphones.
  • FIG. 6 is a diagram illustrating an example of perceptual elements.
  • the storytelling robot 1 perceives the voice direction, voice recognition, the listener (user), the direction of the listener's (user's) head, the direction of the listener's body, the position of the hands, etc., as in the image g301.
  • An image g304 is an image captured by the capturing unit 102 provided in the storytelling robot 1, and is a viewpoint seen from the robot.
  • Image g302 is an image obtained by extracting an image including the listener's face from image g304.
  • An image g303 is an image resulting from performing face authentication on the image g302 by a well-known technique.
  • the listener's name is Fred, and 2 is assigned as identification information (ID).
  • a face recognition unit 141 performs face recognition
  • a gesture recognition unit 142 performs gesture recognition
  • a voice recognition unit 143 performs voice recognition, estimation of a sound source direction, and the like.
  • the story-telling robot 1 acquires position information using the acquired information captured by the imaging unit 102 .
  • the storytelling robot 1 then obtains the poses of the different human parts (limbs, body, head), for example, using well-known skeleton detection (see, for example, Reference 1).
  • the storytelling robot 1 detects various gestures such as a hand waving gesture and a pointing gesture by a well-known gesture detection method (for example, Japanese Patent Application Laid-Open No. 2021-031630), Estimate the pointing direction of a person's hand.
  • the storytelling robot 1 can identify the face by comparing with the information registered in the third database 150, and can also identify facial features such as a smile when the person is nearby.
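  • As one illustration (not taken from the patent itself), such a comparison against the entries registered in the third database 150 could be implemented by matching face embeddings; the embedding extractor, vector size, and acceptance threshold below are assumptions made for the sketch.

```python
# Hypothetical sketch: identifying a face by comparing an embedding vector
# against entries registered in a database of names, IDs, and embeddings.
# The 128-dimensional embedding and the 0.6 threshold are illustrative
# assumptions, not details taken from the patent.
import numpy as np

def identify_face(query_embedding, database):
    """Return (id, name) of the closest registered face, or (None, None) if no match."""
    best_id, best_name, best_score = None, None, -1.0
    for person_id, entry in database.items():
        ref = entry["embedding"]
        # Cosine similarity between the query face and a registered face.
        score = float(np.dot(query_embedding, ref) /
                      (np.linalg.norm(query_embedding) * np.linalg.norm(ref)))
        if score > best_score:
            best_id, best_name, best_score = person_id, entry["name"], score
    if best_score < 0.6:          # assumed acceptance threshold
        return None, None
    return best_id, best_name

# Example: the listener "Fred" registered with ID 2, as in FIG. 6.
db = {2: {"name": "Fred", "embedding": np.random.rand(128)}}
print(identify_face(db[2]["embedding"], db))   # -> (2, 'Fred')
```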
  • the storytelling robot 1 identifies the sound source direction and separates the sound sources from the sound signal collected by the sound collection unit 103 using a technique such as beamforming. Note that the storytelling robot 1 may perform processing such as noise suppression and utterance section estimation on the collected sound signal. Further, the story-telling robot 1 converts the separated speech whose speaker is specified into text and understands the language. Thereby, the story-telling robot 1 obtains auditory information.
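  • For illustration, the following is a minimal delay-and-sum beamforming sketch showing one way a sound source direction could be estimated from a multi-channel microphone-array signal; the linear array geometry, microphone spacing, and sampling rate are assumed values, not details from the patent.

```python
# Illustrative delay-and-sum beamformer for a linear microphone array.
# The array geometry and sampling rate are assumptions for the sketch.
import numpy as np

SPEED_OF_SOUND = 343.0   # m/s
FS = 16000               # sampling rate [Hz], assumed
MIC_SPACING = 0.04       # spacing between adjacent microphones [m], assumed

def steer_delays(angle_deg, num_mics):
    """Per-microphone delays (in samples) for a plane wave arriving from angle_deg."""
    angle = np.deg2rad(angle_deg)
    positions = np.arange(num_mics) * MIC_SPACING
    return positions * np.sin(angle) / SPEED_OF_SOUND * FS

def delay_and_sum_power(frames, angle_deg):
    """Output power of the beamformer steered to angle_deg.
    frames: array of shape (num_mics, num_samples)."""
    num_mics, _ = frames.shape
    delays = steer_delays(angle_deg, num_mics)
    shifted = [np.roll(ch, -int(round(d))) for ch, d in zip(frames, delays)]
    beam = np.mean(shifted, axis=0)
    return float(np.mean(beam ** 2))

def estimate_direction(frames, candidates=range(-90, 91, 5)):
    """Scan candidate angles and return the one with maximum beamformer power."""
    return max(candidates, key=lambda a: delay_and_sum_power(frames, a))

# Example with synthetic 4-channel frames.
frames = np.random.randn(4, 1024)
print(estimate_direction(frames))
```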
  • The role of the narrator: the narrator tells the story and sets the point of view.
  • In storytelling, the narrator speaks live to tell the story to the listener, and reading aloud can improve story comprehension and attention.
  • When telling a story by voice, especially in order to attract children, it is necessary to aim for rich expressiveness and to speak according to the situation. For example, when parents tell stories to their children, they often "act out" the characters and the story in their own words, exaggerating, varying, and expressively conveying the words.
  • The expressiveness of the narrator's voice contains social and emotional cues that call for attention and help draw the audience into the characters and the world of the story. In a live performance, the narrator uses direct gaze and gestures to highlight key points in the story, directing attention and interest.
  • An example of the content 500 (FIG. 7) used for story-telling will be described.
  • the content 500 used in the embodiments was designed by the author to, for example, support audience agency and engagement, facilitate learning of the educational value of stories, and support the role of storyteller. These components consist of narrative, animated projections, sound, and robot performance and interaction with the audience.
  • the stories were created in collaboration with writers and educators.
  • the creative director collaborated with designers, illustrators, animators, voice artists, and sound designers to create visuals, character performances, and sounds that projected story elements.
  • The content enhances the robot's persuasiveness and its empathy with the characters in the story.
  • a gesture is an action that embodies the meaning associated with speech. For example, when speaking a positive opinion, the story-telling robot 1 realizes a gesture by “nodding”.
  • An emblem is an idiomatic sign with an agreed upon meaning, such as a smile to express happiness. Therefore, the storytelling robot 1 realizes an emblem by making a smiling expression to express happiness.
  • a pantomime is a gesture or series of gestures that tells a story, usually without words.
  • The robot can mimic and reflect the sadness of a character in the story by exaggeratedly bending forward or lowering its eyes.
  • the pantomime was designed with particular attention to eye movements.
  • To maintain a sense of active listening and engagement with the robot, a library of eye movements that are deployed during emotional routines was designed and stored in the third database 150 of the story-telling robot 1. These consist of subtly distributed saccadic movements.
  • eye-tracking is used to change the line-of-sight direction to support interaction and engagement with the listener.
  • Because the creative story-telling content is designed to be experienced through interaction between the robot and the listener, narrated reading, and a live experience using the image display device 7, it is important to define the role of the robot in the story-telling. Therefore, in the embodiment, creative content such as the image display device 7 and robot gesture performance is used to overcome the limitations and problems of speech transmission by a speech synthesis system.
  • the first role uses the human narrator's voice as the storytelling element of the project.
  • The robot then takes on the role of facilitator of the story, combining gestures, sympathetic reactions, and simple questions and answers according to the story to interact with the listener.
  • the audio signal for this role is a recording of a human narrator's vocal performance when telling a story expressively.
  • the robot's speech synthesis system then combined it with appropriate gestural movements for question-answer interactions.
  • the second role is that of the robot as a narrator. It combines a robot speech synthesis system for both storytelling and question-answering, as well as robot gestures.
  • The framework needs to be able to define the robot's behavior and access the perceptual results by combining different actuation components and modalities. It is preferable that such a "choreographer of storytelling", in other words the programming performed on the robot to achieve this purpose, can be programmed easily even by a person who does not specialize in robotics engineering.
  • the framework should be versatile enough to consider different narratives, and preferably reuse elements to create new narratives and actions.
  • Behavior trees were adopted as the basis for this "choreographer of storytelling".
  • An action tree is a method for structuring the flow and control of multiple tasks in a decision-making application.
  • An action tree models actions as a hierarchical tree composed of nodes. The tree is traversed from top to bottom at a constant rate according to well-defined rules, executing tasks and commands encountered along the way. Note that the status of tasks and commands is reported up the chain, and the flow changes accordingly.
  • Nodes are classified according to their function, for example, as follows. I. Composite: controls the flow of the tree itself; these nodes resemble control structures in traditional programming languages. II. Decorator: processes or modifies the status received from its child. III. Leaf: where the actual tasks are performed; these contain the atomic tasks or other functionality that the robot can perform, and therefore cannot have children.
  • The action tree naturally separates the logic from the actual tasks. When developing a tree, only the leaf nodes need to be implemented; the flow can be defined later and continually rearranged to create new story-telling actions or to extend what is already being done.
  • An important advantage of action trees is composability and reusability, owing to the tree's hierarchical nature; a minimal sketch follows.
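  • As an illustration only (not the implementation used in the embodiment), the following Python sketch shows the Composite/Decorator/Leaf structure and the top-down tick with status reporting described above.

```python
# Minimal behavior-tree sketch: composite nodes control the flow, a decorator
# modifies a child's status, leaf nodes perform atomic tasks, and the tree is
# ticked from the top, with each node reporting a status back up the chain.
from enum import Enum

class Status(Enum):
    SUCCESS = 0
    FAILURE = 1
    RUNNING = 2

class Leaf:
    """Leaf node: wraps an atomic task the robot can perform."""
    def __init__(self, name, task):
        self.name, self.task = name, task
    def tick(self):
        return self.task()

class Sequence:
    """Composite node: runs children in order, stops at the first non-success."""
    def __init__(self, *children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status
        return Status.SUCCESS

class Inverter:
    """Decorator node: swaps the SUCCESS/FAILURE status received from its child."""
    def __init__(self, child):
        self.child = child
    def tick(self):
        status = self.child.tick()
        if status == Status.SUCCESS:
            return Status.FAILURE
        if status == Status.FAILURE:
            return Status.SUCCESS
        return status

# Example tick: a leaf that always succeeds, inside a sequence.
tree = Sequence(Leaf("say_hello", lambda: Status.SUCCESS))
print(tree.tick())   # Status.SUCCESS
```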
  • FIG. 7 is a diagram showing an implementation example of storytelling according to the present embodiment.
  • The content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (recorded audio, prompts for speech synthesis, etc.) 503.
  • The generation unit 144 (FIG. 12; story-telling information creation device) includes, for example, a narrator/agency unit 1441, a linking unit 1442, and an education unit 1443.
  • The generation unit 144 uses the input from the sensors (the imaging unit 102, the sound pickup unit 103, and the sensor 104 (FIG. 12)) and the content 500 to generate information to be output to the story-telling robot 1 and information to be output to the image display device 7.
  • The generation unit 144 may be provided in the story-telling robot 1, or another device may include the generation unit 144.
  • These nodes can be used, for example, in a GUI (Graphical User Interface) so that, when creating new stories, developers can "drag and drop" them and combine them with Composite and Decorator nodes to create new story-telling applications.
  • Examples of predefined nodes used in the action tree for story-telling are shown in FIGS. 8 to 11.
  • FIG. 8 is a diagram showing an example of a narrator node. The narrator node advances the story.
  • FIG. 9 is a diagram showing an example of an agency node. Agency nodes provide multimedia and presentation support.
  • FIG. 10 is a diagram showing an example of a connection node. The connection node provides a closed-loop immediate reaction of the robot.
  • FIG. 11 is a diagram showing an example of an action tree configuration for storytelling.
  • The recognition generation unit 140 can recognize the reaction of the listener based on at least one of an image of the listener, an audio signal obtained by picking up the listener's voice, and a result of detecting the listener's posture. Note that the structures, connection relationships, nodes, and the like shown in FIGS. 8 to 11 are examples, and are not limited to these.
  • Narrator unit (1441): various blocks are created to advance the narration.
  • AudioPlay: allows the narrator to play pre-recorded audio and controls various aspects of the playback.
  • SpeakTTS (speech synthesis): using a text-to-speech engine, the robot utters a programmed sentence. This enables the story-telling robot 1 to react according to the situation during story-telling. In addition, the prosody of the speech can be controlled, for example, using tags.
  • Agency: contains multiple blocks (e.g., ExecuteRoutine, ProjectorImage, and ProjectorVideo) that take into account the expressiveness of the robot, launch expressive routines, and support multimedia.
  • ExecuteRoutine allows execution of a predefined robot expression routine. These routines are open-loop combinations of all the robot's behavioral modalities, including movements, sounds, eyes and mouths, to express various emotions such as joy and romance.
  • ProjectorImage and ProjectorVideo use the image display device 7 to display still images and animations, respectively.
  • Linking unit (1442): comprises blocks (e.g., TrackPerson) that create immediate actions in response to people, such as keeping the robot's line of sight on a particular person.
  • TrackPerson tracks detected persons using perceptual functions (implemented as an additional block, e.g., GetPeople, which returns information about all persons in the field of view). It can be combined with other blocks (e.g., GetClosestPerson) to find and track the closest person or to consider other proximity conditions.
  • Education unit (1443): comprises several blocks (e.g., AskQuestion, ListensPerson, GetASR) related to setting questions and answers.
  • AskQuestion asks a question using speech synthesis and sets the expected answer in the action tree memory.
  • ListensPerson indicates which person to listen to when there is a group of people.
  • GetASR obtains recognized speech from a specified person. This can be used to compare the expected answer to a yes-or-no question, or to have the robot respond appropriately based on the result. A sketch that composes these blocks into a story-telling tree is shown below.
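  • The following sketch is one possible composition of the blocks named above into a story-telling scene, reusing the Status/Leaf/Sequence classes from the earlier behavior-tree sketch. The placeholder functions and constructor arguments are assumptions made for illustration; the patent defines the blocks only functionally.

```python
# Illustrative composition of the predefined blocks (AudioPlay, ExecuteRoutine,
# ProjectorImage, TrackPerson, AskQuestion, GetASR) into one scene of a story.
# Reuses Status, Leaf, and Sequence from the previous sketch.  The placeholder
# functions stand in for the robot's actual actuation and perception calls.

def _placeholder(*_args, **_kwargs):
    # Stand-in for playing audio, running a routine, projecting an image,
    # tracking a person, asking a question, or reading the ASR result.
    return Status.SUCCESS

play_audio = run_routine = project_image = track_person = _placeholder
ask_question = get_asr = _placeholder

def block(name, fn, *args, **kwargs):
    """Wrap one predefined block as a leaf node of the tree."""
    return Leaf(name, lambda: fn(*args, **kwargs))

# One scene: show the picture, keep eye contact, narrate, react with an
# expression routine, then quiz the listener and check the answer.
scene = Sequence(
    block("ProjectorImage", project_image, "scene_01.png"),
    block("TrackPerson", track_person),
    block("AudioPlay", play_audio, "scene_01_narration.wav"),
    block("ExecuteRoutine", run_routine, "surprised"),
    block("AskQuestion", ask_question, "Did the fox find the grapes?", expected="no"),
    block("GetASR", get_asr, expected="no"),
)
print(scene.tick())   # Status.SUCCESS with the placeholder implementations
```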
  • FIG. 12 is a block diagram showing a configuration example of the storytelling robot 1 according to this embodiment.
  • The story-telling robot 1 includes a receiving unit 101, an imaging unit 102, a sound pickup unit 103, a sensor 104, a communication device 100, a storage unit 106, a first database 107, a second database 109, a display unit 111, a speaker 112, an actuator 113, a transmitter 114, a third database 150, and content 500.
  • the communication device 100 includes a recognition section 105 (cognition means), a learning section 108 (learning means), an action generation section 110 (action generation means), and a recognition generation section 140 .
  • the motion generating section 110 includes an image generating section 1101 , a sound generating section 1102 , a driving section 1103 and a transmission information generating section 1104 .
  • the receiving unit 101 acquires information (e.g., e-mail, blog information, news, weather forecast, etc.) from, for example, the Internet via a network, and outputs the acquired information to the recognition unit 105 and the action generation unit 110.
  • the receiving unit 101 acquires information from the first database 107 on the cloud and outputs the acquired information to the recognition unit 105 .
  • the imaging unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) imaging device, a CCD (Charge Coupled Device) imaging device, or the like.
  • the imaging unit 102 also includes a depth sensor.
  • the photographing unit 102 outputs photographed images (human information, which is information about a person; still images, continuous still images, moving images) and depth information to the recognition unit 105 and the action generation unit 110 .
  • the storytelling robot 1 may include a plurality of imaging units 102 . In this case, the imaging unit 102 may be attached to the front and rear of the housing of the storytelling robot 1, for example. Note that the imaging unit 102 may be a distance sensor.
  • the sound pickup unit 103 is, for example, a microphone array composed of a plurality of microphones.
  • the sound pickup unit 103 outputs acoustic signals (human information) picked up by a plurality of microphones to the recognition unit 105 and the action generation unit 110 .
  • the sound pickup unit 103 may sample each sound signal picked up by the microphone using the same sampling signal, convert the analog signal into a digital signal, and then output the signal to the recognition unit 105 .
  • The sensor 104 includes, for example, a temperature sensor that detects the ambient temperature, an illuminance sensor that detects the ambient illuminance, a gyro sensor that detects the tilt of the housing of the story-telling robot 1, an acceleration sensor that detects the movement of the housing of the story-telling robot 1, an atmospheric pressure sensor that detects the atmospheric pressure, and the like.
  • the sensor 104 outputs the detected value to the recognition unit 105 and the motion generation unit 110 . Note that the depth sensor may be included in the sensor 104 .
  • the storage unit 106 stores, for example, items to be recognized by the recognition unit 105, various values (threshold values, constants) used for recognition, algorithms for recognition, and the like.
  • The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used for speech recognition, as well as a comparison image database and image features used for image recognition. Each kind of data and feature amount will be described later. Note that the first database 107 may be placed on the cloud or may be connected via a network.
  • the second database 109 stores data related to relationships between people, such as social components, social norms, social customs, psychology, and humanities, which are used during learning. Note that the second database 109 may be placed on the cloud or may be connected via a network.
  • The communication device 100 recognizes an action that occurs between the story-telling robot 1 and a person, or an action that occurs between multiple people, and learns human emotional interactions based on the recognized content and the data stored in the second database 109. The communication device 100 then generates the social ability of the story-telling robot 1 from the learned content.
  • the social ability is, for example, the ability to interact between people, such as dialogue, behavior, understanding, empathy, etc. between people.
  • the recognition unit 105 recognizes an action that occurs between the storytelling robot 1 and a person, or an action that occurs between multiple people.
  • the recognizing unit 105 acquires the image captured by the capturing unit 102 , the acoustic signal collected by the sound collecting unit 103 , and the detection value detected by the sensor 104 .
  • the recognition unit 105 may acquire information received by the reception unit 101 .
  • Based on the acquired information and the data stored in the first database 107, the recognition unit 105 recognizes an action that occurs between the story-telling robot 1 and a person, or an action that occurs between multiple persons. The recognition method will be described later.
  • the recognizing unit 105 outputs the recognized recognition result (feature amount related to sound, feature information related to human behavior) to the learning unit 108 .
  • the recognition unit 105 performs well-known image processing (for example, binarization processing, edge detection processing, clustering processing, image feature amount extraction processing, etc.) on the image captured by the imaging unit 102 .
  • the recognition unit 105 performs well-known speech recognition processing (sound source identification processing, sound source localization processing, noise suppression processing, speech section detection processing, sound source extraction processing, acoustic feature amount calculation processing, etc.) on the acquired acoustic signal.
  • The recognition unit 105 extracts the acoustic signal (or voice signal) of a target person, animal, or object from the acquired acoustic signal based on the recognition result, and outputs the extracted signal to the motion generation unit 110 as a recognition result.
  • the recognition unit 105 extracts an image of a target person or object from the acquired image based on the recognition result, and outputs the extracted image to the action generation unit 110 as a recognition result.
  • the learning unit 108 uses the recognition results output by the recognition unit 105 and the data stored in the second database 109 to learn human emotional interactions.
  • a learning unit 108 stores a model generated by learning. The learning method will be described later.
  • The recognition generation unit 140 recognizes the reaction of the listener based on at least one of an image of the listener, an audio signal obtained by picking up the voice of the listener, and a result of detecting the posture of the listener, and outputs information based on the recognition result to the motion generator 110 and the image display device 7.
  • a configuration example and an operation example of the recognition generation unit 140 will be described later with reference to FIG. 13 .
  • the motion generation unit 110 generates facial expression information and motion information of the robot based on the information generated by the generation unit 144 (FIG. 13) included in the recognition generation unit 140 when reading aloud.
  • the motion generation unit 110 acquires information received from the reception unit 101 , images captured by the imaging unit 102 , acoustic signals collected by the sound collection unit 103 , and recognition results from the recognition unit 105 .
  • the action generation unit 110 generates actions (utterances, gestures, images) for the user based on the learned result and the acquired information.
  • the image generation unit 1101 generates an output image (still image, continuous still images, or moving image) to be displayed on the display unit 111 based on the learned result and the acquired information, and outputs the generated output image. Displayed on the display unit 111 .
  • the action generating unit 110 causes the display unit 111 to display an animation such as a facial expression, present an image to be presented to the user, and communicate with the user.
  • The displayed images include images corresponding to the movements of a person's eyes, images corresponding to the movements of a person's mouth, information related to the user's destination (maps, weather maps, weather forecasts, information on shops and resorts, and so on), and an image of a person calling the user via the Internet line.
  • the audio generation unit 1102 generates an output audio signal to be output to the speaker 112 based on the learned result and the acquired information, and causes the speaker 112 to output the generated output audio signal.
  • the action generator 110 causes the speaker 112 to output an audio signal to communicate with the user.
  • the voice signals to be output are voice signals assigned to the storytelling robot 1, voice signals of a person calling the user via the Internet line, and the like.
  • the drive unit 1103 generates a drive signal for driving the actuator 113 based on the learned result and the acquired information, and drives the actuator 113 with the generated drive signal.
  • the motion generating unit 110 controls the motion of the storytelling robot 1 to express emotions and communicate with the user.
  • Based on the learned result and the acquired information, the transmission information generation unit 1104 generates transmission information (an audio signal, an image) and transmits the generated transmission information from the transmission unit 114.
  • the display unit 111 is a liquid crystal image display device, an organic EL (Electro Luminescence) image display device, or the like. Display unit 111 displays an output image output by image generation unit 1101 of communication device 100 .
  • the speaker 112 outputs the output audio signal output by the audio generation unit 1102 .
  • the actuator 113 drives the action section according to the drive signal output by the drive section 1103 .
  • the transmission unit 114 transmits the transmission information output by the transmission information generation unit 1104 to the transmission destination via the network.
  • The third database 150 associates names and identification information with listener facial images and stores them, and also stores the library of eye movements deployed during emotional routines.
  • the third database 150 also stores data used for gesture recognition, language models used for speech recognition, and the like. Note that the third database 150 may be placed on the cloud or may be connected via a network.
  • The content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (recorded audio, prompts for speech synthesis, etc.) 503.
  • the content 500 may be placed on the cloud or may be connected via a network.
  • the story-telling robot 1 communicates with an individual or a plurality of people 2 using, for example, the method described in Japanese Patent Application No. 2020-108946.
  • FIG. 13 is a diagram showing a configuration example of the recognition generation unit 140 according to this embodiment.
  • the recognition generation unit 140 includes, for example, a face recognition unit 141, a gesture recognition unit 142, a voice recognition unit 143, and a generation unit 144, as shown in FIG.
  • the generation unit 144 also includes, for example, a narrator/agency unit 1441, a connection unit 1442, and an education unit 1443, as described above.
  • the recognition generation unit 140 performs face recognition, gesture recognition, and voice recognition.
  • the recognition generation unit 140 generates information to be output to the action generation unit 110 and information to be output to the image display device 7 using inputs from the sensors (image capturing unit 102, sound collection unit 103, sensor 104) and the content 500. do.
  • the face recognition unit 141 refers to the data stored in the third database 150 and uses a well-known method to recognize the face of the person included in the captured image. When the data is not stored in the third database 150, the face recognition unit 141 adds a name and identification information to the recognized face image and stores it in the third database.
  • The gesture recognition unit 142 refers to the data stored in the third database 150 and, using a known technique, detects and tracks the position and tilt of the head, the body orientation, and the hand positions of the person included in the captured image.
  • the speech recognition unit 143 refers to the data stored in the third database 150 and uses well-known techniques to perform processing such as sound source localization, sound source separation, noise suppression, speech segment detection, and speaker identification.
  • The narrator/agency unit 1441 is a trained model that was trained using, as teacher data, information detected by the sensors during reading, such as the human narrator's body and facial expressions, proximity movements, line of sight, eye saccades, way of speaking, tone of voice, volume, and the like. At the time of story-telling, the narrator/agency unit 1441 receives the information detected by the sensors and outputs, for example, how to move the neck, the line of sight, the movements of the eyes and mouth, the way of speaking, the tone of voice, the volume, and the like.
  • The linking unit 1442 is a trained model that was trained using, as teacher data, the information detected by the sensors during reading and the information output by the narrator/agency unit 1441, for example, actions with immediacy that respond to the actions of the listener. At the time of story-telling, the linking unit 1442 receives at least one of the information detected by the sensors and the information output by the narrator/agency unit 1441, and outputs, for example, immediate actions according to the actions of the listener.
  • The education unit 1443 is a trained model that was trained using, as teacher data, the information detected by the sensors during reading and the information output by the linking unit 1442. At the time of story-telling, the education unit 1443 receives at least one of the information detected by the sensors and the information output by the linking unit 1442, and outputs, for example, information that encourages positive values and two-way conversation between the narrator and the listener.
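  • As a rough illustration of how these three learned models could be cascaded at run time, the following sketch wires the sensor information through the narrator/agency, linking, and education stages. The interfaces and field names are assumptions; the patent describes the units only in terms of their inputs and outputs.

```python
# Hedged sketch of one possible cascade of the generation unit 144's learned
# models (narrator/agency unit 1441, linking unit 1442, education unit 1443).
# The dataclass fields and model interfaces are assumptions for illustration.
from dataclasses import dataclass, field

@dataclass
class SensorInfo:
    image: object = None      # frame from the imaging unit 102
    audio: object = None      # signal from the sound pickup unit 103
    posture: object = None    # detection result from the sensor 104

@dataclass
class RobotOutputs:
    gaze: str = "center"
    neck_motion: str = "idle"
    eye_expression: str = "neutral"
    speech_style: dict = field(default_factory=dict)
    immediate_action: str = ""
    prompt: str = ""

def generate_step(sensors, narrator_agency, linking, education):
    out = RobotOutputs()
    # 1441: expressive delivery (neck, gaze, eyes/mouth, way of speaking, tone, volume).
    delivery = narrator_agency(sensors)
    out.neck_motion = delivery.get("neck", out.neck_motion)
    out.gaze = delivery.get("gaze", out.gaze)
    out.eye_expression = delivery.get("eyes", out.eye_expression)
    out.speech_style = delivery.get("speech", out.speech_style)
    # 1442: immediate reactions to the listener, given the sensors and 1441's output.
    out.immediate_action = linking(sensors, delivery)
    # 1443: prompts that encourage two-way conversation, given 1442's output.
    out.prompt = education(sensors, out.immediate_action)
    return out
```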
  • the first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and a comparison image database.
  • the language model database stores language models.
  • a language model is a probabilistic model that gives a probability that an arbitrary character string is a Japanese sentence or the like.
  • the language model is, for example, an N-gram model, a hidden Markov model, a maximum entropy model, or the like.
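  • For illustration, the following toy bigram (N = 2) model shows how such a probability can be assigned to a token sequence; the miniature corpus and the add-one smoothing are illustrative assumptions, not details from the patent.

```python
# Tiny bigram language-model sketch: estimates the probability that a token
# sequence forms a sentence from counts, with add-one smoothing.
from collections import Counter

corpus = [["the", "robot", "tells", "a", "story"],
          ["the", "robot", "reads", "a", "story"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["<s>"] + sent + ["</s>"]
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

vocab = len(unigrams)

def sentence_probability(tokens):
    """P(sentence) = product of smoothed bigram probabilities."""
    tokens = ["<s>"] + tokens + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab)
    return p

print(sentence_probability(["the", "robot", "tells", "a", "story"]))
```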
  • the acoustic model database stores sound source models.
  • a sound source model is a model used to identify a sound source of a collected acoustic signal.
  • An acoustic feature amount is a feature amount calculated after transforming a collected sound signal into a signal in the frequency domain by performing a Fast Fourier Transform.
  • Acoustic features, for example a static Mel-Scale Log Spectrum (MSLS), a delta MSLS, and one delta power, are calculated every predetermined time (e.g., 10 ms).
  • MSLS is obtained by using a spectral feature amount as a feature amount for acoustic recognition and performing inverse discrete cosine transform on MFCC (Mel Frequency Cepstrum Coefficient).
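  • The following sketch computes comparable features (a static mel-scale log spectrum, its delta, and a delta power) every 10 ms; the use of librosa and the chosen parameter values are assumptions for illustration, not part of the patent.

```python
# Hedged sketch of MSLS-like acoustic features computed every 10 ms with
# librosa: static mel-scale log spectrum, delta MSLS, and one delta power.
import librosa
import numpy as np

def msls_features(path, n_mels=24, frame_ms=10):
    y, sr = librosa.load(path, sr=16000)
    hop = int(sr * frame_ms / 1000)                      # 10 ms frame shift
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512,
                                         hop_length=hop, n_mels=n_mels)
    static = librosa.power_to_db(mel)                    # static mel-scale log spectrum
    delta = librosa.feature.delta(static)                # delta MSLS
    power = librosa.power_to_db(np.sum(mel, axis=0, keepdims=True))
    delta_power = librosa.feature.delta(power)           # one delta power per frame
    return np.vstack([static, delta, delta_power])       # (2*n_mels + 1, num_frames)
```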
  • the dialogue corpus database stores the dialogue corpus.
  • the dialogue corpus is a corpus that is used when the storytelling robot 1 and the user have a dialogue, and is, for example, a scenario corresponding to the contents of the dialogue.
  • the comparison image database stores images used for pattern matching, for example.
  • Images used for pattern matching include, for example, images of the user, images of the user's family, images of the user's pets, and images of the user's friends and acquaintances.
  • the image feature amount is, for example, a feature amount extracted from an image of a person or an object by well-known image processing. Note that the example described above is just an example, and the first database 107 may store other data.
  • the second database 109 stores, for example, social constituent elements, social norms, data on psychology, and data on humanities.
  • the social components are, for example, age, sex, occupation, and relationships between multiple people (parents and children, couples, lovers, friends, acquaintances, co-workers, neighbors, teachers and students, etc.).
  • Social norms are rules and manners between individuals and multiple people, and are associated with speech, gestures, etc. according to age, gender, occupation, and relationships between multiple people.
  • Data related to psychology are, for example, data on findings obtained from past experiments and verifications (for example, attachment relationships between mothers and infants, complexes such as the Oedipus complex, conditioned reflexes, fetishism, etc.).
  • Data related to the humanities are, for example, data on religious rules, customs, national characteristics, regional characteristics, and characteristic acts, actions, and utterances of a country or region.
  • For example, the data may indicate that, in some regions, people express consent by nodding rather than saying it in words.
  • the humanities-related data is, for example, data on what is considered important and what is prioritized depending on the country or region.
  • FIG. 14 is a flow chart showing an example of the procedure of the story-telling process of the story-telling robot 1 according to this embodiment.
  • Step S1: The generation unit 144 generates the content to be used for story-telling.
  • the communication device 100 may acquire content generated by an external device via the receiving unit 101, for example.
  • Step S2: The recognition generation unit 140 acquires sensor information (an image captured by the imaging unit 102, an acoustic signal picked up by the sound pickup unit 103, and detection values detected by the sensor 104).
  • Step S3: The face recognition unit 141 of the recognition generation unit 140 detects an image including the listener's face from the captured image and recognizes the listener's face. If the listener's face image is not registered in the third database 150, the recognition generation unit 140 acquires the listener's name by, for example, talking to the listener and having the listener say his or her name.
  • Step S4: The generation unit 144 outputs an image to be displayed on the image display device 7 and an acoustic signal to be output by the image display device 7 based on the acquired content. Subsequently, the action generator 110 starts the story-telling using the acquired content and the listener's name.
  • Step S5: The recognition generation unit 140 again acquires sensor information (an image captured by the imaging unit 102, an acoustic signal picked up by the sound pickup unit 103, and detection values detected by the sensor 104).
  • Step S6: The face recognition unit 141 recognizes the orientation and facial expression of the listener from the captured image and the detection values detected by the sensor 104.
  • the gesture recognition unit 142 detects the listener's motion from the captured image and the detection value detected by the sensor 104 .
  • the speech recognition unit 143 performs speech recognition on the acoustic signal picked up by the sound pickup unit 103 .
  • Step S7: The generation unit 144 generates facial expression information and motion information of the story-telling robot 1 based on the acquired content and the acquired sensor information. Subsequently, based on the generated facial expression and motion information, the motion generation unit 110 generates images to be displayed on the display units 111a and 111b corresponding to the eyes of the story-telling robot 1, an image to be displayed on the display unit 111c corresponding to the mouth, an audio signal to be output from the speaker 112, and a drive signal to drive the actuator 113.
  • Step S8: The recognition generation unit 140 determines whether or not the story-telling has ended.
  • the end of the story-telling is not limited to the end of the content, and may be ended even in the middle of the content depending on the reaction of the listener. Moreover, if the content is long, the recognition generation unit 140 may end the content even in the middle of the content, depending on the reading time or the reaction of the listener. If the recognition generation unit 140 determines that the story-telling has ended (step S8; YES), it ends the process. When the recognition generation unit 140 determines that the story-telling has not ended (step S8; NO), the process returns to step S5.
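  • As a rough illustration only, the procedure of FIG. 14 can be summarized as the following control loop; the component interfaces are placeholders, since the embodiment defines the units only functionally.

```python
# Minimal sketch of the FIG. 14 procedure (steps S1 to S8) as a control loop.
# The component interfaces are assumed placeholders, not the actual API.
import time

def storytelling_loop(generation_unit, recognizers, action_generator, sensors, content):
    face, gesture, speech = recognizers

    # S1: create (or acquire) the content used for story-telling.
    story = generation_unit.prepare(content)

    # S2-S3: read the sensors and recognize the listener's face (ask the name if unknown).
    frame, audio, values = sensors.read()
    listener = face.recognize(frame) or face.register_by_asking(audio)

    # S4: start story-telling with the content and the listener's name.
    action_generator.start(story, listener)

    while True:
        # S5: acquire the sensor information again.
        frame, audio, values = sensors.read()
        # S6: recognize orientation/expression, motion, and speech of the listener.
        state = {
            "face": face.recognize_state(frame, values),
            "gesture": gesture.detect(frame, values),
            "speech": speech.recognize(audio),
        }
        # S7: generate expression/motion information and drive displays, speaker, actuator.
        expression, motion = generation_unit.step(story, state)
        action_generator.render(expression, motion)
        # S8: finish when the content ends or the listener's reaction calls for it.
        if generation_unit.finished(story, state):
            break
        time.sleep(0.1)   # tick at a fixed rate
```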
  • The processing procedure shown in FIG. 14 is an example and is not limited to this; some processes may be performed in parallel, and the order may be changed. The listener's name may also be acquired after the story-telling has started.
  • the images and sounds of the content displayed on the image display device 7 may be changed in accordance with the reaction of the listener during the reading.
  • the content may be multiple branching stories, and two or more pieces of music may be provided along with the progress of the story.
  • Such content may be generated by a creator or the like operating the generation unit 144 having the configuration shown in FIGS. 7 to 11, for example.
  • In such cases, the generation unit 144 may change at least one of the image to be displayed on the image display device 7 during the story-telling, the acoustic signal to be output by the image display device 7 during the story-telling, and the voice signal, facial expression data, and action data of the story-telling robot 1 during the story-telling.
  • FIG. 15 is a diagram showing a first evaluation result example.
  • items are defined as follows.
  • "Content": the minimum requirement for evaluation is that the content of the story is conveyed in a simple manner.
  • "Persuasiveness": the point of evaluation is whether the telling is persuasive and credible as a whole.
  • "Realism": the point of evaluation is how essential the elements contained in the story feel, and the overall sense that the story is brought to life through the delivery.
  • "Interactivity": the point of evaluation is whether the subject feels able to participate in the story-telling process.
  • The story-telling robot 1 performs not only simple movements but also meaningful actions accompanied by socially meaningful multimodal behavior. Therefore, rather than examining the causal relationship between random movements and attention, the impact of socially relevant robot communication affordances was examined.
  • FIG. 16 is a diagram showing a second evaluation result example.
  • the horizontal axis is discrete time and the vertical axis is head orientation (1 or 0).
  • Graph g401 is an example of the temporal change in listener's attention allocation when the image display device 7 and storytelling robot 1 are used in "Agency ON”.
  • a graph g411 is an example of the change over time of the listener's attention distribution when the image display device 7 and the story-telling robot 1 are used in the "agency off" mode.
  • a line g402 indicates the temporal change in listener's attention distribution when the image display device 7 and storytelling robot 1 are used in "Agency ON".
  • a line g403 is for reference and shows the time change of listener's attention allocation when only the image display device 7 is used and the story-telling robot 1 is not placed in front of the image display device 7.
  • A line g412 shows the temporal change in the listener's attention distribution when the image display device 7 and the story-telling robot 1 are used in the "agency off" state.
  • As described above, the story-telling robot 1 can perform story-telling with rich communication by performing actions, backtracking, or inserting questions. According to this embodiment, meaningful interaction between a human and a robot is realized and is well accepted by users.
  • The approach of the present embodiment maintains long-term attention and greater acceptance, and can inspire future creative content designs for social robots. Also, according to the present embodiment, using an embodied agent such as a robot for story-telling with "agency on", leveraging its communication affordances (embodiment, expressiveness, and other modalities), can better maintain the listener's interest.
  • a materialized agent such as a robot
  • the story-telling robot 1 is not limited to story-telling, and similar effects can be obtained by playing the role of a "facilitator" that supports story-telling.
  • FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device 100 of the embodiment.
  • a recognition result 201 is an example of a result recognized by the recognition unit 105 .
  • the recognition result 201 is, for example, an interpersonal relationship, an interpersonal mutual relationship, or the like.
  • the multimodal learning and understanding 211 is an example of learning content performed by the learning unit 108 .
  • the learning method 212 is machine learning or the like.
  • the learning object 213 is social constituent elements, social models, psychology, humanities, and the like.
  • Social abilities 221 are social skills such as empathy, individualization, adaptability, and emotional affordance.
  • FIG. 18 is a diagram showing an example of data recognized by the recognition unit 105 according to the embodiment.
  • personal data 301 and interpersonal relationship data 351 are recognized as shown in FIG.
  • Personal data concerns behavior that occurs in a single person: data acquired by the imaging unit 102 and the sound pickup unit 103, and data obtained by performing voice recognition processing, image recognition processing, and the like on the acquired data.
  • Personal data includes, for example, voice data, semantic data resulting from voice processing, voice volume, voice inflection, uttered words, facial expression data, gesture data, head posture data, face direction data, line-of-sight data, co-occurrence expression data, physiological information (body temperature, heart rate, pulse rate, etc.), and the like.
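As a concrete picture of how such personal data might be held in software, the sketch below defines a hypothetical container for these features; the field names and types are illustrative assumptions and do not appear in the publication.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PersonalData:
    """Hypothetical per-person feature record built from audio and images."""
    voice_volume: Optional[float] = None         # e.g. RMS level
    voice_inflection: Optional[float] = None     # e.g. pitch variance
    uttered_words: List[str] = field(default_factory=list)
    utterance_meaning: Optional[str] = None      # result of speech understanding
    facial_expression: Optional[str] = None      # e.g. "smile"
    gesture: Optional[str] = None                # e.g. "wave", "point"
    head_posture: Optional[Tuple[float, float, float]] = None  # (yaw, pitch, roll)
    face_direction: Optional[Tuple[float, float]] = None
    gaze_direction: Optional[Tuple[float, float]] = None
    body_temperature: Optional[float] = None     # physiological information
    heart_rate: Optional[float] = None
    pulse_rate: Optional[float] = None
```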
  • The data to be used may be selected, for example, by the designer of the story-telling robot 1. In this case, the designer of the story-telling robot 1 may, for example, determine which features of the personal data are important in communication from actual communication or a demonstration between two persons.
  • the recognition unit 105 recognizes the user's emotion as personal data based on the information extracted from each of the acquired speech and images. In this case, the recognition unit 105 performs the recognition based on, for example, the loudness and intonation of the voice, the utterance duration, the facial expression, and the like. The story-telling robot 1 of the embodiment then acts so as to keep the user's emotions positive and to maintain a good relationship with the user.
  • the recognition unit 105 estimates the user's nationality, hometown, etc., based on the acquired speech and image, and the data stored in the first database 107 .
  • the recognizing unit 105 extracts the user's life schedule such as wake-up time, going-out time, returning home time, bedtime, etc., based on the acquired utterances and images and the data stored in the first database 107 .
  • the recognition unit 105 estimates the user's sex, age, occupation, hobbies, career, preferences, family structure, religion, degree of attachment to the story-telling robot 1, and the like, based on the acquired utterances, images, and life schedule and the data stored in the first database 107.
  • the story-telling robot 1 updates the information about the user's social background based on the conversation, the image, and the data stored in the first database 107 .
  • the social background and the degree of attachment to the story-telling robot 1 are not limited to items that can be entered directly, such as age, gender, and career; they are also recognized from the volume and intonation of the voice, the flow of the conversation, and its topics. In this way, the recognition unit 105 learns things that the user is not aware of, based on daily conversations and facial expressions during conversations.
  • Interpersonal relationship data is data related to the relationships between the user and other people.
  • the interpersonal relationship data includes, for example, the distance between people, whether or not the eyes of the people who are having a conversation meet each other, the inflection of the voice, the loudness of the voice, and the like.
  • the distance between people varies depending on the interpersonal relationship.
  • for example, the distance for couples or friends is L1, while the distance between business associates is L2, which is larger than L1.
  • Such personal data, interpersonal relationship data, and information about the user's social background are stored in the first database 107 or the storage unit 106.
  • the recognition unit 105 collects and learns personal data for each user, and estimates the social background for each person.
  • the social background may also be obtained, for example, via a network and the receiving unit 101. In that case, the user may input his or her social background or select items using, for example, a smartphone.
  • the recognition unit 105 estimates the distance (interval) between the people communicating with each other based on the acquired utterances, images, and data stored in the first database 107 .
  • the recognition unit 105 detects whether or not the lines of sight of the people communicating with each other meet, based on the acquired utterances, images, and data stored in the first database 107.
  • based on the acquired utterances and the data stored in the first database 107, the recognition unit 105 estimates relationships such as friends, co-workers, relatives, and parent and child from the content of the utterances, the loudness of the voice, the inflection of the voice, received and transmitted e-mails, and the other parties of those e-mails.
  • the recognition unit 105 may, for example, randomly select one of several combinations of initial values of social backgrounds and personal data stored in the first database 107 and start communication with it. Then, if it is difficult to continue communication with the user because of the behavior generated from the randomly selected combination, the recognition unit 105 may reselect another combination.
  • the learning unit 108 performs learning using the personal data 301 and interpersonal relationship data 351 recognized by the recognition unit 105 and data stored in the second database 109 .
  • a relationship with a person at a distance of 0 to 50 cm is an intimate relationship
  • a relationship with a person at a distance of 50 to 1 m is a personal relationship
  • a relationship with a distance of 1 to 4 m from a person is a social relationship
  • a relationship with a distance of 4 m or more from a person is a public relationship.
  • During learning, such social norms are used as rewards (implicit rewards) based on whether gestures and utterances conform to them.
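A minimal sketch of how these proxemic zones could be turned into an implicit reward is shown below. The distance thresholds follow the list above, while the relationship-to-zone mapping and the reward values are illustrative assumptions.

```python
def proxemic_zone(distance_m: float) -> str:
    """Classify an interpersonal distance (in meters) into a proxemic zone."""
    if distance_m < 0.5:
        return "intimate"
    if distance_m < 1.0:
        return "personal"
    if distance_m < 4.0:
        return "social"
    return "public"

def implicit_reward(distance_m: float, relationship: str) -> float:
    """Reward +1 when the kept distance matches the known relationship
    (a social norm), -1 otherwise. The mapping and values are illustrative."""
    expected = {"couple": "intimate", "friend": "personal",
                "business": "social", "stranger": "public"}
    target = expected.get(relationship, "social")
    return 1.0 if proxemic_zone(distance_m) == target else -1.0

print(proxemic_zone(0.8))              # "personal"
print(implicit_reward(0.8, "friend"))  # 1.0
```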
  • the interpersonal relationship may be set according to the environment of use and the user by setting the feature values of the reward during learning. Specifically, multiple intimacy settings may be defined, such as a rule not to talk to people who are uncomfortable with robots and a rule to actively talk to people who like robots. Then, in a real environment, the recognition unit 105 may recognize which type the user is from the result of processing the user's utterances and images, and the learning unit 108 may select the corresponding rule.
  • the human trainer may evaluate the behavior of the story-telling robot 1 and provide a reward (implicit reward) according to the social structure and norms that he/she knows.
  • FIG. 19 is a diagram illustrating an example of an agent creation method used by the action generator 110 according to the embodiment.
  • the area indicated by reference numeral 300 is a diagram showing the flow from input to creation of an agent and output (agent).
  • the information 310 captured by the imaging unit 102 and picked up by the sound pickup unit 103 is information about people (the user, people related to the user, and other people) and environmental information around them.
  • Raw data 302 acquired by the imaging unit 102 and the sound collecting unit 103 is input to the recognition unit 105 .
  • the recognition unit 105 extracts and recognizes a plurality of pieces of information from the input raw data 302 (voice volume, voice intonation, utterance content, uttered words, the user's line of sight, the user's head posture, the user's face orientation, the user's physiological information, the distance between people, whether or not people's lines of sight intersect, and so on).
  • the recognition unit 105 uses a plurality of pieces of extracted and recognized information to perform multimodal understanding using, for example, a neural network.
  • the recognition unit 105 identifies an individual based on, for example, at least one of an audio signal and an image, and assigns identification information (ID) to the identified individual.
  • the recognition unit 105 recognizes the motion of each identified person based on at least one of the audio signal and the image.
  • the recognition unit 105 recognizes the line of sight of the identified person by, for example, performing well-known image processing and tracking processing on the image.
  • the recognition unit 105 performs, for example, speech recognition processing (sound source identification, sound source localization, sound source separation, speech segment detection, noise suppression, etc.) on a speech signal to recognize speech.
  • the recognition unit 105 recognizes the head posture of the identified person by, for example, performing well-known image processing on the image. For example, when two people are photographed in the photographed image, the recognition unit 105 recognizes the interpersonal relationship based on the speech content, the distance between the two persons in the photographed image, and the like.
  • the recognition unit 105 recognizes (estimates) the social distance between the story-telling robot 1 and the user, for example, according to the result of processing each of the captured image and the collected sound signal.
  • the learning unit 108 performs reinforcement learning 304 rather than deep learning. The reinforcement learning involves learning to select the most relevant features (including social constructs and social norms). In this case, the multiple pieces of information used in multimodal understanding are used as input features. Inputs to the learning unit 108 are, for example, the raw data itself, the name ID (identification information), facial information, recognized gestures, keywords from the voice, and the like.
  • the output of the learning unit 108 is the behavior of the story-telling robot 1.
  • the output behavior may be anything defined according to the purpose, such as voice responses, robot routines, or the angle by which the robot should rotate.
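To make this reinforcement-learning setup more concrete, the following is a minimal tabular Q-learning sketch in which the state is a small tuple of multimodal features and the actions are robot behaviors. The feature values and action names are assumptions for illustration; the publication does not specify the learning algorithm at this level of detail.

```python
import random
from collections import defaultdict

ACTIONS = ["voice_response", "nod_routine", "smile_routine", "rotate_30_deg"]

class BehaviorAgent:
    """Minimal tabular Q-learning over discretized multimodal features."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)  # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def select_action(self, state):
        # Epsilon-greedy choice between exploring and the best known action.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        td_target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (td_target - self.q[(state, action)])

# Example state: (person ID, recognized gesture, keyword from speech).
agent = BehaviorAgent()
state = ("id2", "wave", "hello")
action = agent.select_action(state)
agent.update(state, action, reward=1.0, next_state=("id2", "none", "story"))
```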
  • a neural network or the like may be used for detection. In this case, different modalities of the body may be used to detect human activity.
  • which feature to use may be selected in advance by, for example, the designer of the storytelling robot 1 .
  • by using implicit and explicit rewards during learning, social models and social constructs can be incorporated.
  • the output of the reinforcement learning is the agent 305 .
  • the agent used by the action generator 110 is created.
  • the area indicated by reference numeral 350 is a diagram showing how the reward is used.
  • Implicit rewards 362 are used to learn implicit responses.
  • the raw data 302 includes the user's reactions, and the multimodal understanding 303 of this raw data 302 is described above.
  • the learning unit 108 generates an implicit reaction system 372 using the implicit reward 362 and the social model etc. stored in the second database 109 .
  • the implicit reward may be obtained by reinforcement learning or may be given by a human.
  • the implicit reaction system may also be a model acquired through learning.
  • a human trainer evaluates the behavior of the story-telling robot 1 and gives a reward 361 according to the social structure and social norms that he/she knows.
  • the agent adopts the action that maximizes the reward for the input.
  • the agent adopts behaviors (utterances and gestures) that maximize positive feelings toward the user.
  • Learning unit 108 uses this explicit reward 361 to generate explicit reaction system 371 .
  • the explicit response system may be a model acquired through learning.
  • the explicit reward may be given by the user by evaluating the behavior of the story-telling robot 1. For example, the reward may be estimated based on whether or not the action desired by the user is taken.
  • Learning unit 108 outputs agent 305 using these learning models during operation.
  • explicit rewards, which are user reactions, are prioritized over implicit rewards. This is because the user's reaction is more reliable in communication.
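One simple way to express this priority is to let the explicit reward dominate whenever a user reaction is available and to fall back on the implicit reward otherwise; the weighting used below is an illustrative assumption.

```python
from typing import Optional

def combined_reward(implicit: float, explicit: Optional[float]) -> float:
    """Prefer the explicit (user-reaction) reward when it exists; otherwise
    fall back on the implicit (social-norm) reward."""
    if explicit is not None:
        # The user's reaction is treated as more reliable, so it dominates.
        return 0.8 * explicit + 0.2 * implicit
    return implicit

print(combined_reward(implicit=0.5, explicit=None))   # 0.5
print(combined_reward(implicit=0.5, explicit=-1.0))   # -0.7
```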
  • the learning means performs learning using an implicit reward and an explicit reward, wherein the implicit reward is a reward learned multimodally using the feature information about the person, and the explicit reward is a reward based on the result of evaluating the behavior toward the person generated by the action generating means of the communication device.
  • the story-telling robot 1 of the embodiment is provided with a sound pickup unit that picks up an acoustic signal and an imaging unit that captures an image including the user, and the recognition means performs speech recognition processing on the picked-up acoustic signal to extract feature information about the voice, and performs image processing on the captured image to extract feature information about the human behavior included in the image. The feature information about the voice is at least one of the voice signal, voice volume information, voice inflection information, and the meaning of the utterance, and the feature information about the human behavior is at least one of the person's facial expression information, gesture information, head posture information, face direction information, line-of-sight information, and the distance between people.
  • a program for realizing all or part of the functions of the story-telling robot 1 and the recognition generation unit 140 in the present invention may be recorded on a computer-readable recording medium, and all or part of the processing of the story-telling robot 1 and the recognition generation unit 140 may be performed by having a computer system read and execute the program recorded on this recording medium.
  • the "computer system” referred to here includes hardware such as an OS and peripheral devices.
  • the "computer system” includes a WWW system provided with a home page providing environment (or display environment).
  • the term “computer-readable recording medium” refers to portable media such as flexible discs, magneto-optical discs, ROMs and CD-ROMs, and storage devices such as hard discs incorporated in computer systems.
  • the term "computer-readable recording medium" also includes a volatile memory (RAM) inside a computer system acting as a server or client, which holds the program for a certain period of time when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
  • the program may be transmitted from a computer system storing this program in a storage device or the like to another computer system via a transmission medium or by transmission waves in a transmission medium.
  • the "transmission medium” for transmitting the program refers to a medium having a function of transmitting information, such as a network (communication network) such as the Internet or a communication line (communication line) such as a telephone line.
  • the program may be for realizing part of the functions described above. Further, it may be a so-called difference file (difference program) that can realize the above-described functions in combination with a program already recorded in the computer system.
  • Reference Signs List: 1 story-telling robot, 100 communication device, 101 receiving unit, 102 imaging unit, 103 sound pickup unit, 104 sensor, 105 recognition unit, 106 storage unit, 107 first database, 108 learning unit, 109 second database, 110 action generation unit, 111 display unit, 112 speaker, 113 actuator, 114 transmission unit, 140 recognition generation unit, 141 face recognition unit, 142 gesture recognition unit, 143 voice recognition unit, 144 generation unit, 150 third database, 500 content, 1101 image generator, 1102 voice generator, 1103 drive unit, 1104 transmission information generator, 1441 narrator/agency unit, 1442 engagement unit, 1443 education unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Business, Economics & Management (AREA)
  • Mechanical Engineering (AREA)
  • Robotics (AREA)
  • Educational Administration (AREA)
  • Educational Technology (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Toys (AREA)

Abstract

This storytelling information creation device comprises a generation unit that generates an image to be displayed on an image display device during storytelling, an acoustic signal to be outputted to the image display device during storytelling, and a voice signal, facial expression data, and motion data for a storytelling robot during storytelling, by using a photographed image of a listener, a picked-up voice signal of the listener's voice, the result of detecting the listener's posture, an expression routine of the robot, image media, and an acoustic signal.

Description

Story-telling information creation device, story-telling robot, story-telling information creation method, and program
The present invention relates to a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program.
This application claims priority based on Japanese Patent Application No. 2021-130727 filed on August 10, 2021, the content of which is incorporated herein.
Storytelling is an inherently social activity that allows us humans to connect with others, find meaning in life, construct socially shared meanings, transmit information from person to person, and pass knowledge on to the next generation. Cognitive psychologist Jerome Bruner points out that storytelling serves the dual purpose of making meaning: making the strange familiar and defining the individual and collective self.
As one form of story-telling, there are audiobooks in which, for example, a narrator reads a book aloud. Audiobooks are provided on media such as cassette tapes and CDs (compact discs), and are also offered through distribution services over the Internet. A document read-aloud support device has been proposed that supports such reading by estimating an utterance style in consideration of the context (see, for example, Patent Document 1).
Patent Document 1: JP 2015-215626 A
However, conventional audiobooks use the same audio data every time, so users become bored when they are played repeatedly. Even when a conventional read-aloud device estimates an utterance style that takes the context into consideration, users become bored after repeated use. Furthermore, some users become bored in the middle of a single story.
Aspects of the present invention have been made in view of the above problems, and an object thereof is to provide a story-telling information creation device, a story-telling robot, a story-telling information creation method, and a program that can adapt story-telling according to the user.
In order to solve the above problems, the present invention employs the following aspects.
(1) A story-telling information creation device according to an aspect of the present invention includes a generation unit that uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, an expression routine of a robot, image media, and an acoustic signal to generate an image to be displayed on an image display device during story-telling, an acoustic signal to be output by the image display device during the story-telling, and a voice signal, facial expression data, and action data of a story-telling robot during the story-telling.
(2) In the above aspect (1), the generation unit may include a narrator unit for advancing the narration; an agency unit that takes the expressiveness of the story-telling robot into account, activates expressive routines, and supports multimedia; an engagement unit that creates immediate, person-responsive actions such as keeping the story-telling robot's gaze on a specific person; and an education unit that sets questions and answers.
(3) In the above aspect (1) or (2), the generation unit may be configured as a behavior tree (Behaviour Trees) structure.
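Aspect (3) states that the generation unit may be organized as a behavior tree. As a rough sketch of the idea (not the actual implementation of the publication), a behavior tree with sequence and fallback nodes over narrator, agency, engagement, and education behaviors could look like the following.

```python
class Node:
    def tick(self) -> bool:
        raise NotImplementedError

class Sequence(Node):
    """Succeeds only if all children succeed, ticked in order."""
    def __init__(self, *children): self.children = children
    def tick(self): return all(c.tick() for c in self.children)

class Fallback(Node):
    """Succeeds as soon as one child succeeds."""
    def __init__(self, *children): self.children = children
    def tick(self): return any(c.tick() for c in self.children)

class Action(Node):
    def __init__(self, name, fn): self.name, self.fn = name, fn
    def tick(self):
        ok = self.fn()
        print(f"{self.name}: {'success' if ok else 'failure'}")
        return ok

# Illustrative leaf behaviors (placeholders for the four roles).
narrate    = Action("narrator: speak next passage", lambda: True)
project    = Action("agency: show scene and play music", lambda: True)
keep_gaze  = Action("engagement: keep gaze on listener", lambda: True)
ask_quiz   = Action("education: ask a question", lambda: False)
react_only = Action("engagement: react with a gesture", lambda: True)

story_step = Sequence(project, narrate, keep_gaze, Fallback(ask_quiz, react_only))
story_step.tick()
```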
(4) In any one of the above aspects (1) to (3), the generation unit may recognize the state of the listener based on at least one of the image of the listener, the audio signal obtained by picking up the listener's voice, and the result of detecting the listener's posture, and, based on the recognition result, change at least one of the image to be displayed on the image display device during the story-telling, the acoustic signal to be output by the image display device during the story-telling, and the voice signal, facial expression data, and action data of the story-telling robot during the story-telling.
(5) In any one of the above aspects (1) to (4), the facial expression data of the story-telling robot may be an image corresponding to the eyes and an image corresponding to the mouth, and the action data of the story-telling robot may be data relating to the motion of the portion corresponding to the neck.
(6) In any one of the above aspects (1) to (5), the image display device and the story-telling robot may be arranged close to each other.
(7) A story-telling robot according to an aspect of the present invention includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display unit corresponding to eyes, a second display unit corresponding to a mouth, a movable part corresponding to a neck, and action generating means for performing the story-telling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the voice signal, facial expression data, and action data of the story-telling robot generated by any one of the above aspects (1) to (6).
(8) A story-telling robot according to an aspect of the present invention includes an imaging unit that captures an image, a sound pickup unit that picks up an acoustic signal, a sensor that detects the state of the listener, a first display unit corresponding to eyes, a second display unit corresponding to a mouth, a movable part corresponding to a neck, the story-telling information creation device of any one of the above aspects (1) to (6), and action generating means for performing the story-telling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the voice signal, facial expression data, and action data of the story-telling robot generated by that aspect.
(9) In the above aspect (7) or (8), the story-telling robot may perform the story-telling for the listener, or may facilitate the story-telling for the listener.
(10) In any one of the above aspects (7) to (9), the robot may further include recognition means that acquires human information about a person, extracts feature information about the person from the acquired human information, recognizes interactions occurring between the communicating device and the person, and recognizes interactions occurring between people; and learning means that multimodally learns the person's emotional interactions using the extracted feature information about the person; and the action generating means may generate behavior based on the learned emotional interaction information of the person.
(11) In a story-telling information creation method according to an aspect of the present invention, a generation unit uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, an expression routine of a robot, image media, and an acoustic signal to generate an image to be displayed on an image display device during story-telling, an acoustic signal to be output by the image display device during the story-telling, and a voice signal, facial expression data, and action data of a story-telling robot during the story-telling.
(12) A program according to an aspect of the present invention causes a computer to use an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, an expression routine of a robot, image media, and an acoustic signal to generate an image to be displayed on an image display device during story-telling, an acoustic signal to be output by the image display device during the story-telling, and a voice signal, facial expression data, and action data of a story-telling robot during the story-telling.
According to the above aspects (1) to (12), the story-telling can be adapted to the user. Further, according to the above aspects (1) to (12), the user's attention can be maintained.
FIG. 1 is a diagram showing an example of communication by the story-telling robot according to the embodiment.
FIG. 2 is a diagram showing an example of the outer shape of the story-telling robot according to the embodiment.
FIG. 3 is a diagram showing an example of story-telling by the story-telling robot while the video of the story is projected on the image display device.
FIG. 4 is a diagram showing examples of images displayed on the image display device during story-telling according to the embodiment and of the actions and facial expressions of the story-telling robot.
FIG. 5 is a diagram showing examples of images displayed on the image display device during story-telling according to the embodiment and of the actions and facial expressions of the story-telling robot.
FIG. 6 is a diagram showing an example of perceptual elements.
FIG. 7 is a diagram showing an implementation example of story-telling according to the embodiment.
FIG. 8 is a diagram showing an example of a narrator node.
FIG. 9 is a diagram showing an example of an agency node.
FIG. 10 is a diagram showing an example of an engagement node.
FIG. 11 is a diagram showing an example of a behavior tree configuration for story-telling.
FIG. 12 is a block diagram showing a configuration example of the story-telling robot according to the embodiment.
FIG. 13 is a diagram showing a configuration example of the recognition generation unit according to the embodiment.
FIG. 14 is a flowchart showing an example of the procedure of the story-telling process of the story-telling robot according to the embodiment.
FIG. 15 is a diagram showing a first evaluation result example.
FIG. 16 is a diagram showing a second evaluation result example.
FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device of the embodiment.
FIG. 18 is a diagram showing an example of data recognized by the recognition unit according to the embodiment.
FIG. 19 is a diagram showing an example of the agent creation method used by the action generation unit according to the embodiment.
Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the drawings used for the following description, the scale of each member is appropriately changed so that each member has a recognizable size.
<Overview>
FIG. 1 is a diagram showing a communication example of the story-telling robot 1 according to the present embodiment. As shown in FIG. 1, the story-telling robot 1 performs story-telling according to, for example, the user's state or situation (g17). In the following description, the "story-telling robot" is also referred to simply as the "robot". The story-telling robot 1 communicates with an individual or with a plurality of people 2. Communication mainly consists of dialogue g11 and gestures g12 (movements). Movements are expressed by images displayed on the display units in addition to actual movements. When an e-mail is sent to the user via the Internet or the like, the story-telling robot 1 receives the e-mail and informs the user that it has arrived and of its content (g13). When, for example, a reply to the e-mail is needed, the story-telling robot 1 communicates with the user about whether advice is required and makes a proposal g14. The story-telling robot 1 then sends the reply (g15). In addition, the story-telling robot 1 presents a weather forecast g16 for a location according to the user's schedule, for example for the scheduled date, time, and place.
In this way, the story-telling robot 1 of the present embodiment generates social abilities so that an emotional connection can be formed between the robot and a person, and can communicate with people in accordance with, for example, their reactions and actions. The story-telling robot 1 also allows the person and the robot to empathize and communicate at an emotional level. The story-telling robot 1 then adapts its story-telling to the situation of the person being read to.
<Outer shape example of the story-telling robot 1>
Next, an example of the outer shape of the story-telling robot 1 will be described.
FIG. 2 is a diagram showing an example of the outer shape of the story-telling robot 1 according to the present embodiment. In the example of the front view g101 and the side view g102 of FIG. 2, the story-telling robot 1 has three display units 111 (111a, 111b, 111c). In the example of FIG. 2, the imaging unit 102a is attached above the display unit 111a, and the imaging unit 102b is attached above the display unit 111b. The display units 111a and 111b correspond to human eyes and present eye images and image information. The screen size of the display units 111a and 111b is, for example, 3 inches. The speaker 112 is attached to the housing 120 in the vicinity of the display unit 111c, which displays an image corresponding to a human mouth. The display unit 111c is composed of, for example, a plurality of LEDs (light emitting diodes), each of which is addressable and can be individually turned on and off. The sound pickup unit 103 is attached to the housing 120.
The story-telling robot 1 also includes a boom 121. The boom 121 is movably attached to the housing 120 via a movable part 131. A horizontal bar 122 is rotatably attached to the boom 121 via a movable part 132.
The display unit 111a is rotatably attached to the horizontal bar 122 via a movable part 133, and the display unit 111b is rotatably attached via a movable part 134. The outer shape of the story-telling robot 1 shown in FIG. 2 is an example and is not limited to this.
With such a configuration, the story-telling robot 1 has, for example, five degrees of freedom of motion (base rotation, neck leaning, eye stroke, eye rotation, and eye tilt), enabling expressive movements. The story-telling robot 1 can communicate via speech synthesis (TTS), animation routines (open-loop combinations of movements, sounds, and eye expressions that convey specific emotions), a projected screen, and so on.
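To make the five degrees of freedom and the open-loop animation routines more concrete, the following sketch represents a routine as a timed sequence of joint targets. The joint names, values, and the actuator interface are hypothetical and are not the robot's actual API.

```python
import time

# Hypothetical joint names for the five degrees of freedom.
JOINTS = ("base_rotation", "neck_lean", "eye_stroke", "eye_rotation", "eye_tilt")

def send_pose(pose: dict) -> None:
    """Placeholder for the real actuator interface."""
    print("pose:", pose)

def play_routine(routine) -> None:
    """Play an open-loop routine: a list of (pose, duration_s) steps."""
    for pose, duration in routine:
        send_pose(pose)
        time.sleep(duration)

# A "surprised" routine: lean back, widen the eye tilt, then settle.
surprised = [
    ({"neck_lean": -10, "eye_tilt": 15}, 0.3),
    ({"neck_lean": -5, "eye_tilt": 5}, 0.4),
    ({"neck_lean": 0, "eye_tilt": 0}, 0.3),
]
play_routine(surprised)
```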
<Example of story-telling by the story-telling robot 1>
Next, an example of story-telling by the story-telling robot 1 will be described with reference to FIGS. 3 to 5. FIG. 3 is a diagram showing an example of story-telling by the story-telling robot 1 while the video of the story is projected on the image display device 7. As in the example of FIG. 3, the story-telling robot 1 is placed, for example, on the table Tab and in front of the image display device 7. The story-telling robot 1 may also be placed beside the image display device 7 or the like. That is, the image display device 7 and the story-telling robot 1 are arranged close to each other. The image displayed on the image display device 7 may be a still image or a moving image. The image also includes music matched to the image. The image display device 7 has a speaker and reproduces the music included in the video from the speaker.
FIGS. 4 and 5 are diagrams showing examples of images displayed on the image display device 7 during story-telling according to the present embodiment and of the actions and facial expressions of the story-telling robot 1. In FIG. 4, image g201 is a first example of an image displayed on the image display device 7, and images g203 and g205 are examples of the actions and facial expressions of the story-telling robot 1 during story-telling. For example, image g203 shows the action and facial expression of being surprised, and image g205 shows the action and facial expression of being disappointed. In FIG. 5, image g211 is a second example of an image displayed on the image display device 7, and images g213 and g215 are examples of the actions and facial expressions of the story-telling robot 1 during story-telling. For example, image g213 shows the action and facial expression of being happy, and image g215 shows the action and facial expression of being interested.
As shown in FIGS. 4 and 5, during story-telling the story-telling robot 1 moves the parts corresponding to the neck and eyes according to the content of the story and the state of the user Hu, and changes the images displayed on the display units corresponding to the eyes and the mouth, thereby changing its actions and facial expressions. The images shown in FIGS. 4 and 5 and the facial expressions and actions of the story-telling robot 1 are examples and are not limited to these.
The story-telling robot 1 may also utter exclamations such as "Oh!" or "My!" in accordance with its facial expressions and actions. These facial expressions and actions of the story-telling robot 1 are designed to enhance persuasiveness and interactivity, and reflect a realistic repertoire of gestures that a human narrator would make. Facial expressions include, for example, eye movements, eye expressions, and mouth expressions produced by the shape of the mouth. Actions include, as described above, neck movements, movements of the appearance of the eyes, and so on.
<Perceptual functions of the story-telling robot 1>
Next, an example of the perceptual functions of the story-telling robot 1 will be described. In addition to the five degrees of freedom of motion described above, the story-telling robot 1 recognizes its environment using vision and hearing. The story-telling robot 1 acquires visual information using data captured by its imaging unit 102. The imaging unit 102 includes, for example, an RGB camera and a depth sensor or a distance sensor. The story-telling robot 1 also acquires auditory information using the acoustic signals picked up by its sound pickup unit 103. The sound pickup unit 103 is a microphone array including a plurality of microphones.
FIG. 6 is a diagram showing an example of perceptual elements.
As shown in image g301, the story-telling robot 1 perceives the voice direction, speech recognition results, the listener (user), the direction of the listener's head, the direction of the listener's body, the positions of the hands, and so on.
Image g304 is an image captured by the imaging unit 102 of the story-telling robot 1, that is, the scene viewed from the robot.
Image g302 is an image in which the region including the listener's face has been extracted from image g304.
Image g303 is the result of performing face recognition on image g302 by a well-known method. In this example, the listener's name is Fred, and 2 is assigned as the identification information (ID). If the listener's face image and name are not registered in the third database 150 (FIG. 12) of the story-telling robot 1, the story-telling robot 1, for example, speaks to the listener to ask for the name, acquires the name from the audio signal of the reply, and registers it in the third database 150.
The face recognition unit 141 (FIG. 13) performs face recognition, the gesture recognition unit 142 (FIG. 13) performs gesture recognition, and the voice recognition unit 143 (FIG. 13) performs speech recognition, estimation of the sound source direction, and the like.
The story-telling robot 1 acquires position information using the data captured by the imaging unit 102. The story-telling robot 1 then obtains the poses of the different parts of a person (limbs, body, head), for example, using well-known skeleton detection (see, for example, Reference 1). Furthermore, in addition to skeleton detection, the story-telling robot 1 detects various gestures such as hand-waving and pointing gestures by a well-known gesture detection method (for example, Japanese Patent Application Laid-Open No. 2021-031630), and estimates the direction in which a person present is pointing with the hand. The story-telling robot 1 can then identify faces by comparison with the information registered in the third database 150, and can also identify facial features such as a smile when the person is nearby.
The story-telling robot 1 identifies the sound source direction and separates the sound sources from the acoustic signal picked up by the sound pickup unit 103 using a technique such as beamforming. The story-telling robot 1 may also perform processing such as noise suppression and utterance-section estimation on the picked-up acoustic signal. Furthermore, the story-telling robot 1 converts the separated, speaker-identified speech into text and performs language understanding. In this way, the story-telling robot 1 obtains auditory information.
Reference 1: J. Shotton, A. Fitzgibbon, A. Blake, A. Kipman, M. Finocchio, R. Moore, and T. Sharp, "Real-time human pose recognition in parts from a single depth image," in CVPR, IEEE, June 2011.
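The sound-source direction estimation mentioned above can be illustrated, in greatly simplified form, by a two-microphone time-difference-of-arrival estimate based on cross-correlation. This is only a sketch of the principle and not the beamforming implementation referred to in the text; the sampling rate and microphone spacing are illustrative.

```python
import numpy as np

def estimate_direction(sig_left, sig_right, fs, mic_distance, c=343.0):
    """Estimate a direction of arrival in degrees from two microphone signals
    using the time difference found by cross-correlation.
    Positive angles point toward the right microphone (sound arrives there
    first), negative angles toward the left microphone; 0 degrees is straight
    ahead (far-field assumption)."""
    corr = np.correlate(sig_left, sig_right, mode="full")
    lag = np.argmax(corr) - (len(sig_right) - 1)   # delay in samples
    tdoa = lag / fs
    # Clamp to the physically possible range before taking arcsin.
    sin_theta = np.clip(tdoa * c / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Illustrative test: a short pulse arriving 2 samples later at the right mic,
# so the source lies toward the left microphone.
fs = 16000
pulse = np.zeros(256); pulse[100] = 1.0
left = pulse
right = np.roll(pulse, 2)
print(round(estimate_direction(left, right, fs, mic_distance=0.1), 1))
# prints about -25.4 (degrees, toward the left microphone)
```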
<Overview of the control of the story-telling robot 1 during story-telling>
Next, an overview of the control of the story-telling robot 1 during story-telling will be described.
Previous studies have shown that reading to children helps with self-expression, empathic understanding, and two-way communication, and supports children's speaking and listening practice.
In addition, because the robot is embodied, it can make use of gestures and gaze. Previous studies have shown that the frequency and direction of a robot's gaze and simple head movements have social effects that increase engagement with learning materials in educational settings.
Furthermore, the combination of gaze and gestures increases the persuasiveness of a talking robot. Previous studies have shown that gestures are more persuasive when combined with gaze directed at the listener.
However, these research results revealed the problem that children lose interest in the robot as a storyteller; this is thought to be because the robot lacks the emotions and gestures specific to the story.
For this reason, in the present embodiment, the following four points (Narrator, Agency, Engagement, and Education) are taken into account in creating the story and in generating the facial expressions, actions, and speech of the story-telling robot 1.
I. The role of the narrator
The narrator tells the story and sets the point of view. In story-telling, the narrator speaks live to convey the story to the listener, and reading aloud can improve story comprehension and attention. When telling a story by voice, especially in order to attract children, it is necessary to aim for rich expressiveness and to speak according to the situation. For example, when parents tell stories to their children, they often "act out" the characters and the story in their own words, exaggerating, varying, and conveying the words expressively. The expressiveness of the narrator's voice contains social and emotional cues that call for attention and help draw the audience into the characters and the world of the story. In a live telling, the narrator uses direct gaze and gestures to emphasize key points of the story and to guide attention and interest.
II. Agency
Story-telling is directly related to meaning-making. A multimodal approach to story-telling makes it possible to engage the listener in the ordering and arrangement of semiotic resources in the meaning-making process. Body and facial expressions, proximal movements, gaze, eye saccades, speaking style, tone of voice, and volume are examples of modal resources for emotionally rich story-telling. With the addition of media technology, modal resources expand to include image colors, text, motion, music, sound, and more. Combining live story-telling with communication media such as projection and animation can provide a wealth of modal resources that support the audience's agency in the meaning-making process.
III. Engagement
Since immediate actions that respond to the listener's behavior are known to increase engagement, it is important to choose stories that support maintaining short, novel interactions between the robot and the audience. These are either high-level immediacy responses (closed loop), in which the robot responds directly by learning about and reacting to the listener's behavior, or low-level immediacy responses, such as scripted, programmed responses (open loop) to the limited interactions anticipated by limited prompts. Closed-loop responses are effective in increasing engagement. Simple open-loop responses are also effective in increasing engagement. Elements such as the aesthetic appeal of the characters and setting are designed to enhance engagement, and it is important to design immediate actions, such as simple open-loop questions answered by the listener, that make the listener feel part of the story.
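The contrast between open-loop and closed-loop immediacy can be sketched as follows; the prompts, listener-state fields, and routine names are illustrative assumptions rather than content from the publication.

```python
import random

def open_loop_prompt(step: int) -> str:
    """Scripted (open-loop) prompts played at fixed points in the story."""
    prompts = {3: "What do you think the fox will do next?",
               7: "Can you make a happy face like the bear?"}
    return prompts.get(step, "")

def closed_loop_response(listener_attending: bool, listener_smiling: bool) -> str:
    """Immediate (closed-loop) reaction chosen from the listener's current state."""
    if not listener_attending:
        return "lean_forward_and_call_name"
    if listener_smiling:
        return "smile_back_routine"
    return "keep_gaze_on_listener"

# One story step: play any scripted prompt, then react to the sensed state.
step = 3
print(open_loop_prompt(step))
print(closed_loop_response(listener_attending=random.random() > 0.2,
                           listener_smiling=True))
```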
IV. Education
Reading to children promotes language and emotional literacy. Stories should be designed to encourage positive values and two-way dialogue between the children and the narrator. A simple question-and-answer exchange can encourage children to use their imagination and think about the consequences of actions. Educational elements can also be added over time to teach children the larger concepts contained in the story.
The above four points are only examples, and the elements to be considered are not limited to these; other elements may also be included.
<Content example>
Here, an example of the content 500 (FIG. 7) used for story-telling will be described.
The content 500 used in the embodiment was designed by its creators to, for example, support the audience's agency and engagement, facilitate learning of the educational value of the story, and support the role of the narrator. Its components consist of the story, animated projections, sound, and the robot's performance and interaction with the audience. The story was created together with writers and educators. A creative director collaborated with designers, illustrators, animators, voice artists, and sound designers to produce the projected visuals, character performances, and sounds for the story elements. Based on the finding that story-specific eye and body gestures promote interaction between the story-telling robot 1 and the listener, new animation movements for the robot's body and eyes dedicated to story interaction were designed for the content in order to enhance the robot's persuasiveness and to convey that the robot empathizes with the characters in the story.
<Example of the story-telling robot 1's actions matched to the content>
Here, an example of the story-telling robot 1's actions matched to the content will be described.
The facial expressions and actions of the story-telling robot 1 shown in FIGS. 4 and 5 are designed to increase persuasiveness and interactivity, and reflect a realistic repertoire of gestures that a human narrator would make.
These include animated gestural movements of the body and eye saccades, and fall into three of the four categories of the gesture classification scheme known as "Kendon's Continuum" (see Reference 2). These gestures consist of gesticulations, emblems, and pantomimes; hand gestures, the fourth category of Kendon's continuum, are difficult because the robot has no hands.
Reference 2: D. McNeill, "Hand and Mind: What Gestures Reveal about Thought," University of Chicago Press, 1992.
A gesticulation is a movement that embodies a meaning related to the accompanying speech. For example, when expressing a positive opinion, the story-telling robot 1 realizes a gesticulation by "nodding".
An emblem is an idiomatic sign with an agreed-upon meaning, such as smiling to express happiness. The story-telling robot 1 therefore realizes an emblem by making a smiling expression to express happiness.
A pantomime is a gesture or series of gestures that tells a story, usually performed without words. For example, the robot can mimic and reflect the sadness of a character in the story by exaggeratedly bending forward or lowering its eyes.
 The ability to use the eyes to guide and interpret social behavior is also a central aspect of social interaction. For this reason, in the embodiment, the pantomimes were designed with particular attention to eye movements. In addition to gestural movements designed in relation to the emotions in the story, a library of eye movements deployed during emotional routines was designed and stored in the third database 150 of the storytelling robot 1 in order to maintain the impression that the robot is actively listening to and engaged with the story. These consist of subtly distributed saccadic movements. The storytelling robot 1 also uses eye tracking to change its gaze direction and thereby support interaction and engagement with the listener.
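 As a concrete illustration only, the following is a minimal sketch (not the patent's implementation) of how such an emotion-keyed library of body gestures and saccadic eye movements might be stored and queried. The routine names and the small-jitter saccade model are assumptions introduced for illustration.

```python
# A hypothetical emotion-keyed library of body gestures and saccadic eye
# movements, loosely following the gesture categories described above.
import random
from dataclasses import dataclass, field

@dataclass
class ExpressionRoutine:
    body_gesture: str                                  # e.g. "nod", "lean_forward"
    eye_targets: list = field(default_factory=list)    # gaze offsets in degrees

def make_saccades(center=(0.0, 0.0), jitter_deg=2.0, n=5):
    """Generate subtly distributed saccade targets around a gaze center."""
    cx, cy = center
    return [(cx + random.uniform(-jitter_deg, jitter_deg),
             cy + random.uniform(-jitter_deg, jitter_deg)) for _ in range(n)]

# Library indexed by the emotion appearing in the story.
EXPRESSION_LIBRARY = {
    "joy":     ExpressionRoutine("nod", make_saccades(jitter_deg=3.0)),
    "sadness": ExpressionRoutine("lean_forward_eyes_down", make_saccades(jitter_deg=1.0)),
}

def routine_for(emotion: str) -> ExpressionRoutine:
    # Fall back to a neutral listening posture when the emotion is unknown.
    return EXPRESSION_LIBRARY.get(emotion, ExpressionRoutine("idle", make_saccades()))

print(routine_for("joy"))
```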
 Furthermore, since the creative storytelling content is designed to be experienced live through interaction between the robot and the listener, spoken narration, and the image display device 7, it is important to define the robot's role in the storytelling. For this reason, in the embodiment, creative content such as the image display device 7 and the robot's gestural performance is used to overcome the limitations and challenges of speech delivery by a speech synthesis system.
 The expressiveness of the voice is an important element for attracting attention and emphasizing information in a story. In the embodiment, because speech synthesis systems have a limited range of expression, modulation, and nuance, the following two roles were defined for speech delivery in storytelling.
 In the first role, a human narrator's voice is used as the storytelling element. The robot takes on the role of facilitator of the story, interacting with the story by combining story-specific gestures, sympathetic reactions, and simple questions and answers. The audio signal for this role is a recording of a human narrator's expressive vocal performance of the story. The robot's speech synthesis system is combined with appropriate gestural movements for the question-and-answer interactions.
 The second role is that of the robot as narrator. Here, the robot's speech synthesis system is used for both storytelling and question answering, combined with the robot's gestural movements.
 In the embodiment, all potential interactions provided by combining these elements were mapped for programming into the robot.
<Processing program of the storytelling robot 1>
 Next, an example of the processing program of the storytelling robot 1 will be described.
 First, a framework is needed to transfer the designed storytelling concept to the actual robot. The robot does not simply function as an open loop that reproduces the story. The robot needs to respond to sensory input and trigger behaviors related to bonding, agency, and education. Furthermore, the robot as a narrator needs to function both as a facilitator and as the storyteller itself.
 Therefore, the framework needs to make it possible to define the robot's behavior and to access perceptual results by combining different actuation components and modalities. The programming performed on the robot to achieve this, in other words the "storytelling choreographer", should be easy to use even for people who do not specialize in robotics. The framework should be versatile enough to accommodate a variety of stories, and it is preferable that elements can be reused to create new stories and behaviors.
 In the embodiment, Behavior Trees (BT) were adopted as the basis for this "storytelling choreographer". A behavior tree is a technique for structuring the flow and control of multiple tasks in a decision-making application. Using behavior trees as the framework provides simple semantics that allow non-robotics experts to implement new behaviors.
 The main elements of the behavior tree in the embodiment are described below.
 A behavior tree models behavior as a hierarchical tree composed of nodes.
 The tree is traversed from top to bottom at a constant rate according to well-defined rules, executing the tasks and commands encountered along the way. The status of tasks and commands is reported back up the chain, and the flow changes accordingly.
 Nodes are classified according to their function, for example, as follows.
 I. Composite: controls the flow of the tree itself. These nodes resemble the control structures of traditional programming languages.
 II. Decorator: processes or modifies the status received from its child.
 III. Leaf: where the actual tasks are performed; these are atomic tasks or other functionality that the robot can execute. Therefore, these nodes cannot have children.
 As this classification shows, the behavior tree naturally separates the logic from the actual tasks. When developing a tree, only the leaf nodes need to be considered. The flow can be defined later and constantly rearranged to create new storytelling behaviors or to extend what is already in place. Important advantages of behavior trees are composability and reusability (due to the hierarchical nature of the tree).
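 To make the node taxonomy concrete, the following is a minimal behavior tree sketch, assuming a simple "ticked at a constant rate" execution model. The class and status names are illustrative assumptions, not the patent's actual implementation.

```python
# Minimal behavior tree: Composite (Sequence), Decorator (Inverter), and Leaf nodes.
from enum import Enum

class Status(Enum):
    SUCCESS = 1
    FAILURE = 2
    RUNNING = 3

class Leaf:
    """Atomic task the robot can execute; has no children."""
    def __init__(self, action):
        self.action = action              # callable returning a Status
    def tick(self):
        return self.action()

class Sequence:
    """Composite: ticks children left to right; stops on the first non-SUCCESS child."""
    def __init__(self, children):
        self.children = children
    def tick(self):
        for child in self.children:
            status = child.tick()
            if status != Status.SUCCESS:
                return status             # status is reported back up the chain
        return Status.SUCCESS

class Inverter:
    """Decorator: modifies the status received from its single child."""
    def __init__(self, child):
        self.child = child
    def tick(self):
        status = self.child.tick()
        if status == Status.SUCCESS:
            return Status.FAILURE
        if status == Status.FAILURE:
            return Status.SUCCESS
        return status

# Example: a tiny tree that says a line and then plays an expression routine.
if __name__ == "__main__":
    tree = Sequence([
        Leaf(lambda: (print("SpeakTTS: 'Once upon a time...'"), Status.SUCCESS)[1]),
        Leaf(lambda: (print("ExecuteRoutine: joy"), Status.SUCCESS)[1]),
    ])
    print(tree.tick())                    # traversed top-down at a constant rate
```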
<Implementation of storytelling>
 As shown in FIG. 7, the storytelling implementation uses a "palette" of leaf nodes for the behavior tree engine.
 FIG. 7 is a diagram showing an implementation example of storytelling according to the present embodiment. As shown in FIG. 7, the content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (prompts and audio for voice and speech synthesis, etc.) 503.
 As shown in FIG. 7, in the behavior-tree-based storytelling configuration, the generation unit 144 (FIG. 12; storytelling information creation device) includes, for example, a narrator/agency unit 1441, an engagement unit 1442, and an education unit 1443. Using the input from the sensors (imaging unit 102, sound collection unit 103, and sensor 104 (FIG. 12)) and the content 500, the generation unit 144 generates information to be output to the storytelling robot 1 and information to be output to the image display device 7. The generation unit 144 may be provided in the storytelling robot 1 or in another device (for example, a personal computer or a tablet terminal).
 Here, an example in which another device includes the generation unit 144 will be described.
 These nodes are provided, for example, in a GUI (Graphical User Interface) so that, when creating a new story, developers can "drag and drop" them and combine them with Composite and Decorator nodes to create new storytelling applications.
 Here, examples of the predefined nodes used in the behavior tree for storytelling are shown in FIGS. 8 to 11.
 FIG. 8 is a diagram showing an example of the narrator node. The narrator node advances the story. FIG. 9 is a diagram showing an example of the agency node. The agency node provides multimedia and expression support. FIG. 10 is a diagram showing an example of the engagement node. The engagement node provides closed-loop, immediate reactions of the robot. FIG. 11 is a diagram showing an example of a behavior tree configuration for storytelling. The recognition generation unit 140 may recognize the listener's reaction based on at least one of an image of the listener, an audio signal of the listener's voice, and the result of detecting the listener's posture.
 Note that the structures, connection relationships, nodes, and the like shown in FIGS. 8 to 11 are examples, and the present invention is not limited to these.
 An outline of each node in FIGS. 8 to 11 is described below.
- Narrator (1441) (narrator unit): various blocks are created to advance the narration.
- AudioPlay: allows the narrator to play pre-recorded audio and controls various aspects of the audio.
- SpeakTTS: using a text-to-speech (TTS) engine, the robot utters sentences programmatically. This enables the storytelling robot 1 to react according to the situation during storytelling. The prosody of the utterance can also be controlled, for example, by using tags.
- Agency (1441): comprises multiple blocks (for example, ExecuteRoutine, ProjectorImage, and ProjectorVideo) that handle the robot's expressiveness, launch expressive routines, and support multimedia. For example, ExecuteRoutine allows the execution of predefined robot expression routines. These routines are open-loop combinations of all of the robot's behavioral modalities, such as movement, sound, eyes, and mouth, and express various emotions such as joy and sadness. ProjectorImage and ProjectorVideo use the image display device 7 to display still images and animations.
- Engagement (1442): this block can create immediate, person-responsive behaviors, such as keeping the robot's gaze on a particular person, and comprises multiple blocks (for example, TrackPerson). TrackPerson tracks a person detected using the perception functions (implemented as an additional block, for example GetPeople, which returns information about all persons in the field of view). Combined with other blocks (for example, GetClosestPerson), it can find and track the closest person or take other proximity conditions into account.
- Education (1443): this block comprises several blocks related to setting up questions and answers (for example, AskQuestion, ListensPerson, GetASR). AskQuestion asks a question using speech synthesis and sets the expected answer in the behavior tree memory. ListensPerson indicates which person to listen to when there is a group of people. GetASR obtains the recognized speech from a specified person. It can be used to compare the answer with the expected answer in the case of a yes/no question, and to have the robot react appropriately based on the result. (A sketch that combines these blocks is shown after this list.)
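 Under the assumption of a behavior tree engine like the sketch above, the palette might be composed into a question-and-answer fragment of a story roughly as follows. The leaf classes here are simplified stand-ins for the AudioPlay, ExecuteRoutine, AskQuestion, and GetASR blocks described in the text, not their actual implementation.

```python
# Hypothetical composition of the leaf-node palette into a story fragment.
class Leaf:
    def tick(self):
        raise NotImplementedError

class Sequence:
    """Simplified Composite: succeeds only if every child succeeds, left to right."""
    def __init__(self, *children): self.children = children
    def tick(self):
        return all(child.tick() for child in self.children)

class AudioPlay(Leaf):
    def __init__(self, clip): self.clip = clip
    def tick(self):
        print(f"[AudioPlay] playing pre-recorded narration: {self.clip}")
        return True

class ExecuteRoutine(Leaf):
    def __init__(self, routine): self.routine = routine
    def tick(self):
        print(f"[ExecuteRoutine] expressing: {self.routine}")
        return True

class AskQuestion(Leaf):
    def __init__(self, question, expected):
        self.question, self.expected = question, expected
    def tick(self):
        print(f"[AskQuestion/SpeakTTS] {self.question} (expecting '{self.expected}')")
        return True

class GetASR(Leaf):
    def __init__(self, expected): self.expected = expected
    def tick(self):
        answer = "yes"  # placeholder for the recognized speech of the tracked person
        print(f"[GetASR] heard '{answer}'")
        return answer == self.expected

story_fragment = Sequence(
    AudioPlay("chapter1.wav"),
    ExecuteRoutine("joy"),
    AskQuestion("Did the rabbit do the right thing?", expected="yes"),
    GetASR(expected="yes"),
    ExecuteRoutine("nod"),
)
story_fragment.tick()
```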
<Configuration example of the storytelling robot 1>
 Next, a configuration example of the storytelling robot 1 will be described.
 FIG. 12 is a block diagram showing a configuration example of the storytelling robot 1 according to this embodiment. As shown in FIG. 12, the storytelling robot 1 includes a receiving unit 101, an imaging unit 102, a sound collection unit 103, a sensor 104, a communication device 100, a storage unit 106, a first database 107, a second database 109, a display unit 111, a speaker 112, an actuator 113, a transmission unit 114, a third database 150, and content 500.
 The communication device 100 includes a recognition unit 105 (recognition means), a learning unit 108 (learning means), an action generation unit 110 (action generation means), and a recognition generation unit 140.
 The action generation unit 110 includes an image generation unit 1101, an audio generation unit 1102, a drive unit 1103, and a transmission information generation unit 1104.
<Functions and operations of the storytelling robot 1>
 Next, the function and operation of each functional unit of the storytelling robot 1 will be described with reference to FIG. 1.
 The receiving unit 101 acquires information (for example, e-mail, blog information, news, weather forecasts, etc.) from the Internet via a network, and outputs the acquired information to the recognition unit 105 and the action generation unit 110. Alternatively, when the first database 107 is on the cloud, for example, the receiving unit 101 acquires information from the first database 107 on the cloud and outputs the acquired information to the recognition unit 105.
 The imaging unit 102 is, for example, a CMOS (Complementary Metal Oxide Semiconductor) image sensor or a CCD (Charge Coupled Device) image sensor. The imaging unit 102 also includes a depth sensor. The imaging unit 102 outputs captured images (human information, that is, information about a person; still images, continuous still images, moving images) and depth information to the recognition unit 105 and the action generation unit 110. The storytelling robot 1 may include a plurality of imaging units 102. In this case, the imaging units 102 may be attached, for example, to the front and rear of the housing of the storytelling robot 1. The imaging unit 102 may also be a distance sensor.
 The sound collection unit 103 is, for example, a microphone array composed of a plurality of microphones. The sound collection unit 103 outputs the acoustic signals (human information) picked up by the microphones to the recognition unit 105 and the action generation unit 110. The sound collection unit 103 may sample each acoustic signal picked up by the microphones with the same sampling signal, convert it from an analog signal to a digital signal, and then output it to the recognition unit 105.
 The sensor 104 is, for example, a temperature sensor that detects the environmental temperature, an illuminance sensor that detects the environmental illuminance, a gyro sensor that detects the tilt of the housing of the storytelling robot 1, an acceleration sensor that detects the movement of the housing of the storytelling robot 1, an atmospheric pressure sensor that detects atmospheric pressure, and the like. The sensor 104 outputs the detected values to the recognition unit 105 and the action generation unit 110. The depth sensor may be included in the sensor 104.
 The storage unit 106 stores, for example, the items to be recognized by the recognition unit 105, various values used for recognition (threshold values, constants), algorithms for recognition, and the like.
 The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and acoustic features used for speech recognition, as well as a comparison image database and image features used for image recognition. Each type of data and feature will be described later. The first database 107 may be placed on the cloud and connected via a network.
 The second database 109 stores data on relationships between people used during learning, such as social components, social norms, social customs, psychology, and the humanities. The second database 109 may be placed on the cloud and connected via a network.
 The communication device 100 recognizes interactions that occur between the storytelling robot 1 and a person, or interactions that occur between multiple people, and learns human emotional interaction based on the recognized content and the data stored in the second database 109. The communication device 100 then generates the social abilities of the storytelling robot 1 from the learned content. Social abilities are, for example, abilities for person-to-person interaction, such as dialogue, behavior, understanding, and empathy.
 The recognition unit 105 recognizes interactions that occur between the storytelling robot 1 and a person, or between multiple people. The recognition unit 105 acquires the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the values detected by the sensor 104. The recognition unit 105 may also acquire the information received by the receiving unit 101. Based on the acquired information and the data stored in the first database 107, the recognition unit 105 recognizes the interaction occurring between the storytelling robot 1 and a person or between multiple people. The recognition method will be described later. The recognition unit 105 outputs the recognition results (features related to sound, feature information related to human behavior) to the learning unit 108. The recognition unit 105 performs well-known image processing (for example, binarization, edge detection, clustering, image feature extraction, etc.) on the image captured by the imaging unit 102. The recognition unit 105 performs well-known speech recognition processing (sound source identification, sound source localization, noise suppression, speech section detection, sound source extraction, acoustic feature calculation, etc.) on the acquired acoustic signal. Based on the recognition results, the recognition unit 105 extracts the voice signal (or acoustic signal) of the target person, animal, or object from the acquired acoustic signal and outputs the extracted signal to the action generation unit 110 as a recognition result. Based on the recognition results, the recognition unit 105 also extracts the image of the target person or object from the acquired image and outputs the extracted image to the action generation unit 110 as a recognition result.
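 As an illustration only, the "well-known image processing" steps mentioned above (binarization, edge detection, feature extraction) could be sketched with OpenCV roughly as follows. This is a generic preprocessing example, not the recognition unit 105 itself.

```python
# Generic image preprocessing of the kind referred to above:
# binarization, edge detection, and local feature extraction.
import cv2

def preprocess(image_path):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    # Binarization with Otsu's threshold.
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    # Edge detection.
    edges = cv2.Canny(gray, 100, 200)
    # Image feature extraction (ORB keypoints and descriptors).
    orb = cv2.ORB_create()
    keypoints, descriptors = orb.detectAndCompute(gray, None)
    return binary, edges, keypoints, descriptors
```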
 The learning unit 108 learns human emotional interaction using the recognition results output by the recognition unit 105 and the data stored in the second database 109. The learning unit 108 stores the model generated by learning. The learning method will be described later.
 The recognition generation unit 140 recognizes the listener's reaction based on at least one of an image of the listener, an audio signal of the listener's voice, and the result of detecting the listener's posture, and outputs information based on the recognition result to the action generation unit 110 and the image display device 7. A configuration example and an operation example of the recognition generation unit 140 will be described later with reference to FIG. 13.
 During storytelling, the action generation unit 110 generates facial expression information and motion information for the robot based on the information generated by the generation unit 144 (FIG. 13) included in the recognition generation unit 140. The action generation unit 110 acquires the information received by the receiving unit 101, the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the recognition results from the recognition unit 105. The action generation unit 110 generates actions toward the user (utterances, gestures, images) based on the learned results and the acquired information.
 The image generation unit 1101 generates an output image (a still image, continuous still images, or a moving image) to be displayed on the display unit 111 based on the learned results and the acquired information, and causes the display unit 111 to display the generated output image. In this way, the action generation unit 110 causes the display unit 111 to display animations such as facial expressions and to present images to the user, thereby communicating with the user. The displayed images include images corresponding to the movement of human eyes, images corresponding to the movement of a human mouth, information such as the user's destination (maps, weather maps, weather forecasts, information on shops and resorts, etc.), and images of a person making a video call to the user via the Internet.
 The audio generation unit 1102 generates an output audio signal to be output from the speaker 112 based on the learned results and the acquired information, and causes the speaker 112 to output the generated signal. In this way, the action generation unit 110 communicates with the user by outputting audio signals from the speaker 112. The output audio signals include the voice assigned to the storytelling robot 1, the voice of a person making a video call to the user via the Internet, and the like.
 The drive unit 1103 generates a drive signal for driving the actuator 113 based on the learned results and the acquired information, and drives the actuator 113 with the generated signal. In this way, the action generation unit 110 controls the motion of the storytelling robot 1 to express emotions and the like, and communicates with the user.
 The transmission information generation unit 1104 generates, based on the learned results and the acquired information, transmission information (audio signals, images) that the user wants to send, for example, to another user with whom the user is conversing over the network, and causes the transmission unit 114 to transmit the generated information.
 The display unit 111 is a liquid crystal display device, an organic EL (Electro Luminescence) display device, or the like. The display unit 111 displays the output image output by the image generation unit 1101 of the communication device 100.
 The speaker 112 outputs the output audio signal generated by the audio generation unit 1102.
 The actuator 113 drives the moving parts in accordance with the drive signal output by the drive unit 1103.
 The transmission unit 114 transmits the transmission information output by the transmission information generation unit 1104 to the destination via the network.
 The third database 150 stores listeners' facial images associated with names and identification information, and stores the library of eye movements deployed during emotional routines. The third database 150 also stores data used for gesture recognition, language models used for speech recognition, and the like. The third database 150 may be placed on the cloud and connected via a network.
 As described above, the content 500 includes, for example, robot expression routines (joy, sadness, etc.) 501, image media (photographs, still images, moving images, etc.) 502, and acoustic signals (prompts and audio for voice and speech synthesis, etc.) 503. The content 500 may be placed on the cloud and connected via a network.
 The storytelling robot 1 communicates with an individual or with multiple people 2 using, for example, the method described in Japanese Patent Application No. 2020-108946.
<Configuration and operation example of the recognition generation unit 140>
 Next, the configuration and an operation example of the recognition generation unit 140 will be described.
 FIG. 13 is a diagram showing a configuration example of the recognition generation unit 140 according to this embodiment. As shown in FIG. 13, the recognition generation unit 140 includes, for example, a face recognition unit 141, a gesture recognition unit 142, a speech recognition unit 143, and a generation unit 144. As described above, the generation unit 144 includes, for example, a narrator/agency unit 1441, an engagement unit 1442, and an education unit 1443.
 The recognition generation unit 140 performs face recognition, gesture recognition, and speech recognition. Using the input from the sensors (imaging unit 102, sound collection unit 103, sensor 104) and the content 500, the recognition generation unit 140 generates information to be output to the action generation unit 110 and information to be output to the image display device 7.
 The face recognition unit 141 refers to the data stored in the third database 150 and recognizes the face of a person in the captured image using a well-known method. If no corresponding data is stored in the third database 150, the face recognition unit 141 assigns a name and identification information to the recognized face image and stores it in the third database 150.
 The gesture recognition unit 142 refers to the data stored in the third database 150 and, using well-known methods, detects and tracks the position and tilt of the person's head, the orientation of the body, and the position of the hands in the captured image.
 The speech recognition unit 143 refers to the data stored in the third database 150 and, using well-known methods, performs processing such as sound source localization, sound source separation, noise suppression, speech section detection, and speaker identification.
 The narrator/agency unit 1441 is a trained model that was trained using, as teacher data, information detected by the sensors during storytelling, such as a human narrator's body and facial expressions, proximal movements, gaze, eye saccades, manner of speaking, tone of voice, and volume. During storytelling, the narrator/agency unit 1441 receives the information detected by the sensors and outputs, for example, how to move the neck, the gaze, eye and mouth movements, the manner of speaking, the tone of voice, the volume, and the like.
 The engagement unit 1442 is a trained model that was trained using, as teacher data, the information detected by the sensors during storytelling and the information output by the narrator/agency unit 1441, for example, immediate behaviors responding to the listener's actions. During storytelling, the engagement unit 1442 receives at least one of the information detected by the sensors and the information output by the narrator/agency unit 1441, and outputs, for example, immediate behaviors responding to the listener's actions.
 The education unit 1443 is a trained model that was trained using, as teacher data, the information detected by the sensors during storytelling and the information output by the engagement unit 1442, for example, information that encouraged positive values and two-way conversation between the narrator and the listener. During storytelling, the education unit 1443 receives at least one of the information detected by the sensors and the information output by the engagement unit 1442, and outputs, for example, information that encourages positive values and two-way conversation between the narrator and the listener.
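 The chain of trained models described above (narrator/agency, then engagement, then education) might be wired together roughly as in the following sketch, assuming each model exposes a simple predict interface. The interfaces and field names are assumptions introduced for illustration, not the patent's API.

```python
# Hypothetical chaining of the three trained models in the generation unit 144.
from typing import Any, Dict

class TrainedModel:
    """Stand-in for a learned model with a predict() interface."""
    def __init__(self, name): self.name = name
    def predict(self, features: Dict[str, Any]) -> Dict[str, Any]:
        # A real model would map sensor features to behavior parameters here.
        return {f"{self.name}_output": features}

narrator_agency = TrainedModel("narrator_agency")   # 1441
engagement      = TrainedModel("engagement")        # 1442
education       = TrainedModel("education")         # 1443

def generate(sensor_info: Dict[str, Any]) -> Dict[str, Any]:
    na_out  = narrator_agency.predict(sensor_info)              # gaze, prosody, head motion
    eng_out = engagement.predict({**sensor_info, **na_out})     # immediate reactions
    edu_out = education.predict({**sensor_info, **eng_out})     # dialogue prompts
    return {**na_out, **eng_out, **edu_out}

print(generate({"face": "smiling", "speech": "yes", "posture": "leaning_forward"}))
```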
<Data stored in the first database>
 Next, examples of the data stored in the first database 107 will be described. The first database 107 stores, for example, a language model database, an acoustic model database, a dialogue corpus database, and a comparison image database.
 The language model database stores language models. A language model is a probabilistic model that assigns, to an arbitrary character string, the probability that it is a Japanese sentence or the like. The language model is, for example, an N-gram model, a hidden Markov model, or a maximum entropy model.
 The acoustic model database stores sound source models. A sound source model is a model used to identify the sound source of a collected acoustic signal.
 An acoustic feature is a feature calculated after transforming the collected acoustic signal into a frequency-domain signal by a Fast Fourier Transform. As acoustic features, for example, the static Mel-Scale Log Spectrum (MSLS), delta MSLS, and one delta power are calculated at predetermined intervals (for example, every 10 ms). MSLS uses spectral features as features for acoustic recognition and is obtained by applying an inverse discrete cosine transform to MFCCs (Mel Frequency Cepstrum Coefficients).
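 As a rough illustration only, MSLS-like features can be computed by taking MFCCs and applying an inverse DCT, together with delta features, at a 10 ms hop. The following sketch uses librosa and scipy under assumed parameter choices (16 kHz audio, 13 coefficients) and is not the patent's implementation.

```python
# Minimal sketch: MSLS-like features via inverse DCT of MFCCs, plus deltas.
import numpy as np
import librosa
from scipy.fftpack import idct

def msls_features(wav_path, n_coeff=13, hop_ms=10):
    y, sr = librosa.load(wav_path, sr=16000)
    hop = int(sr * hop_ms / 1000)                        # 10 ms frame shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_coeff, hop_length=hop)
    msls = idct(mfcc, axis=0, norm='ortho')              # static mel-scale log spectrum
    delta_msls = librosa.feature.delta(msls)             # delta MSLS
    power = librosa.feature.rms(y=y, hop_length=hop)     # per-frame power
    delta_power = librosa.feature.delta(power)           # one delta power per frame
    return np.vstack([msls, delta_msls, delta_power])    # (2 * n_coeff + 1, frames)
```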
 The dialogue corpus database stores a dialogue corpus. The dialogue corpus is a corpus used when the storytelling robot 1 and the user converse, for example, scenarios corresponding to the contents of the dialogue.
 The comparison image database stores images used, for example, for pattern matching. The images used for pattern matching include, for example, images of the user, images of the user's family, images of the user's pets, and images of the user's friends and acquaintances.
 An image feature is, for example, a feature extracted from an image of a person or object by well-known image processing.
 Note that the above is only an example, and the first database 107 may store other data.
<Data stored in the second database>
 Next, examples of the data stored in the second database 109 will be described. The second database 109 stores, for example, social components, social norms, data on psychology, and data on the humanities.
 Social components are, for example, age, gender, occupation, and relationships between people (parent and child, married couple, lovers, friends, acquaintances, coworkers, neighbors, teacher and student, etc.).
 Social norms are rules and manners for individuals and between people, and are associated with utterances, gestures, and the like corresponding to age, gender, occupation, and the relationships between people.
 Data on psychology is, for example, data on findings obtained from past experiments and studies (for example, the attachment relationship between mother and infant, complexes such as the Oedipus complex, conditioned reflexes, fetishism, etc.).
 Data on the humanities is, for example, data on religious rules, customs, national character, regional character, and acts, behaviors, and utterances characteristic of a country or region. For example, for Japanese people, there is data such as expressing agreement by nodding rather than saying it in words. Data on the humanities also includes, for example, data on what is considered important and what is prioritized in each country or region.
<Example of processing procedure>
 Next, an example of the processing procedure will be described. FIG. 14 is a flowchart showing an example of the procedure of the storytelling process of the storytelling robot 1 according to this embodiment.
 (Step S1) The generation unit 144 generates the content to be used for storytelling. The communication device 100 may instead acquire content generated by an external device, for example, via the receiving unit 101.
 (Step S2) The recognition generation unit 140 acquires sensor information (the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the values detected by the sensor 104).
 (Step S3) The face recognition unit 141 of the recognition generation unit 140 detects an image including the listener's face from the captured image and performs face recognition of the listener. If the listener's face image is not registered in the third database 150, the recognition generation unit 140 acquires the listener's name, for example, by speaking to the listener and having the listener say their name.
 (Step S4) Based on the acquired content, the generation unit 144 outputs to the image display device 7 the image to be displayed and the acoustic signal to be output by the image display device 7. The action generation unit 110 then starts reading the content aloud, using the acquired content and the listener's name.
 (Step S5) The recognition generation unit 140 acquires sensor information (the image captured by the imaging unit 102, the acoustic signal picked up by the sound collection unit 103, and the values detected by the sensor 104).
 (Step S6) The face recognition unit 141 recognizes the orientation and expression of the listener's face from the captured image and the values detected by the sensor 104. The gesture recognition unit 142 detects the listener's movements from the captured image and the values detected by the sensor 104. The speech recognition unit 143 performs speech recognition on the acoustic signal picked up by the sound collection unit 103.
 (Step S7) The generation unit 144 generates facial expression information and motion information for the storytelling robot 1 based on the acquired content and the acquired sensor information. Based on the generated facial expression and motion information, the action generation unit 110 then generates the images to be displayed on the display units 111a and 111b corresponding to the eyes of the storytelling robot 1 and on the display unit 111c corresponding to its mouth, the audio signal to be output from the speaker 112, and the drive signal for driving the actuator 113.
 (Step S8) The recognition generation unit 140 determines whether the storytelling has ended. The end of the storytelling is not limited to the end of the content; the storytelling may be ended partway through the content depending on the listener's reaction. If the content is long, the recognition generation unit 140 may also end it partway through, depending on how long the storytelling has taken and the listener's reaction. If the recognition generation unit 140 determines that the storytelling has ended (step S8; YES), the process ends. If it determines that the storytelling has not ended (step S8; NO), the process returns to step S5.
 The processing procedure shown in FIG. 14 is an example and is not limited to this. Some processes may be performed in parallel, and the order may be changed. The listener's name may also be acquired during the storytelling, after the storytelling has started.
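 A minimal sketch of this loop (steps S1 to S8), with trivial stand-in components so that it runs end to end, might look as follows. The class and function names are illustrative assumptions, not the patent's API.

```python
# Illustrative main loop for the storytelling process of FIG. 14 (steps S1-S8).
class Sensors:
    def __init__(self): self.t = 0
    def read(self):                                   # S2/S5: image, audio, detected values
        self.t += 1
        return {"frame": self.t}

class Recognizer:
    def recognize_face(self, info):                   # S3: identify the listener
        return {"name": "Hana"}
    def recognize(self, info):                        # S6: face, gesture, speech recognition
        return {"attentive": True}

class Generator:
    def media_for(self, content): return content["images"]
    def react(self, content, reaction):               # S7: expression and motion information
        return ("smile" if reaction["attentive"] else "neutral", "nod")
    def is_finished(self, content, info):             # S8: end of content or listener reaction
        return info["frame"] >= len(content["images"])

def storytelling_loop(content):                       # S1: content generated or acquired beforehand
    sensors, recognizer, generator = Sensors(), Recognizer(), Generator()
    info = sensors.read()                             # S2
    listener = recognizer.recognize_face(info)        # S3
    print("S4: projecting", generator.media_for(content), "for", listener["name"])
    while True:
        info = sensors.read()                         # S5
        reaction = recognizer.recognize(info)         # S6
        expression, motion = generator.react(content, reaction)
        print("S7: robot renders", expression, motion)
        if generator.is_finished(content, info):      # S8
            break

storytelling_loop({"images": ["page1.png", "page2.png", "page3.png"]})
```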
 The images and audio of the content displayed on the image display device 7 may be changed partway through the storytelling according to the listener's reactions and the like. For example, the content may be a story with multiple branches, and two or more pieces of music may be prepared to be played as the story progresses. Such content may be generated, for example, by a creator or the like operating the generation unit 144 configured as shown in FIGS. 7 to 11. The generation unit 144 may also change at least one of the image to be displayed on the image display device 7 during storytelling, the acoustic signal to be output by the image display device 7 during storytelling, and the audio signal, facial expression data, and motion data of the storytelling robot 1 during storytelling.
<Evaluation>
 In order to evaluate the storytelling robot 1 described above, the following two evaluations were performed.
 (First evaluation)
 In the first evaluation, storytelling was performed in the following three different conditions.
 I. The robot is used as the narrator. In this case, the robot tells the story directly. In the evaluation below, this condition is called "direct". The content is displayed on a TV screen behind the robot.
 II. The story is told by a pre-recorded voice of someone else. The robot's role is to facilitate the storytelling process, for example by engaging people in dialogue. In the evaluation below, this condition is called "facilitator". The content is displayed on a TV screen behind the robot.
 III. Only the content of the story is displayed on a tablet and read aloud using traditional storytelling techniques. In this case, the same content, including the multimedia content and other assets, is used. In the evaluation below, this condition is called "tablet". In this case, no robot is used.
 In the evaluation, the robot's sense of agency (capabilities such as animation and expression routines, gaze, and saccades) was controlled in order to further examine the effect of using the communication channels specific to an embodied agent.
 In the first evaluation, disabling these capabilities is called "AGENCY OFF" and enabling them is called "AGENCY ON". With "AGENCY OFF", the robot is merely a prop with no behavior.
 For the first evaluation, 30 subjects (15 adult women and 15 adult men) were recruited and evaluated the three conditions described above (direct, facilitator, tablet).
 First, the subjects (listeners) were presented with a storytelling performance in each of the three conditions and were asked whether they "liked" or "disliked" it, with the following instruction: a "like" is given only when the storyteller's performance satisfies the category description. Subjects were allowed to give multiple "likes" if they were satisfied.
 FIG. 15 is a diagram showing an example of the first evaluation results.
 In FIG. 15, the items are defined as follows.
- "Content": simply conveying the content of the story is the minimum requirement for evaluation.
- "Persuasiveness": the point of evaluation is whether the performance is persuasive and credible as a whole.
- "Realism": the points of evaluation are how essential the elements included in the story are, and the overall feeling that the delivery brings the story to life.
- "Interactivity": the point of evaluation is whether the subject feels able to participate in the storytelling process.
 As shown in FIG. 15, the use of the robot was rated highly in the "agency on" case, in which the robot's agency was active. This leads to the conclusion that, with "agency on", using the robot raises the user's expectations, and the use of the robot only becomes meaningful when the robot's communication modalities are tapped. This indicates that the robot's expressiveness and behavior promote engagement and give meaning to the interaction.
 (Second evaluation)
 The second evaluation assessed attention allocation.
 A central function of human attention is to select the objects of current interest and ignore the remaining surrounding objects. Neuroscientists classify attention into two distinct functions: bottom-up attention and top-down attention. Bottom-up attention is thought to operate on visual features and unconsciously emphasize the regions containing the most salient information. Many studies on attention record human eye movements to represent changes in the locus of attention. These studies show that bottom-up attention allocation is generated from feature contrasts in color, motion, faces, and other visual and auditory dimensions. The second evaluation examined whether the robot's agency could capture human attention and shift it away from the TV screen (content) to the robot.
 As described above, the storytelling robot 1 performs not merely movements but meaningful behaviors accompanied by socially meaningful multimodal actions. Therefore, rather than examining the causal relationship between random movements and attention, the evaluation examined the effect of the socially relevant communication affordances of the robot.
 In the second evaluation, only the image display device 7 displaying the content images and the storytelling robot 1 placed in front of it were presented, and the subjects were asked to perform a task. In the second evaluation, it was assumed that the orientation of the subject's head indicates the direction of attention. To visualize how the subject's attention moves between the objects (image display device 7, storytelling robot 1) while performing the task, the "head orientation" was set to 1 when the subject was facing the robot and to 0 when facing the image display device 7. In the second evaluation, the content of the story displayed on the image display device 7 was kept constant in order to compare the effect of the robot on attention allocation.
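 As an aside, the binary head-orientation signal described here can be summarized as an attention-allocation statistic. The following is a small sketch (with made-up sample data, not measured results) of how the fraction of time spent attending to the robot and the number of attention shifts toward it could be computed.

```python
# Summarizing a binary head-orientation time series
# (1 = facing the storytelling robot 1, 0 = facing the image display device 7).
def attention_stats(head_orientation):
    n = len(head_orientation)
    time_on_robot = sum(head_orientation) / n                  # fraction of samples on robot
    shifts_to_robot = sum(
        1 for prev, cur in zip(head_orientation, head_orientation[1:])
        if prev == 0 and cur == 1                              # 0 -> 1 transitions
    )
    return time_on_robot, shifts_to_robot

# Made-up example sequence for illustration only.
sample = [0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1]
print(attention_stats(sample))   # -> (0.5, 3)
```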
 FIG. 16 is a diagram showing an example of the second evaluation results. In FIG. 16, the horizontal axis is discrete time and the vertical axis is head orientation (1 or 0). Graph g401 is an example of the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency on". Graph g411 is an example of the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency off". Line g402 shows the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency on". Line g403 is for reference and shows the temporal change in the listener's attention allocation when only the image display device 7 was used and the storytelling robot 1 was not placed in front of it. Line g412 shows the temporal change in the listener's attention allocation when the image display device 7 and the storytelling robot 1 were used with "agency off".
 FIG. 16 shows that the socially relevant behaviors and actions of the robot attracted the subjects' attention: the change in head orientation from 0 (image display device 7) to 1 (storytelling robot 1) confirms that the robot captured the subjects' attention.
 As shown by line g403 in graphs g401 and g411, when the robot was not used, the subjects fixated only on the story images shown on the screen of the image display device 7.
 As shown by lines g402 and g412, "agency on" tends to make the subjects more aware of the robot than "agency off": the head orientation changes toward the robot more frequently and for longer periods.
 In FIG. 16, only the results of two subjects are shown in order to make the attention allocation easier to understand, but a consistent tendency was confirmed across all 30 subjects.
 As described above, in the present embodiment, elements of storytelling such as narration, agency, connection, and education were identified and synthesized into the robot. That is, in the present embodiment, the storytelling function of the robot was designed by exploiting the robot's communication channels, such as the expressiveness of its gestures and eye movements and the use of projected content and an emotional voice (i.e., acting as a facilitator).
 Thus, according to the present embodiment, a wide variety of storytelling can be created by collecting various pieces of content, and during the reading of the generated content the storytelling robot 1 can enrich the communication by adding motions, giving back-channel responses, and asking questions at the beginning, so that storytelling comparable to that performed by a parent or a storytelling professional, together with its educational effect, can be obtained. According to the present embodiment, meaningful interaction between humans and the robot is realized and acceptance by users is maximized.
 Because the approach of the present embodiment maintains long-term attention and greater acceptance, it can inspire the design of future creative content for social robots.
 Furthermore, according to the present embodiment, using an embodied agent such as a robot for storytelling can better maintain the listener's interest when its communication affordances (embodiment, expressiveness, and other modalities) are exploited with "agency on".
 As described above, the storytelling robot 1 is not limited to storytelling itself; a similar effect can be obtained when it plays the role of a "facilitator" that supports the storytelling.
 (Flow of cognition, learning, and social ability)
 Next, the flow of cognition and learning performed by the communication device 100 of the embodiment will be described. FIG. 17 is a diagram showing the flow of cognition, learning, and social ability performed by the communication device 100 of the embodiment.
 A recognition result 201 is an example of a result recognized by the recognition unit 105. The recognition result 201 is, for example, an interpersonal relationship, an interpersonal interaction, or the like.
 Multimodal learning and understanding 211 is an example of the learning performed by the learning unit 108. The learning method 212 is, for example, machine learning. The learning targets 213 are, for example, social constructs, social models, psychology, and the humanities.
 Social abilities 221 are social skills such as empathy, individualization, adaptability, and emotional affordance.
 (Data to be recognized)
 Next, examples of the data recognized by the recognition unit 105 will be described.
 FIG. 18 is a diagram showing examples of data recognized by the recognition unit 105 according to the embodiment. In the embodiment, personal data 301 and interpersonal relationship data 351 are recognized as shown in FIG. 18.
 Personal data concern behavior that occurs within a single person, and consist of data acquired by the imaging unit 102 and the sound pickup unit 103 and data obtained by applying voice recognition processing, image recognition processing, and the like to the acquired data. Personal data include, for example, voice data, semantic data obtained by processing the voice, voice volume, voice intonation, uttered words, facial expression data, gesture data, head posture data, face direction data, gaze data, co-occurrence expression data, and physiological information (body temperature, heart rate, pulse rate, and the like). Which data are used may be selected by, for example, the designer of the storytelling robot 1. In that case, based on an actual conversation or demonstration between two people, the designer of the storytelling robot 1 may specify which features of the personal data are important for communication. The recognition unit 105 also recognizes the user's emotion as personal data based on information extracted from the acquired speech and images; in this case it relies on, for example, voice volume and intonation, utterance duration, and facial expressions. The storytelling robot 1 of the embodiment then acts so as to keep the user's emotions positive and to maintain a good relationship with the user.
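 As a minimal illustration of how such per-person features could be grouped in software (the field names and types below are assumptions made for illustration, not the patent's data format), one could use a simple record type:

# Illustrative sketch only: a container for the per-person ("personal data")
# features listed above. Field names are assumptions, not the patent's schema.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class PersonalData:
    person_id: str
    transcript: str = ""                 # recognized utterance text
    voice_volume: float = 0.0            # e.g. RMS level
    voice_pitch_variation: float = 0.0   # intonation proxy
    words: List[str] = field(default_factory=list)
    facial_expression: Optional[str] = None
    gesture: Optional[str] = None
    head_pose: Optional[Tuple[float, float, float]] = None  # (yaw, pitch, roll)
    gaze_target: Optional[str] = None    # e.g. "robot", "display"
    heart_rate: Optional[float] = None   # physiological information

# Example usage with hypothetical values
sample = PersonalData(person_id="user01", transcript="read it again!",
                      voice_volume=0.6, facial_expression="smile",
                      gaze_target="robot")
print(sample)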
 Here, an example of a method of recognizing the user's social background will be described.
 The recognition unit 105 estimates the user's nationality, hometown, and the like based on the acquired speech and images and on the data stored in the first database 107. The recognition unit 105 extracts the user's daily schedule, such as wake-up time, time of leaving home, time of returning home, and bedtime, based on the acquired speech and images and the data stored in the first database 107. Based on the acquired speech and images, the daily schedule, and the data stored in the first database 107, the recognition unit 105 estimates the user's gender, age, occupation, hobbies, career, preferences, family structure, religion, degree of attachment to the storytelling robot 1, and the like. Since the social background may change, the storytelling robot 1 updates the information about the user's social background based on conversations, images, and the data stored in the first database 107. To enable emotional sharing, the social background and the degree of attachment to the storytelling robot 1 are not limited to attributes that can be entered directly, such as age, gender, and career; they are also recognized from, for example, emotional ups and downs depending on the time of day and the volume and intonation of the voice for particular topics. In this way, the recognition unit 105 also learns things that the user is not aware of, based on daily conversations and the facial expressions shown during them.
 Interpersonal relationship data are data about the relationships between the user and other people. Using interpersonal relationship data in this way makes it possible to use social data. Interpersonal relationship data include, for example, the distance between people, whether the gazes of the people in conversation meet, voice intonation, and voice volume. As described later, the distance between people differs depending on the interpersonal relationship; for example, for a married couple or friends the distance is L1, whereas the distance between business associates is L2, which is larger than L1.
 For example, based on an actual conversation or demonstration between two people, the designer of the storytelling robot 1 may specify which features of the interpersonal data are important for communication. Such personal data, interpersonal relationship data, and information about the user's social background are stored in the first database 107 or the storage unit 106.
 When there are multiple users, for example the user and his or her family, the recognition unit 105 collects and learns personal data for each user and estimates the social background of each person. Such a social background may also be acquired via, for example, a network and the receiving unit 101; in that case the user may enter his or her social background or select items using, for example, a smartphone.
 Here, an example of a method of recognizing interpersonal relationship data will be described.
 The recognition unit 105 estimates the distance (spacing) between the people who are communicating, based on the acquired speech and images and the data stored in the first database 107. The recognition unit 105 detects whether the gazes of the people who are communicating meet, based on the acquired speech and images and the data stored in the first database 107. Based on the acquired speech and the data stored in the first database 107, the recognition unit 105 estimates relationships such as friendship, work colleagues, and relatives or parent and child from the utterance content, voice volume, voice intonation, received e-mails, sent e-mails, and the correspondents of those e-mails.
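 A highly simplified sketch of the kind of geometric checks described here (illustrative only; the patent does not specify these formulas, and the positions, gaze vectors, and tolerance are assumptions) might estimate the distance between two detected people and whether their gaze directions roughly point at each other:

# Illustrative sketch: estimate person-to-person distance and a crude
# "mutual gaze" flag from 2D positions and gaze direction vectors.
# The thresholds and representation are assumptions, not from the patent.
import math

def distance(p1, p2):
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def facing(pos_a, gaze_a, pos_b, tol_deg=15.0):
    """True if person A's gaze vector points roughly toward person B."""
    to_b = (pos_b[0] - pos_a[0], pos_b[1] - pos_a[1])
    ang = math.degrees(math.atan2(to_b[1], to_b[0]) - math.atan2(gaze_a[1], gaze_a[0]))
    ang = (ang + 180.0) % 360.0 - 180.0
    return abs(ang) < tol_deg

def mutual_gaze(pos_a, gaze_a, pos_b, gaze_b):
    return facing(pos_a, gaze_a, pos_b) and facing(pos_b, gaze_b, pos_a)

# Hypothetical positions (metres) and unit gaze vectors
a, b = (0.0, 0.0), (1.2, 0.0)
print(distance(a, b))                       # 1.2
print(mutual_gaze(a, (1, 0), b, (-1, 0)))   # True: they face each other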
 In the initial state of use, the recognition unit 105 may, for example, randomly select one of several combinations of initial values of social background and personal data stored in the first database 107 and start communication with it. If it is difficult to keep the communication with the user going using the behavior generated from the randomly selected combination, the recognition unit 105 may select a different combination.
 (Learning procedure)
 In the embodiment, the learning unit 108 performs learning using the personal data 301 and the interpersonal relationship data 351 recognized by the recognition unit 105 and the data stored in the second database 109.
 Here, social constructs and social norms will be described. In spaces where people take part in social interaction, interpersonal relationships differ depending on, for example, the distance between people. For example, a relationship in which people are 0 to 50 cm apart is an intimate relationship, and a relationship in which people are 50 cm to 1 m apart is a personal relationship. A relationship in which people are 1 to 4 m apart is a social relationship, and a relationship in which people are 4 m or more apart is a public relationship. During learning, such social norms are used as a reward (an implicit reward) indicating whether gestures and utterances conform to them.
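 The distance bands above map naturally onto a small classification rule; the sketch below is only an illustration of that mapping (how the exact boundary values are handled is an assumption):

# Illustrative sketch: map an interpersonal distance (metres) onto the
# proxemic zones described above. Boundary handling is an assumption.
def proxemic_zone(distance_m: float) -> str:
    if distance_m < 0.5:
        return "intimate"   # 0 - 50 cm
    if distance_m < 1.0:
        return "personal"   # 50 cm - 1 m
    if distance_m < 4.0:
        return "social"     # 1 m - 4 m
    return "public"         # 4 m or more

for d in (0.3, 0.8, 2.5, 6.0):
    print(d, proxemic_zone(d))

 A zone label of this kind could then be compared against the robot's gestures and utterances when computing the implicit reward mentioned above.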
 The interpersonal relationship may also be adapted to the environment and the user by how the reward features are set during learning. Specifically, multiple intimacy settings may be provided, such as a rule of rarely speaking to people who are uncomfortable with robots and a rule of actively speaking to people who like robots. Then, in the real environment, the recognition unit 105 may determine which type the user is from the results of processing the user's speech and images, and the learning unit 108 may select the corresponding rule.
 A human trainer may also evaluate the behavior of the storytelling robot 1 and provide a reward (an implicit reward) according to the social constructs and norms the trainer knows.
 FIG. 19 is a diagram showing an example of the agent creation method used by the action generation unit 110 according to the embodiment.
 The area indicated by reference numeral 300 shows the flow from the input through agent creation to the output (the agent).
 The images captured by the imaging unit 102 and the information 310 picked up by the sound pickup unit 103 are information about people (the user, people related to the user, and others) and information about the environment around them. The raw data 302 acquired by the imaging unit 102 and the sound pickup unit 103 are input to the recognition unit 105.
 The recognition unit 105 extracts and recognizes multiple pieces of information from the input raw data 302 (voice volume, voice intonation, utterance content, uttered words, the user's gaze, the user's head posture, the user's face direction, the user's physiological information, the distance between people, whether people's gazes meet, and so on). Using the extracted and recognized information, the recognition unit 105 performs multimodal understanding using, for example, a neural network.
 The recognition unit 105 identifies individuals based on, for example, at least one of the audio signal and the image, and assigns identification information (an ID) to each identified individual. The recognition unit 105 recognizes the actions of each identified person based on at least one of the audio signal and the image. The recognition unit 105 recognizes the gaze of an identified person by, for example, applying well-known image processing and tracking processing to the image. The recognition unit 105 recognizes speech by, for example, applying speech recognition processing (sound source identification, sound source localization, sound source separation, speech segment detection, noise suppression, and the like) to the audio signal. The recognition unit 105 recognizes the head posture of an identified person by, for example, applying well-known image processing to the image. When, for example, two people appear in a captured image, the recognition unit 105 recognizes their interpersonal relationship based on the utterance content, the distance between the two people in the image, and the like. The recognition unit 105 also recognizes (estimates) the social distance between the storytelling robot 1 and the user based on, for example, the results of processing the captured images and the collected audio signal.
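 As a sketch of how such per-modality recognizers might be combined into one multimodal observation (the function names and structure are assumptions; the stubs stand in for real speech and vision models, which are not shown), consider:

# Illustrative pipeline sketch: combine per-modality recognition results
# into a single observation for multimodal understanding. The recognizer
# functions here are stubs standing in for real speech/vision models.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Observation:
    person_id: str
    keywords: List[str]
    gesture: str
    head_pose: Tuple[float, float, float]
    gaze_on_robot: bool
    interpersonal_distance: float

def identify_person(image, audio):      # stub: face/voice identification
    return "user01"

def recognize_speech(audio):            # stub: ASR + keyword extraction
    return ["again", "story"]

def recognize_gesture(image):           # stub: gesture classifier
    return "pointing"

def estimate_head_pose(image):          # stub: head-pose estimator
    return (25.0, -5.0, 0.0)

def build_observation(image, audio) -> Observation:
    yaw, pitch, roll = estimate_head_pose(image)
    return Observation(
        person_id=identify_person(image, audio),
        keywords=recognize_speech(audio),
        gesture=recognize_gesture(image),
        head_pose=(yaw, pitch, roll),
        gaze_on_robot=yaw > 20.0,        # assumption: robot sits to one side
        interpersonal_distance=1.2,      # would come from person detection
    )

print(build_observation(image=None, audio=None))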
 The learning unit 108 performs reinforcement learning 304 rather than deep learning. In the reinforcement learning, learning is performed so as to select the most relevant features (including social constructs and social norms). In this case, the multiple pieces of information used in the multimodal understanding are used as input features. The inputs to the learning unit 108 are, for example, the raw data themselves, a name ID (identification information), the influence of the face, recognized gestures, keywords from the speech, and the like. The output of the learning unit 108 is a behavior of the storytelling robot 1. The output behavior may be anything defined according to the purpose, such as a voice response, a robot routine, or the angle through which the robot should turn. In the multimodal understanding, a neural network or the like may be used for detection; in that case, different bodily modalities may be used to detect human activity. Which features to use may be selected in advance by, for example, the designer of the storytelling robot 1. Furthermore, in the present embodiment, social models and social constructs can be incorporated by using implicit and explicit rewards during learning. The result of the reinforcement learning is the output, namely the agent 305. In this way, the agent used by the action generation unit 110 is created in the present embodiment.
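 To make the role of reinforcement learning concrete, the fragment below sketches a tiny tabular Q-learning agent whose states are discretized multimodal features and whose actions are robot behaviors. This is only an illustration under stated assumptions (the action set, state encoding, and hyperparameters are hypothetical), not the learning algorithm actually used in the embodiment.

# Illustrative sketch: a tiny tabular Q-learning agent. States are
# discretized multimodal features (e.g. proxemic zone + gaze flag) and
# actions are robot behaviours. Hyperparameters are arbitrary assumptions.
import random
from collections import defaultdict

ACTIONS = ["speak_response", "expressive_routine", "turn_towards_user", "stay_quiet"]

class StorytellingAgent:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, state):
        # epsilon-greedy selection over the discrete action set
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])

# Example step with a hypothetical discretized state
agent = StorytellingAgent()
state = ("personal", "gaze_on_robot")
action = agent.act(state)
agent.update(state, action, reward=1.0, next_state=("personal", "gaze_on_display"))
print(action, agent.q[(state, action)])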
 The area indicated by reference numeral 350 shows how the rewards are used.
 The implicit reward 362 is used to learn implicit reactions. In this case, the raw data 302 include the user's reactions, and the multimodal understanding 303 described above is applied to the raw data 302. The learning unit 108 generates an implicit reaction system 372 using the implicit reward 362 and the social models and the like stored in the second database 109. The implicit reward may be obtained through reinforcement learning or may be given by a human. The implicit reaction system may also be a model acquired through learning.
 For learning explicit reactions, for example, a human trainer evaluates the behavior of the storytelling robot 1 and gives a reward 361 according to the social constructs and social norms the trainer knows. For a given input, the agent adopts the action that maximizes the reward. As a result, the agent adopts behaviors (utterances and gestures) that maximize the user's positive feelings.
 The learning unit 108 uses this explicit reward 361 to generate an explicit reaction system 371. The explicit reaction system may be a model acquired through learning. The explicit reward may be given by the user as an evaluation of the behavior of the storytelling robot 1, or the storytelling robot 1 may estimate the reward from the user's utterances and behavior (gestures, facial expressions, and the like), for example from whether the robot took the action the user wanted.
 During operation, the learning unit 108 outputs the agent 305 using these learned models.
 In the embodiment, the explicit reward, which reflects the user's reaction, is given priority over the implicit reward, because the user's reaction is more reliable in communication.
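 A minimal sketch of this prioritization follows (the fallback scheme is an assumption made for illustration; the patent only states that the explicit reward takes precedence):

# Illustrative sketch: combine rewards, letting an explicit (user-given)
# reward take priority over the implicit (norm-based) one when available.
from typing import Optional

def combined_reward(implicit: float, explicit: Optional[float]) -> float:
    """Explicit user feedback, when present, overrides the implicit reward."""
    if explicit is not None:
        return explicit
    return implicit

print(combined_reward(implicit=0.3, explicit=None))   # falls back to implicit: 0.3
print(combined_reward(implicit=0.3, explicit=1.0))    # explicit wins: 1.0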
 As described above, in the storytelling robot 1 of the embodiment, the learning means performs learning using an implicit reward and an explicit reward; the implicit reward is a reward learned multimodally using the feature information about the person, and the explicit reward is a reward based on the result of evaluating the behavior, generated by the action generation means, of the communication device toward the person.
 The storytelling robot 1 of the embodiment further includes a sound pickup unit that picks up acoustic signals and an imaging unit that captures images including the user. The recognition means applies voice recognition processing to the collected acoustic signal to extract feature information about the voice, and applies image processing to the captured images to extract feature information about the human behavior contained in the images. The feature information about the person includes the feature information about the voice and the feature information about the human behavior; the feature information about the voice is at least one of the audio signal, voice volume information, voice intonation information, and the meaning of the utterance, and the feature information about the human behavior is at least one of the person's facial expression information, gesture information, head posture information, face direction information, gaze information, and the distance between people.
 A program for realizing all or part of the functions of the storytelling robot 1 and the recognition generation unit 140 of the present invention may be recorded on a computer-readable recording medium, and all or part of the processing performed by the storytelling robot 1 and the recognition generation unit 140 may be carried out by loading the program recorded on the recording medium into a computer system and executing it. The "computer system" here includes an OS and hardware such as peripheral devices. The "computer system" also includes a WWW system provided with a website-providing environment (or display environment). The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into the computer system. Furthermore, the "computer-readable recording medium" also includes media that hold the program for a certain period of time, such as the volatile memory (RAM) inside a computer system serving as a server or a client when the program is transmitted via a network such as the Internet or a communication line such as a telephone line.
 The program may also be transmitted from a computer system in which it is stored in a storage device or the like to another computer system via a transmission medium or by transmission waves in a transmission medium. Here, the "transmission medium" that transmits the program refers to a medium having the function of transmitting information, such as a network (communication network) like the Internet or a communication line such as a telephone line. The program may also realize only part of the functions described above. Furthermore, it may be a so-called difference file (difference program) that realizes the functions described above in combination with a program already recorded in the computer system.
 Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.
 Reference Signs List: 1 storytelling robot; 101 receiving unit; 102 imaging unit; 103 sound pickup unit; 104 sensor; 100 communication device; 105 recognition unit; 106 storage unit; 107 first database; 108 learning unit; 109 second database; 110 action generation unit; 111 display unit; 112 speaker; 113 actuator; 114 transmitting unit; 140 recognition generation unit; 150 third database; 500 content; 1101 image generation unit; 1102 voice generation unit; 1103 drive unit; 1104 transmission information generation unit; 141 face recognition unit; 142 gesture recognition unit; 143 voice recognition unit; 144 generation unit; 1441 narrator/agent unit; 1442 connection unit; 1443 education unit

Claims (12)

  1.  A storytelling information creation device comprising:
      a generation unit that uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, robot expression routines, image media, and acoustic signals to generate an image to be displayed on an image display device when storytelling is performed, an acoustic signal to be output by the image display device when the storytelling is performed, and an audio signal, facial expression data, and motion data of a storytelling robot when the storytelling is performed.
  2.  The storytelling information creation device according to claim 1, wherein the generation unit comprises:
      a narrator unit for advancing the narration;
      an agency unit that, taking the expressiveness of the storytelling robot into account, activates expressive routines and handles multimedia;
      a connection unit that creates immediate, person-responsive actions that keep the storytelling robot's gaze on a specific person; and
      an education unit that sets questions and answers.
  3.  The storytelling information creation device according to claim 1 or 2, wherein the generation unit is configured as a behavior tree (Behaviour Trees) structure.
  4.  The storytelling information creation device according to claim 1 or 2, wherein the generation unit recognizes the state of the listener based on at least one of the image of the listener, the audio signal obtained by picking up the listener's voice, and the result of detecting the listener's posture, and, based on the recognition result, changes at least one of the image to be displayed on the image display device when the storytelling is performed, the acoustic signal to be output by the image display device when the storytelling is performed, and the audio signal, facial expression data, and motion data of the storytelling robot when the storytelling is performed.
  5.  The storytelling information creation device according to claim 1 or 2, wherein the facial expression data of the storytelling robot are an image corresponding to eyes and an image corresponding to a mouth, and the motion data of the storytelling robot are data relating to the motion of a part corresponding to a neck.
  6.  The storytelling information creation device according to claim 1 or 2, wherein the image display device and the storytelling robot are arranged in close proximity to each other.
  7.  A storytelling robot comprising:
      an imaging unit that captures images;
      a sound pickup unit that picks up acoustic signals;
      a sensor that detects the state of the listener;
      a first display unit corresponding to eyes;
      a second display unit corresponding to a mouth;
      a movable part corresponding to a neck; and
      motion generation means for performing the storytelling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the audio signal, facial expression data, and motion data of the storytelling robot, generated by the storytelling information creation device according to claim 1.
  8.  A storytelling robot comprising:
      an imaging unit that captures images;
      a sound pickup unit that picks up acoustic signals;
      a sensor that detects the state of the listener;
      a first display unit corresponding to eyes;
      a second display unit corresponding to a mouth;
      a movable part corresponding to a neck;
      the storytelling information creation device according to claim 1; and
      motion generation means for performing the storytelling using the image to be displayed on the image display device, the acoustic signal to be output by the image display device, and the audio signal, facial expression data, and motion data of the storytelling robot, generated by the storytelling information creation device.
  9.  The storytelling robot according to claim 7 or 8, which performs storytelling to the listener or facilitates the storytelling for the listener.
  10.  The storytelling robot according to claim 7 or 8, further comprising:
      cognition means for acquiring human information about a person, extracting feature information about the person from the acquired human information, recognizing interactions that occur between the person and a communication device that communicates with the person, and recognizing interactions that occur between people; and
      learning means for multimodally learning human emotional interaction using the extracted feature information about the person,
      wherein the motion generation means generates behavior based on the learned emotional interaction information of the person.
  11.  A storytelling information creation method in which a generation unit uses an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, robot expression routines, image media, and acoustic signals to generate an image to be displayed on an image display device when storytelling is performed, an acoustic signal to be output by the image display device when the storytelling is performed, and an audio signal, facial expression data, and motion data of a storytelling robot when the storytelling is performed.
  12.  A program that causes a computer to use an image of a listener, an audio signal obtained by picking up the listener's voice, a result of detecting the listener's posture, robot expression routines, image media, and acoustic signals to generate an image to be displayed on an image display device when storytelling is performed, an acoustic signal to be output by the image display device when the storytelling is performed, and an audio signal, facial expression data, and motion data of a storytelling robot when the storytelling is performed.
PCT/JP2022/028936 2021-08-10 2022-07-27 Storytelling information creation device, storytelling robot, storytelling information creation method, and program WO2023017732A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021130727 2021-08-10
JP2021-130727 2021-08-10

Publications (1)

Publication Number Publication Date
WO2023017732A1 true WO2023017732A1 (en) 2023-02-16

Family

ID=85200463

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/028936 WO2023017732A1 (en) 2021-08-10 2022-07-27 Storytelling information creation device, storytelling robot, storytelling information creation method, and program

Country Status (1)

Country Link
WO (1) WO2023017732A1 (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004114285A (en) * 2002-09-02 2004-04-15 Sony Corp Robotic device and its behavior control method
JP2007072719A (en) * 2005-09-06 2007-03-22 Nec Corp Story output system, robot device and story output method
JP2011203859A (en) * 2010-03-24 2011-10-13 Fujitsu Frontech Ltd Device and method for outputting voice
JP2013099823A (en) * 2011-11-09 2013-05-23 Panasonic Corp Robot device, robot control method, robot control program and robot system
JP2017010516A (en) * 2015-06-24 2017-01-12 百度在線網絡技術(北京)有限公司 Method, apparatus, and terminal device for human-computer interaction based on artificial intelligence

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725547A (en) * 2023-11-17 2024-03-19 华南师范大学 Emotion and cognition evolution mode identification method based on cross-modal feature fusion network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855795

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE