WO2023080105A1 - Online terminal and program - Google Patents

Online terminal and program

Info

Publication number
WO2023080105A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
image
display
video
partner
Prior art date
Application number
PCT/JP2022/040653
Other languages
English (en)
Japanese (ja)
Inventor
雄一郎 吉川
浩 石黒
恵 川田
ハーメド マハズーン
優太 似田
Original Assignee
国立大学法人大阪大学 (Osaka University)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 国立大学法人大阪大学 (Osaka University)
Publication of WO2023080105A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/431 Generation of visual interfaces for content selection or interaction; Content or additional data rendering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 7/00 Television systems
    • H04N 7/14 Systems for two-way working
    • H04N 7/15 Conference systems

Definitions

  • the embodiments described in this specification relate to online terminals and programs, and more particularly to online terminals and programs used for non-face-to-face interactions, for example.
  • Non-face-to-face dialogue (online dialogue) systems such as video conferencing (web conferences, online conferences) are an attractive way to communicate, especially in light of the recent spread of COVID-19, because they avoid the risk of infection.
  • Non-face-to-face interaction systems include microphone and camera audiovisual systems.
  • In Patent Document 1, the image around the interlocutor's eyes and eyelids is edited so that the interlocutor appears to be looking at the camera regardless of where he or she is actually looking, and it is reported that this system gave the user the sensation of making eye contact.
  • the main purpose of the embodiment is to provide a novel online terminal and program.
  • Another object of this embodiment is to provide an online terminal and program that can express the interlocutor's intention in online dialogue.
  • the embodiment adopted the following configuration in order to solve the above problems.
  • A first embodiment is an online terminal for executing online dialogue through a network, comprising first display means for displaying an image of a dialogue partner on a display, and second display means for displaying on the display, in a form belonging to the dialogue partner's image, a video representation including at least one of a video of the user captured from another's (the dialogue partner's) viewpoint and a video based on the user's behavior or recognition.
  • the online terminal carries out online interaction between the user and the interaction partner through the network.
  • the first display means displays an image of the dialogue partner on the display.
  • The second display means displays, in the form of, for example, a speech bubble that belongs to the video of the conversation partner, a video representation including at least one of a video of the user captured from the viewpoint of the conversation partner and a video based on the user's behavior or recognition.
  • Such a video representation is displayed in a form, such as a balloon, that visually and clearly indicates its attribution to the dialogue partner. The user can therefore easily imagine what the dialogue partner is paying attention to or recognizing, and the dialogue partner's intention can be expressed.
  • the second embodiment is an online terminal subordinate to the first embodiment, and the video including the user includes the video of the user and the video representing the recognition result of the user.
  • The video expression can be generated in a form that includes, for example, a character string "I see", an icon (symbol) of an affirmative facial expression, a "!" mark, an animation showing a nodding motion, video processing effects, movement of the balloon (up and down, growing and shrinking), and the like.
  • In the second embodiment, by displaying a video expression based on the user's recognition result, the conversation partner can be made to share feelings such as the user's satisfaction or dissatisfaction. Conversely, by displaying the nodding or head shaking of the dialogue partner, it becomes easier to understand the dialogue partner's agreement or dissatisfaction, and at the same time the sense that the speech bubble is a display of what is in the dialogue partner's mind can be strengthened.
  • the third embodiment is an online terminal subordinate to the second embodiment, and the image including the user is an image to which characters or symbols indicating the user's recognition result are added.
  • A fourth embodiment is an online terminal dependent on any one of the first to third embodiments, with which at least one of a camera and a microphone is associated, comprising live information acquisition means for acquiring at least one of a video signal from the camera and an audio signal from the microphone as live information, and first judgment means for judging whether or not an important event has occurred based on the live information; the video representation is displayed when it is judged that an important event has occurred.
  • At least one of a camera and a microphone is associated, and the live information acquisition means acquires at least one of the video signal from the camera and the audio signal from the microphone as live information.
  • the first determination means determines whether or not an important event (important word, important video) has occurred based on the live information. Then, the second display means displays the image representation when the first judgment means judges that the important event has occurred.
  • Since the video expression is displayed only when an important event occurs, habituation from constant display is avoided and the user's attention is attracted more easily.
  • A fifth embodiment is an online terminal dependent on the fourth embodiment, comprising second judgment means for judging whether the live information should imply at least one of the user's sense of gaze and sense of listening, and changing means for changing the video representation according to the judgment of the second judgment means.
  • the second determination means determines whether it is necessary to suggest at least one of the user's sense of gaze and sense of listening.
  • the changing means changes the image expression by, for example, moving the image expression closer to or away from the image of the dialogue partner (or avatar) or blurring the image expression according to the determination of the second determination means.
  • According to the fifth embodiment, it is possible to express the user's sense of gaze and sense of listening by changing the video representation.
  • A sixth embodiment is a program executed by an online terminal that performs online dialogue through a network, which causes the processor of the online terminal to display an image of the dialogue partner on the display and to display on the display, in a form belonging to the dialogue partner's image, a video representation including at least one of a video of the user captured from the dialogue partner's viewpoint and a video based on the user's behavior or recognition.
  • the sixth embodiment can also be expected to have the same effect as the first embodiment.
  • According to the embodiments, a video representation, such as a video of the user captured from the dialogue partner's viewpoint, is displayed in a form, such as a speech bubble, that visually and clearly indicates its attribution to the dialogue partner. Therefore, the user can easily imagine what the dialogue partner is attending to and recognizing, and the dialogue partner's intention can be expressed.
  • FIG. 1 is an illustrative view showing one example of an online dialogue system of an embodiment.
  • FIG. 2 is a block diagram showing the electrical configuration of the user terminal of the embodiment shown in FIG.
  • FIG. 3 is a block diagram showing the electrical configuration of the server of the embodiment shown in FIG.
  • FIG. 4 is an illustrative view showing a first display example of the display on the user side in the embodiment.
  • FIG. 5 is an illustrative view showing a second display example of the display on the user side in the embodiment.
  • FIG. 6 is an illustrative view showing a third display example of the display on the user side in the embodiment.
  • FIG. 7 is an illustrative view showing a fourth display example of the display on the user side in the embodiment.
  • FIG. 8 is an illustrative view showing a fifth display example of the display on the user side in the embodiment.
  • FIG. 9 is an illustrative view showing one example of a memory map of a storage unit (RAM) of the user terminal shown in FIG. 2.
  • FIG. 10 is an illustrative view showing one example of a memory map of a storage unit (RAM) of the server shown in FIG. 3;
  • FIG. 11 is a flow diagram showing an example of the operation of the user terminal in the embodiment.
  • FIG. 12 is an illustrative view showing a sixth display example of the display on the user side in the embodiment.
  • FIG. 13 is an illustrative view showing a seventh display example of the display on the user side in the embodiment.
  • FIG. 14 is an illustrative view showing an eighth display example of the display on the user side in the embodiment.
  • FIG. 15 is an illustrative view showing a ninth display example of the display on the user side in the embodiment.
  • FIG. 16 is an illustrative view showing a tenth display example of the display on the user side in the embodiment.
  • FIG. 17 is an illustrative view showing an eleventh display example of the display on the user side in the embodiment.
  • FIG. 18 is an illustrative view showing a twelfth display example of the display on the user side in the embodiment.
  • FIG. 19 is an illustrative view showing a thirteenth display example of the display on the user side in the embodiment.
  • FIG. 20 is a flowchart showing an example of a video processing subroutine.
  • FIG. 21 is an illustrative view showing a fourteenth display example of the display on the user side in the embodiment.
  • FIG. 22 is an illustrative view showing a fifteenth display example of the display on the user side in the embodiment.
  • FIG. 23 is a flow chart showing another example of a video processing subroutine.
  • FIG. 24 is a flow chart showing another example of a video processing subroutine.
  • FIG. 25 is a flow chart showing another example of a video processing subroutine.
  • FIG. 26 is a flow diagram showing another example of a video processing subroutine.
  • FIG. 27 is an illustrative view showing a sixteenth display example of the display on the user side in the embodiment.
  • FIG. 28 is an illustrative view showing one example of a memory map of a storage unit (RAM) of the user terminal shown in FIG. 2 in another embodiment.
  • Referring to FIG. 1, the information processing device 12u used by the user U and the information processing device 12o used by the dialogue partner O are communicably connected through a network 14 such as the Internet.
  • the user terminals 12u and 12o are used in the online interactive system 10, and therefore are called online terminals for convenience.
  • The online dialogue in this embodiment can be performed using, as an avatar, a CG character on the screen or a robot that can be operated from a remote location; it should be understood that the term also covers an "avatar conference" in which such avatars participate in the dialogue. In an avatar conference it is easy to make the avatars appear to be making eye contact with each other, but it is not always certain that the interlocutors themselves are actually attending to each other, so the embodiments are considered useful in this case as well.
  • In this embodiment, the online dialogue system 10 is used for a video conference as an example, but it should be pointed out in advance that it can also be used for simple dialogue (two or more interlocutors talking) and the like.
  • the online interaction system 10 may further include a server 16 , which is also connected to the network 14 .
  • The information processing device used by the user U and the information processing device used by the dialogue partner O are both referred to as user terminals.
  • The user terminals 12u and 12o are simply referred to as the user terminal 12 when they need not be distinguished from each other, and each has the configuration shown in FIG. 2.
  • FIG. 2 is a block diagram showing the electrical configuration of the user terminal 12.
  • user terminal 12 includes CPU 18 .
  • the CPU 18 is also called a processor or the like, and a storage section 20, a communication section 22, an input control circuit 24, a display control circuit 26, and the like are connected to the CPU 18 via the bus 17.
  • the CPU 18 is in charge of overall control of the user terminal 12.
  • The storage unit 20 functions as the main storage of the CPU 18 and includes a RAM, used as a work area and buffer area, and an auxiliary storage device, such as an SSD or HDD, that stores control programs and various data with which the CPU 18 controls the operation of each component of the user terminal 12.
  • the communication unit 22 is a communication circuit for connecting to the network 14 and communicates with other user terminals 12 and external computers (external terminals) such as the server 16 via the network 14 according to instructions from the CPU 18 .
  • An input device 28 is connected to the input control circuit 24 and a display 30 is connected to the display control circuit 26 .
  • the input device 28 is a device for receiving an input operation (user operation) by the user U or the dialogue partner O, and is, as is well known, an input operation unit such as a numeric keypad, a mouse, and a touch panel.
  • a touch panel may be provided on the display surface of the display 30 .
  • the input control circuit 24 outputs an operation signal or operation data corresponding to the operation of the input device 28, which is taken into the CPU 18.
  • Under instructions from the CPU 18, the display control circuit 26 generates display image data using the image generation data stored in the storage unit 20 and outputs the generated display image data to the display 30.
  • a camera 32 and a microphone 34 are associated with the user terminal 12 , and a video signal captured by the camera 32 is captured by the CPU 18 and stored in the storage unit 20 . Similarly, an audio signal picked up by the microphone 34 is captured by the CPU 18 and stored in the storage unit 20 .
  • the user terminal 12 is provided with a speaker 35, and the CPU 18 outputs an audio signal to an audio processing circuit (not shown), whereby audio can be output from the speaker 35 based on the audio signal.
  • the environment sensor 36 can be a sensor that acquires environmental data such as the temperature and humidity of the location where the user or the dialogue partner is present, and the biometric sensor 38 acquires biological data such as body temperature, pulse, perspiration, etc. of the user or the dialogue partner.
  • FIG. 3 is a block diagram showing the electrical configuration of the server 16.
  • server 16 includes CPU 40 .
  • the CPU 40 is also called a computer or a processor, and is connected to the storage section 42 , the communication section 44 , the input control circuit 46 , the display control circuit 48 and the like via the bus 39 .
  • the CPU 40 is in charge of overall control of the server 16.
  • The storage unit 42 functions as the main storage of the CPU 40 and includes a RAM, used as a work area and buffer area, and an auxiliary storage device, such as an SSD or HDD, that stores control programs and various data with which the CPU 40 controls the operation of each component of the server 16.
  • the communication unit 44 is a communication circuit for connecting to the network 14, and communicates with the user terminal 12 and other external computers (external terminals) via the network 14 according to instructions from the CPU 40.
  • An input device 50 is connected to the input control circuit 46 , and a display 52 is connected to the display control circuit 48 .
  • the input control circuit 46 outputs an operation signal or operation data according to the operation of the input device 50 to the CPU 40 .
  • Under instructions from the CPU 40, the display control circuit 48 generates display image data using the image generation data stored in the storage unit 42 and outputs the generated display image data to the display 52.
  • FIG. 4 shows a first display example of the display 30u on the user side.
  • an image 54 of the dialogue partner O is displayed on the display 30u of the user terminal 12u of the user U.
  • A speech balloon 56 belonging to this image 54 is also displayed, and an image 58 including the user U is displayed inside it.
  • the image 54 of the conversation partner O is an image captured by the camera 32o of the user terminal 12o and sent to the user terminal 12u via the server 16.
  • Also, the image 58 of the user U here is an image captured by the camera 32u of the user terminal 12u; it is, so to speak, an image including the user U captured from the dialogue partner's viewpoint.
  • The image 58 displayed in the balloon 56 is therefore basically an image of the user himself or herself, but more generally it is a video representation including at least one of "an image of the user captured from the conversation partner's point of view" and "an image based on the user's behavior and/or the user's recognition".
  • the following video representation is also conceivable.
  • For example, video from the user's in-camera (corresponding to the camera 32 in FIG. 2) in Zoom (trade name), video from environmental sensors such as a telephoto camera, images from multiple cameras, and reconstructed images of these can be considered.
  • Another possible image expression is a visual sensor worn by the dialogue partner, or if the dialogue partner is a robot, the image from the visual sensor mounted on the robot or the like, or a reconstructed image of them.
  • Video representations may be recorded video of the user, composite video such as CG, icons of dialogue partners, letters of names, etc., and not only 2D video but also 3D CG representation may be used.
  • "Speech balloons" can be said to be video elements that visually display attribution to the conversation partner; examples include display in contact with the dialogue partner, display in the vicinity of the dialogue partner, display on some graphic element connected to the dialogue partner, and the like.
  • Speech balloons may be presented on the dialogue screen in online dialogue, by projection in real space, on stationary or portable displays in real space, on displays worn by or mounted on the dialogue partner, or in AR (augmented reality).
  • Possible dialogue partners include a human in real space as in the embodiment, a robot, a CG agent, a human (or robot or agent) in online dialogue or a virtual space, and an avatar (of a human) in real space, a video conference, or a virtual space.
  • identification information or an ID (icon or name) of a dialogue partner can also be considered.
  • FIG. 5 shows a second display example of the display 30u on the user side.
  • the dialogue partner is a robot as described above, and the image of the robot (avatar) is displayed as the dialogue partner image 54 .
  • This robot has a display on the top of its head, which in this example is used to display a balloon 56 .
  • the user's image 58 is displayed in the balloon (display) 56 .
  • For example, a robot having a display on its head may be installed in front of the user so that the user and the robot interact with each other, in which case the head display shows the balloon 56 and the image 58 therein.
  • FIG. 6 shows a third display example of the display 30u on the user side.
  • In this example too, the dialogue partner is a robot, and the image of the robot (avatar) is displayed as the dialogue partner image 54. A balloon 56 is formed from this robot image 54, and the user's image 58 is displayed in the balloon 56.
  • FIG. 7 shows a fourth display example of the display 30u on the user side.
  • an inorganic figure (as an avatar) is displayed as the image 54 of the dialogue partner.
  • a balloon 56 is formed from this image 54 and the user's image 58 is displayed in the balloon 56 .
  • FIG. 8 shows a fifth display example of the display 30u on the user side.
  • a graphic of the above-mentioned identification information of the dialogue partner (the Chinese character "Kawa" which is part of the name) is displayed as the image 54 of the dialogue partner.
  • a balloon 56 is formed from this image 54 and the user's image 58 is displayed in the balloon 56 .
  • FIG. 9 is an illustrative view showing one example of a memory map of the storage unit (RAM) 20 of the user terminal 12 shown in FIG. 2.
  • the storage unit 20 includes a program storage area 60 and a data storage area 62.
  • Various control programs including an OS are stored in the program storage area 60 of the storage unit 20.
  • the control program includes a live information acquisition program 64a and a live information analysis/processing program 64b.
  • the live information acquisition program 64a is a program for acquiring live information of the user/dialogue partner.
  • the live information includes operation input information from the input device 50 and sensor information.
  • the sensor information is live information from various sensors such as the camera 32, microphone 34, environment sensor 36, and biosensor 38 shown in FIG.
  • the sensor information includes at least video signals from the camera 32 and audio signals from the microphone 34 .
  • the live information analysis/processing program 64b is a program for analyzing the live information thus obtained and further processing it as necessary.
  • An example of analysis is speech recognition based on the above-mentioned voice signals of the user (dialogue partner).
  • the speech recognition result is morphologically analyzed to analyze whether or not a focal word (focus word) is included in the utterance of the user (dialogue partner). "Processing" is to generate an image to be displayed as an image 58 in a balloon 56 based on the focus word.
  • For example, when the character string of a focus word is detected, characters based on that string can be superimposed on the image 58 displayed in the balloon 56.
  • In this way, it can be made to appear as if the user recognizes the content of the conversation partner's utterance, and conversely as if the conversation partner recognizes the content of the user's utterance. For example, when the focus word detected from the user's utterance is superimposed here, the user can feel that the conversation partner is listening (a feeling of being listened to), and when the focus word detected from the dialogue partner's utterance is superimposed here, the user can feel that he or she is listening to the dialogue partner (a listening feeling).
  • Another example of "processing" is generating an image associated with the speech recognition result or the focus word (for example, an image of a car when the focus word is "car").
  • By displaying an image 58 corresponding to the content of the user's (or conversation partner's) speech, the user can have a sense of properly listening to the speech of the conversation partner (a listening feeling).
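  • As a rough illustration of the "analysis" and "processing" described above, the sketch below extracts candidate focus words from a speech recognition result and turns them into overlay specifications for the image 58; the simple tokenizer, the stop-word list, the topic-word set, and the Overlay structure are illustrative assumptions, not elements defined in this description.

```python
# Minimal sketch (assumed details): extract focus words from a speech
# recognition result and build overlay specs for the balloon image 58.
from dataclasses import dataclass

STOP_WORDS = {"the", "a", "an", "is", "it", "and", "uh", "um"}   # assumed stop-word list
TOPIC_WORDS = {"okinawa", "awamori", "kokusai-dori"}             # assumed pre-registered topic words

@dataclass
class Overlay:
    text: str          # character string to superimpose on image 58
    emphasized: bool   # True when the word is a pre-registered topic word

def extract_focus_words(recognized_text: str) -> list[str]:
    """Very rough stand-in for morphological analysis: keep content-like tokens."""
    tokens = [t.strip(".,!?").lower() for t in recognized_text.split()]
    return [t for t in tokens if t and t not in STOP_WORDS]

def make_overlays(recognized_text: str) -> list[Overlay]:
    """Turn focus words into overlay specs; topic words get extra emphasis."""
    return [Overlay(text=w + "!", emphasized=(w in TOPIC_WORDS))
            for w in extract_focus_words(recognized_text)]

if __name__ == "__main__":
    print(make_overlays("walked along Kokusai-dori and drank awamori"))
```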
  • "Analysis" also includes recognizing facial expressions and actions based on the video signal from the camera, and "processing" includes generating facial expression icons, animations, and video processing effects corresponding to the recognition results.
  • By displaying the user's facial expression recognition result in a form that is linked to the conversation partner's speech balloon, the user can have the feeling that the conversation partner is noticing (gazing at, listening to) the changes in the user's emotions.
  • By displaying the facial expression recognition result of the conversation partner in the speech bubble, it becomes easier to understand the conversation partner's emotions, and the feeling that the speech balloon is a display of what is in the conversation partner's head can be strengthened.
  • When the facial expression recognition result of the user is superimposed on the user's figure and displayed in the balloon, the change in the user's emotion can be conveyed to the conversation partner (a feeling of being watched).
  • Conversely, when the facial expression recognition result of the conversation partner is superimposed and displayed in the balloon, it is possible to express the feeling that the user is aware of the change in the conversation partner's emotion (a gazing feeling).
  • When a nod is recognized as a result of the facial expression/action recognition, for example, a character string "I see", an affirmative facial expression icon, a "!" mark, an animation showing a nodding motion, a video processing effect, or a movement of the speech bubble (up and down, growing and shrinking) can be generated.
  • By displaying the user's nodding or head shaking, it is possible to convey to the conversation partner feelings such as the user's satisfaction or dissatisfaction.
  • By displaying the nodding and head shaking of the conversation partner, it becomes easier to understand the agreement or dissatisfaction of the conversation partner, and the feeling that the speech balloon is an indication of what is in the conversation partner's mind can be strengthened.
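  • One way to organize such recognition-result-to-decoration behavior is sketched below; the labels ("nod", "smile", etc.) and the decoration names are assumed for illustration only.

```python
# Sketch: map a recognition result (nod, head shake, smile, ...) to a
# decoration of the speech balloon 56 / image 58. All names are assumed.
BALLOON_DECORATIONS = {
    "nod":        {"text": "I see", "icon": "affirmative_face", "mark": "!",
                   "animation": "nod_loop", "balloon_motion": "up_down"},
    "head_shake": {"text": None, "icon": "negative_face", "mark": None,
                   "animation": "shake_loop", "balloon_motion": "left_right"},
    "smile":      {"text": None, "icon": "smile_face", "mark": None,
                   "animation": None, "balloon_motion": "grow_shrink"},
}

def decorate_balloon(recognition_result: str) -> dict:
    """Return the decoration spec for the balloon, or an empty spec if unknown."""
    return BALLOON_DECORATIONS.get(recognition_result, {})

print(decorate_balloon("nod"))
```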
  • A communication program 64c, a video generation program 64d, a display control program 64e, and the like are also set in the program storage area 60.
  • the communication program 64c is a program for communicating data with other user terminals 12, servers 16, or other external computers or other devices via the network 14.
  • the video generation program 64d is a program for generating the display screen shown in FIG. 4, etc., and is a program for generating the balloon 56 and the above-described processed video displayed therein.
  • the display control program 64e is a program for displaying the display screen of FIG. 4 and the like.
  • In the data storage area 62 of the storage unit (RAM) 20, in addition to an area 66a for storing the live information data described above and an area 66b for storing the processed information data, a timer (counter) 66c for measuring the balloon display time, a storage area 66d for the integrated data, and the like are provided.
  • The integrated data stored in the area 66d is the live information data and/or processed data from the two user terminals 12u and 12o, integrated by the server 16 and transmitted to each of the user terminals 12u and 12o for balloon display.
  • The user terminals 12u and 12o use this integrated data to display balloon display screens such as the one shown in FIG. 4 and the other display examples.
  • FIG. 10 is an illustrative view showing one example of a memory map of the storage unit (RAM) 42 of the server 16 shown in FIG.
  • storage unit 42 includes program storage area 68 and data storage area 70 .
  • Various control programs including the OS are stored in the program storage area 68 of the storage unit 42 .
  • the control program includes a communication program 72a and a received information integration program 72b.
  • the communication program 72a is a program for communicating data with the user terminal 12, other external computers or other devices via the network 14. This communication program 72a receives live information and/or processed information thereof from the user terminal 12u on the user side and the user terminal 12o on the dialogue partner side.
  • The received information integration program 72b is a program for integrating the live information and/or processed information received from the user's user terminal 12u and the dialogue partner's user terminal 12o in order to display screens such as that of FIG. 4.
  • The integrated data created according to the received information integration program 72b includes data of the image 54, the balloon 56, and the image 58 shown in FIG. 4, etc.
  • the integrated data is transmitted to the user terminal 12u on the user side and the user terminal 12o on the conversation partner side according to the communication program 72a described above.
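  • A minimal sketch of what the integration performed by the received information integration program 72b might look like is shown below; the field names and the fixed balloon position are assumptions made purely for illustration.

```python
# Sketch (assumed data layout): integrate processed data from user terminal 12u
# and dialogue partner terminal 12o into one balloon-display specification.
def integrate_received_info(data_from_12u: dict, data_from_12o: dict) -> dict:
    """Build integrated data describing image 54, balloon 56 and image 58."""
    return {
        "partner_image_54": data_from_12o.get("video_frame"),      # partner's camera video
        "balloon_56": {
            "anchor": "partner_image",       # balloon visually belongs to image 54
            "position": (0.7, 0.1),          # assumed normalized display position
        },
        "user_image_58": data_from_12u.get("processed_frame",      # processed image if any,
                                           data_from_12u.get("video_frame")),  # else raw
        "overlays": data_from_12u.get("overlays", []),             # focus words, icons, etc.
    }
```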
  • FIG. 11 is a flow diagram showing an example of the operation of the user terminal 12u of the user U in this embodiment. It should be pointed out in advance that the routine of FIG. 11 is repeatedly executed at predetermined time intervals (for example, frame cycles). If necessary, the user terminal 12o of the dialogue partner can also perform the same processing, and if there are a plurality of dialogue partners, the same processing can be executed for each dialogue partner.
  • the CPU 18 acquires live information of the user/dialogue partner according to the live information acquisition program 64a (Fig. 9).
  • the live information of the dialogue partner may be received directly from the user terminal 12o of the dialogue partner, or may be sent from the user terminal 12o of the dialogue partner to the server 16 and received from the server 16.
  • This live information data is stored in the live information data storage area 66a shown in FIG.
  • the live information data on the user side is transmitted from the communication unit 22 to the server 16 according to the communication program 64c, and stored in the server 16 in the area 74a (FIG. 10) as received information data.
  • the CPU 18 displays the image 54 (FIG. 4, etc.) of the dialogue partner on the display 30 and outputs the voice of the dialogue partner from the speaker 35 based on the live information of the dialogue partner.
  • In step S4, if no time value is set in the balloon display time timer 66c (FIG. 9), the balloon display time timer 66c is set to an initial value, for example 30 seconds.
  • In the next step S5, the timer 66c is decremented. That is, the initial value is set in the timer 66c at the beginning of the repetition of the routine of FIG. 11, but from the next repetition onwards, the time value reflecting the decrement in step S5 may be used as it is.
  • the CPU 18 performs the above-described analysis (recognition) of the live information of the user/conversation partner according to the live information analysis/processing program 64b (Fig. 9), and determines whether the recognition result is important. That is, it determines whether or not an important word is uttered in voice recognition, whether or not an important facial expression is made in image-based facial expression recognition, and whether or not an important action is shown in image-based action recognition.
  • Important words may include focus words, pre-registered topic words (topic words registered prior to the dialogue), back-channel responses, and the like.
  • a smile, an angry face, or the like can be considered as the important facial expression.
  • Nodding and shaking (negative) can be considered as important motions in the video.
  • In step S7, it is determined whether there is such important live information.
  • In this embodiment, the balloon 56 is displayed only when an important word or image is detected from the live information, so the balloon display time of T seconds is set only when "YES" is determined in step S7. That is, the balloon display time T (seconds) is added to the timer 66c.
  • In step S11, if necessary, the live information is processed according to the live information analysis/processing program 64b (FIG. 9) by any of the methods described above.
  • the processed data in step S11 is stored in the processed data storage area 66b shown in FIG.
  • the CPU 18 transmits the processed data to the server 16 according to the communication program 64c.
  • the processed data referred to here is the processed data of the user terminal 12u on the user side and the processed data of the user terminal 12o on the dialogue partner side.
  • The live information must always be sent to the server 16 as described in step S2 above, but the processed data needs to be transmitted to the server 16 only when necessary, for example when the processed data of both terminals are integrated to generate different processed data.
  • the server 16 integrates such processed data from the user terminals 12u and 12o according to the received information integration program 72b (FIG. 10), and transmits the integrated data to the user terminals 12u and 12o.
  • In step S15, it is determined whether or not the counter 66c has timed out, that is, whether or not the balloon display time T is still greater than "0 (zero)".
  • In step S17, the CPU 18 receives the integrated data transmitted from the server 16.
  • This integrated data includes video data and data indicating the display position for displaying the video data. Therefore, in step S19, the CPU 18 displays the balloon 56, the image 58, and the like at the display positions determined from the integrated data.
  • If there is superimposed/processed data from step S11, the superimposed/processed image 58 is displayed in the balloon 56.
  • the speech bubble 56 and the image 58 therein are displayed only when an important event such as an important word or important image is detected.
  • To display the balloon at all times instead, it is conceivable, for example, to set the timer time of the counter 66c to infinity, or, instead of detecting an important event in step S7, to determine "YES" whenever the image of the dialogue partner is displayed.
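  • Putting the steps of FIG. 11 together, the per-frame handling of the balloon display time could look roughly like the following sketch; the frame period and the value of T are assumed, the 30-second initial value follows step S4, and the helper functions are placeholders for the recognition, communication, and display processing described above.

```python
# Sketch of the FIG. 11 loop: the balloon 56 is shown only while the display
# timer is positive, and detecting an important event adds T seconds to it.
FRAME_PERIOD = 1 / 30          # assumed frame cycle in seconds
INITIAL_TIME = 30.0            # initial timer value (30 seconds, as in step S4)
T_SECONDS = 10.0               # assumed balloon display time added per important event

def run_frame(state: dict, live_info: dict) -> None:
    if "balloon_timer" not in state:              # step S4: initialize once
        state["balloon_timer"] = INITIAL_TIME
    state["balloon_timer"] -= FRAME_PERIOD        # step S5: decrement every frame

    if detect_important_event(live_info):         # step S7: important word/image?
        state["balloon_timer"] += T_SECONDS       # extend display time by T seconds

    if state["balloon_timer"] > 0:                # step S15: timer not expired
        integrated = receive_integrated_data()    # step S17: data from server 16
        display_balloon(integrated)               # step S19: draw balloon 56 / image 58

def detect_important_event(live_info: dict) -> bool:
    return bool(live_info.get("important"))       # placeholder for recognition logic

def receive_integrated_data() -> dict:            # placeholder for server communication
    return {}

def display_balloon(integrated: dict) -> None:    # placeholder for display control
    pass
```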
  • FIG. 12 shows a sixth display example of the display 30u on the user side.
  • FIG. 12 shows an example in which the live information is processed (image processing effect) in step S11 to display an image 58 in which only the user's face is enlarged as an image representation including the user in the balloon 56.
  • By displaying an image 58 in which only the user's face is enlarged, the user's affirmative intention toward the conversation partner's speech can be expressed.
  • FIG. 13 shows a seventh display example of the display 30u on the user side.
  • FIG. 13 shows an example of displaying an image 58 in which the user's head is tilted as an image representation including the user in the balloon 56 by performing image processing on the live information in step S11.
  • By showing the inclination of the head in the image 58 in this manner, it is possible to express the user's attitude of listening intently to the conversation partner.
  • FIG. 14 shows an eighth display example of the display 30u on the user side.
  • FIG. 14 shows an example in which the live information is image-processed in step S11, and an image 58 is displayed as an animation showing the motion of the user nodding as an image representation including the user in the balloon 56.
  • By displaying the motion of the user nodding, it is possible to indicate the user's affirmative intention toward the conversation partner's speech.
  • FIG. 15 shows a ninth display example of the display 30u on the user side.
  • FIG. 15 shows an example in which an icon expressing awareness of live information is superimposed on the balloon 56 as a video expression and displayed in step S11. By displaying the image 58 with such icons superimposed, it is possible to express the user's attitude of listening intently to the conversation partner's speech.
  • FIG. 16 shows a tenth display example of the display 30u on the user side.
  • FIG. 16 shows an example of superimposing an icon representing a motion of nodding on the live information in step S11 and displaying it in the balloon 56 as a video representation. By displaying the image 58 with such an icon superimposed, it is possible to indicate that the user has an affirmative intention to the conversation partner's speech.
  • FIG. 17 shows an eleventh display example of the display 30u on the user side.
  • FIG. 17 shows an example in which a character string "I see” expressing awareness of live information is superimposed and displayed in a balloon 56 as a video expression in step S11.
  • FIG. 18 shows a twelfth display example of the display 30u on the user side.
  • FIG. 18 shows an example in which a “!” mark expressing awareness of the live information is superimposed on the speech bubble 56 as a video expression and displayed in step S11.
  • FIG. 19 shows a thirteenth display example of the display 30u on the user side.
  • FIG. 19 shows an example in which the live information is image-processed in step S11, and the image processed such that the balloon 56 itself including the image 58 is moved vertically and horizontally is displayed.
  • When the speech bubble 56 is moved up and down, the user's affirmative intention toward the conversation partner's utterance can be indicated in the same way as by nodding.
  • When the balloon 56 is moved left and right, the user's negative intention toward the conversation partner's utterance can be indicated.
  • FIG. 20 is a flowchart showing an example of the subroutine of step S11 of FIG. 11 included in the live information analysis/processing program 64b. In FIG. 20 and in FIGS. 23 to 26 described later, as in FIG. 11, only the processing in the user's user terminal is described, but if necessary the same processing can simply be executed in the dialogue partner's user terminal as well.
  • In step S21 of FIG. 20, the CPU 18 detects, based on the live information data, how long the user's line of sight has been directed at the screen of the display 30u or at the image 54 (FIG. 4, etc.) of the dialogue partner displayed on the screen. In the next step S23, the CPU 18 detects the presence or absence of the user's speech based on the live information data.
  • In step S25, the CPU 18 determines whether or not it is necessary to express the sense of gaze or the sense of listening with the balloon image.
  • For example, if "the user has taken his or her eyes off the screen of the display 30u for the last minute" or "the user has not spoken for the last minute" is detected, the need to convey a sense of gaze or listening is reduced, so in this case "NO" is determined in step S25.
  • If "NO" is determined in step S25, the CPU 18 performs, in step S26, image processing (effects) that does not imply a feeling of gaze or listening on the image representation including the user displayed in the balloon 56.
  • For example, as shown in FIG. 21, the speech balloon 56 and the image 58 are displayed smaller and farther away (compared with the normal case shown in FIG. 4, for example) to give the impression that they are farther from the image 54 of the dialogue partner.
  • Alternatively, only the size of the balloon 56 and the image 58 may be reduced, without changing the display position of the balloon 56.
  • In step S26, it is also conceivable to add an effect such as blurring the balloon 56 and the image 58.
  • If "YES" is determined in step S25, the CPU 18 performs, in step S27, image processing (effects) that implies a sense of gaze and a sense of listening on the image representation including the user displayed in the balloon 56.
  • For example, as shown in FIG. 22, the balloon 56 and the image 58 are displayed larger and closer (compared with the normal case shown in FIG. 4, for example) to give the impression that they are closer to the image 54 of the conversation partner.
  • In step S29, an image for the speech balloon is created with such image processing applied.
  • This video data is sent to the server 16, as described above, which integrates the video data from both user terminals 12u and 12o and returns the integrated data to the user terminals 12u and 12o.
  • Alternatively, in step S27, the same balloon 56 and image 58 as in FIG. 4 may be displayed without performing video processing (effects) that particularly suggests a feeling of gaze or listening, and the effect shown in FIG. 21 may be applied only when it is not desired to imply a sense of gaze or a sense of listening.
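  • A compact sketch of the decision made in steps S25 to S27 is given below: recent gaze and speech activity determine whether a sense of gaze/listening should be implied, and a corresponding balloon effect is selected; the one-minute window follows the description above, while the numeric effect parameters are assumptions.

```python
# Sketch of the FIG. 20 subroutine (steps S21-S27): choose a balloon effect
# depending on whether a sense of gaze / listening should be implied.
def choose_balloon_effect(seconds_gazing_last_minute: float,
                          spoke_in_last_minute: bool) -> dict:
    # Step S25: if the user looked away and stayed silent for the last minute,
    # there is little need to imply gaze or listening ("NO" branch).
    imply = seconds_gazing_last_minute > 0 or spoke_in_last_minute
    if imply:
        # Step S27: draw the balloon 56 / image 58 larger and closer (FIG. 22).
        return {"scale": 1.3, "distance": "near", "blur": False}
    # Step S26: draw it smaller, farther away, possibly blurred (FIG. 21).
    return {"scale": 0.6, "distance": "far", "blur": True}

print(choose_balloon_effect(seconds_gazing_last_minute=0.0, spoke_in_last_minute=False))
```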
  • FIG. 23 is a flowchart showing another example of the subroutine of step S11 of FIG. 11 included in the live information analysis/processing program 64b. Note that only the processing on the user side will be described in FIG. 23, but the processing on the dialogue partner side is the same.
  • In step S31 of FIG. 23, the CPU 18 checks for changes to the display priority setting.
  • This display priority can be set by the user in a GUI (not shown) displayed on the display 30u of the user terminal 12u according to a priority setting program (not shown), and it indicates what is to be displayed preferentially as, or superimposed on, the image 58, for example whether to display a character string or to display an image.
  • the set priority is stored in the data storage area 62 (FIG. 9) of the user terminal 12u.
  • In step S33, the CPU 18 executes voice recognition based on the user's voice signal included in the live information. Then, in step S35, it is determined whether or not a focus word display flag (not shown), indicating that the focus word should be displayed according to the speech recognition result, is true (TRUE). Note that the focus word display flag is set, for example, in the data storage area 62 of the user terminal 12u, in the same manner as the dialogue partner speech recognition flag described later.
  • the CPU 18 extracts the focus word from the user's utterance in step S37.
  • the focus word extracted from this user utterance is stored in this data storage area 62 .
  • In step S39, the CPU 18 determines whether or not the dialogue partner speech recognition display flag (not shown) is true (TRUE). If "YES" is determined in step S39, the CPU 18 extracts the focus word from the dialogue partner's utterance in the next step S41, in the same manner as in steps S33 and S37. The focus word extracted from the dialogue partner's speech is also stored in the data storage area 62.
  • In step S43, the CPU 18 calculates a comprehensive score based on the display priority, the respective recognition accuracies, and an evaluation of the recognition content.
  • In step S45, it is determined whether or not there is an object with a score above a certain level, and in step S47 the object with the higher score is superimposed/processed so as to be displayed.
  • Table 1 shows an example of the results of operations according to the subroutine of FIG. 23.
  • In this example, the focus word of the user's utterance is detected as "Kokusai-dori", and the focus word of the dialogue partner's utterance is detected as "Awamori".
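  • The comprehensive score of step S43 and the selection of step S45 might be computed roughly as in the sketch below, combining display priority, recognition accuracy, and an evaluation of the recognized content; the multiplicative combination, the threshold, and the example values are illustrative assumptions.

```python
# Sketch of steps S43/S45: score each candidate object (e.g. a focus word from
# the user and one from the dialogue partner) and keep those above a threshold.
SCORE_THRESHOLD = 0.5   # assumed "certain level"

def comprehensive_score(priority: float, accuracy: float, content_value: float) -> float:
    """All inputs in [0, 1]; the multiplicative combination is an assumption."""
    return priority * accuracy * content_value

def select_objects(candidates: list[dict]) -> list[dict]:
    scored = [(comprehensive_score(c["priority"], c["accuracy"], c["content_value"]), c)
              for c in candidates]
    return [c for score, c in sorted(scored, key=lambda x: x[0], reverse=True)
            if score >= SCORE_THRESHOLD]

candidates = [
    {"text": "Kokusai-dori", "priority": 0.8, "accuracy": 0.9, "content_value": 0.7},
    {"text": "Awamori",      "priority": 0.8, "accuracy": 0.9, "content_value": 0.8},
]
print(select_objects(candidates))
```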
  • FIG. 24 is a flowchart showing another example of the subroutine of step S11 of FIG. 11 included in the live information analysis/processing program 64b. Note that FIG. 24 only describes the processing on the user side, but it should be noted that the processing on the dialogue partner side is the same.
  • Steps S51-S65 in FIG. 24 are the same as or similar to steps S31-S45 in FIG. 23, so overlapping descriptions are omitted here.
  • In step S65, it is determined whether or not there is an object (character string) with a score equal to or higher than a certain value, and in step S67 an image search is performed using the selected character string.
  • It is then determined whether or not the search has achieved a certain accuracy or higher; if "YES", the retrieved object (image) is superimposed/processed so as to be displayed.
  • Table 2 shows an example of the results of operations according to the subroutine of FIG. 24.
  • For example, the image displayed when the user utters "Okinawa" is a video based on the user's behavior, and when an image of Okinawa scenery is displayed in response to the user uttering "summer", for example, it can be said that a video representation including a video based on the user's recognition is displayed.
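  • The image-search branch of FIG. 24 can be pictured as in the following sketch; search_images is an assumed placeholder for whatever image retrieval backend is used, and the confidence threshold is likewise an assumption.

```python
# Sketch of FIG. 24: search for an image matching the selected character string
# and superimpose it only if the search confidence is high enough.
MIN_SEARCH_CONFIDENCE = 0.7   # assumed "certain accuracy"

def search_images(query: str) -> tuple[str | None, float]:
    """Placeholder: return (image_id, confidence) from some image search backend."""
    return ("okinawa_scenery_001", 0.85) if query else (None, 0.0)

def image_for_balloon(selected_string: str) -> str | None:
    image_id, confidence = search_images(selected_string)
    return image_id if confidence >= MIN_SEARCH_CONFIDENCE else None

print(image_for_balloon("Okinawa"))   # e.g. display a scenery image in balloon 56
```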
  • FIG. 25 is a flowchart showing another example of the subroutine of step S11 of FIG. 11 included in the live information analysis/processing program 64b. Note that FIG. 25 only describes the processing on the user side, but it should be noted that the processing on the dialogue partner side is the same.
  • Steps S81-S91 in FIG. 25 are the same as or similar to steps S31-S41 in FIG. 23, so overlapping descriptions are omitted here.
  • Next, the accuracy of the facial expression recognition and the strength (intensity) of the recognized facial expression are evaluated in order to detect the presence or absence of an object to be highlighted. Then, in step S95, it is determined whether or not there is an object to be highlighted; no highlighting is performed unless a selection can be made with sufficiently high accuracy and intensity.
  • In step S97, a facial expression icon, animation, or video processing effect corresponding to the selected facial expression, speech recognition result, and focus word is selected. In step S99, it is determined whether the accuracy of this selection is above a certain level; if "YES", the selected facial expression icon, animation, or video processing effect is superimposed/processed so that the selected object (image) is displayed.
  • Table 3 shows an example of the results of operations according to the subroutine of FIG. 25.
  • In this example, a "smile" is detected as a result of the user's facial expression recognition, and a facial expression icon selected based on this (or the icon together with the user's image) is displayed in the conversation partner's balloon. Furthermore, "know" is detected as a result of the user's facial expression recognition and, based on this, a selected image processing effect or animation is displayed in the conversation partner's balloon. Then, "ecstatic" is detected as a result of the conversation partner's facial expression recognition and, based on this, a selected image of a man drinking awamori is displayed in the conversation partner's speech balloon.
  • FIG. 26 is a flowchart showing another example of the subroutine of step S11 of FIG. 11 included in the live information analysis/processing program 64b. Note that FIG. 26 only describes the processing on the user side, but it should be noted that the processing on the dialogue partner side is the same.
  • In step S101 of FIG. 26, the CPU 18 detects a positive action, such as a nod by the user, based on the video signal of the camera 32u (FIG. 2), which is live information. Similarly, in step S113, a negative action, such as a shake of the user's head, is detected based on the video signal from the camera 32u.
  • In step S115, the CPU 18 detects a positive utterance, such as a back-channel response by the user, based on the audio signal from the microphone 34u (FIG. 2), which is live information. Similarly, in step S117, a negative utterance, such as a groan by the user, is detected based on the audio signal from the microphone 34u.
  • In step S119, the CPU 18 calculates a comprehensive score based on the display priority, the respective recognition accuracies, and an evaluation of the recognition content, in the same manner as in step S43 (FIG. 23).
  • In step S121, it is determined whether there is an expression with a score above a certain level. Then, in step S123, image processing (processing of the balloon) is executed for the expression with the higher score, and the result of the image processing is superimposed/processed so as to be displayed.
  • Table 4 shows an example of the results of operations according to the subroutine of FIG. 26.
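  • Finally, the logic of FIG. 26, which turns detected positive/negative actions and utterances into balloon motion, could be summarized roughly as follows; the detection inputs are placeholders for the video/audio recognition described above.

```python
# Sketch of FIG. 26: nods / back-channel utterances move the balloon up and down,
# head shakes / groans move it left and right.
def balloon_motion(positive_action: bool, negative_action: bool,
                   positive_utterance: bool, negative_utterance: bool) -> str | None:
    positive = positive_action or positive_utterance   # steps S101 / S115
    negative = negative_action or negative_utterance   # steps S113 / S117
    if positive and not negative:
        return "up_down"        # affirmative intention, like nodding (FIG. 19)
    if negative and not positive:
        return "left_right"     # negative intention
    return None                 # ambiguous or nothing detected: leave the balloon still

print(balloon_motion(True, False, False, False))   # -> "up_down"
```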
  • FIG. 27 shows a sixteenth display example of the display 30u on the user side.
  • In this example, images 54 of two dialogue partner avatars (speech robots) are displayed, and a balloon 56 belonging to one of them is displayed.
  • an image 80 of the user is also displayed on the screen.
  • This avatar is, for example, a talking robot having an appropriate number of degrees of freedom in the neck, head, arms, and the like.
  • the user's avatar may be a CG character.
  • The user can interact with the two interaction partner avatars, and in FIG. 27 an image 58 of the user's avatar is displayed in a balloon 56 belonging to the interaction partner image 54 of one of them.
  • Alternatively, the user's avatar may attend the dialogue of the two dialogue partners as an observer without participating in the dialogue itself.
  • a balloon 82 belonging to the user avatar may be formed in the vicinity of the user's avatar image 80, and an image 84 of the dialogue partner may be displayed therein.
  • the balloon 82 can express that the user's head is filled with the avatar image of the conversation partner when the conversation partner is speaking, and can promote the user's listening attitude.
  • In this case, as shown in FIG. 28, an avatar control program 64f is set in advance in the user terminal 12 so that the avatar (the dialogue partner's avatar and/or the user's avatar) can be controlled.
  • In the embodiments described above, the CPU 18 basically executes the processing of each step of these flowcharts, but part of the processing may instead be performed elsewhere, for example by the server 16.
  • In the embodiments, the server 16 integrates the live information and/or processed information of each user terminal in order to display images such as the balloons 56.
  • Alternatively, each user terminal may be provided with the function of the server 16 so that the user terminals directly exchange live information and generate and synthesize images such as the balloons 56 themselves.
  • REFERENCE SIGNS LIST: 10: online dialogue system; 12, 12u, 12o: user terminal; 16: server; 30: display; 32: camera; 34: microphone; 54: dialogue partner's image; 56: balloon; 58: user's image

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The problem addressed by the present invention is to make it possible to express an interlocutor's intentions in an online conversation. The solution according to the invention is an online conversation system comprising a user terminal and a server. A user and an interlocutor take part in an online conversation over a network using a camera and a microphone of the user terminal. The interlocutor's video is displayed on a screen, an associated speech bubble is displayed in the video, and the user's video is displayed inside the speech bubble. This video gives the impression that the interlocutor is listening to the user's speech and paying attention to the user's actions.
PCT/JP2022/040653 2021-11-08 2022-10-31 Terminal en ligne et programme WO2023080105A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2021181854 2021-11-08
JP2021-181854 2021-11-08

Publications (1)

Publication Number Publication Date
WO2023080105A1 true WO2023080105A1 (fr) 2023-05-11

Family

ID=86241144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2022/040653 WO2023080105A1 (fr) 2021-11-08 2022-10-31 Terminal en ligne et programme

Country Status (1)

Country Link
WO (1) WO2023080105A1 (fr)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10200873A (ja) * 1997-01-16 1998-07-31 Sharp Corp テレビ電話装置 (Videophone device)
US20210185276A1 (en) * 2017-09-11 2021-06-17 Michael H. Peters Architecture for scalable video conference management



Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22889925

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023558024

Country of ref document: JP