WO2021257868A1 - Video chat with spatial interaction and eye contact recognition - Google Patents

Video chat with spatial interaction and eye contact recognition

Info

Publication number
WO2021257868A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
avatar
users
virtual
video chat
Prior art date
Application number
PCT/US2021/037887
Other languages
French (fr)
Inventor
Tyler Alexander WELCH
Original Assignee
Meet I2I, Inc.
Priority date
Filing date
Publication date
Application filed by Meet I2I, Inc.
Publication of WO2021257868A1

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00: Television systems
    • H04N7/14: Systems for two-way working
    • H04N7/15: Conference systems
    • H04N7/157: Conference systems defining a virtual conference space and using avatars or agents
    • H04N7/155: Conference systems involving storage of or access to video conference sessions

Definitions

  • the video chat service provides a virtual spatial environment including avatars associated with the users.
  • the virtual spatial environment may enable users to attend virtual events or perform virtual activities with users they communicate with via webcam.
  • the video chat service may provide overlays facilitating more realistic communication via video, including features that link actions performed by the users over video chat with actions performed by their virtual avatars.
  • Virtual avatars may perform actions performed by people in reality, including playing games, moving their bodies, focusing attention on one another, and attending events. Users may control the actions of their avatars using input devices, such as keyboards and mice (e.g., by pressing shortcut keys or hotkeys, typing commands into a terminal, or clicking on avatars or objects). They may also perform gestures which are picked up via webcam and translated into avatar actions (e.g., waving arms on camera may cause an avatar to perform jumping jacks).
  • Such innovations, not available on many web chat programs, broaden the range of virtual activities users may perform when located remotely.
  • the video chat service includes eye contact enhancement features to make webcam communication more realistic.
  • the service may move a video chat screen to an eye contact region, enabling users of the service to look directly at one another as if they were communicating in person.
  • the video chat service may orient video chat screens on the periphery of the display to relate them to where the users’ avatars are located within the virtual environment. This feature may map to an in-person communication with multiple users located at different points in space and enable video chat users to direct gazes toward one another in a more realistic way than in currently-used video chat programs.
  • A method for enhancing online video communication for a plurality of users comprises (a) providing (1) a first instance of a video chat client application for a first user of the plurality of users, (2) a second instance of the video chat client application for a second user of the plurality of users, and (3) a virtual spatial environment for the first user and the second user.
  • the first user has a first avatar in the virtual spatial environment and the second user has a second avatar in the virtual spatial environment;
  • the method further comprises (b) obtaining one or more features associated with virtual spatial information relating to the first avatar and the second avatar in the virtual spatial environment; obtaining one or more features indicative of audiovisual information from the first instance of the video chat client application and from the second instance of the video chat client application; and (c) modifying at least one output of the first instance of the video chat client application, of the second instance of the video chat client application, a state of the first avatar, a state of the second avatar, or a combination thereof, responsive to the one or more features indicative of virtual spatial information, the one or more features indicative of audiovisual information, or a combination thereof.
  • the method further comprises modifying a feature of the virtual spatial environment.
  • (c) comprises modifying a picture quality of the video chat.
  • the picture quality is a resolution, a window size, a color scheme, a brightness, a contrast, a transparency, or a clarity.
  • the resolution is modified according to a distance between the first avatar and the second avatar.
  • the distance is an adjustable value based on a preference of the user, a number of participants, device capabilities, and other virtual features.
  • the resolution decreases linearly.
  • the feature is a gaze direction from the first video chat client instance and wherein (c) comprises modifying the orientation of the second avatar.
  • (c) comprises modifying an audio quality.
  • modifying the audio quality comprises modifying a loudness, a volume, an accuracy, a fidelity, a pitch, or a tempo.
  • modifying the volume comprises varying the volume from a highest volume to a lowest volume.
  • modifying the audio quality is performed using post processing of live stream signals of the online video communication.
  • (c) comprises modifying a location of a chat window.
  • modifying the location of the chat window comprises making the chat window more prominent in a view of the first user and in a view of the second user.
  • making the chat window more prominent comprises moving the chat window to an eye contact region of a screen.
  • the location is based on a location of a camera associated with the screen.
  • the location depends on an orientation of the display.
  • the location depends on a frequency of interaction between the first avatar and the second avatar.
  • the location depends on a focus on a feed.
  • the modifying the audio quality comprises using post processing of live stream signals.
  • the method further comprises, responsive to receiving a click on the first avatar, reversing the modification of the output of the first video chat interface or the state of the first avatar.
  • the method further comprises, upon receiving an action on the second avatar, initiating a private conversation between the first user and the second user.
  • the virtual spatial environment provides access to third party software.
  • the third party software is word processing software.
  • the virtual spatial environment includes gaming objects.
  • the virtual spatial environment is a game board.
  • a feature of the one or more features indicative of virtual spatial information is a virtual distance between the first avatar and the second avatar.
  • the first user or the second user may toggle a high-powered mode to prevent modification of a volume of the first video chat client instance or of the second video chat client instance.
  • Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.
  • Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto.
  • the computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
  • FIG. 1A shows a view of a user’s screen during a video chat interaction.
  • FIG. 1B shows an example AV feed associated with one user that illustrates an example lowest (e.g., 0%) level of quality.
  • FIG. 1C shows an example AV feed associated with another user that illustrates an example degraded (e.g., somewhere between 100% and 0%) level of quality.
  • FIG. 1D shows an example AV feed associated with another user that illustrates an example highest (e.g., 100%) level of quality.
  • FIG. 2 shows a view of a user’s screen during a video chat interaction that further illustrates how proximity between avatars may affect the quality of presentation of AV feeds of users associated with the avatars.
  • FIG. 3 shows a view of a screen with three focus feeds in an eye contact region near the top of the screen.
  • FIG. 4 illustrates additional enhancements to promote conveying eye contact and reduce unease in video chat communications.
  • FIG. 5 illustrates an algorithm to make audio degrade or decrease in volume in a manner that replicates real-world conditions.
  • FIG. 6 illustrates a scenario by which multiple users may have private conversations even while viewing a large public event, according to an embodiment.
  • FIG. 7 illustrates content collaboration using the system, in an embodiment.
  • FIG. 8 illustrates a scenario wherein users may play an online game together, in various embodiments.
  • FIG. 9 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
  • the system disclosed provides an enhanced user experience with respect to online chat applications, adding features to more closely simulate in-person interactions.
  • the system supplies a chat interface combined with a virtual spatial environment, with user experience additions to improve mutual eye contact between video chat participants.
  • the disclosed system may enable users to interact in many kinds of virtual environments, including online conferences, concerts, events, and games.
  • the system pairs a user of a video chat program with a virtual avatar.
  • the actions of these virtual avatars may affect virtual communications between users. Users may manipulate the motions and actions of their avatars using input devices or using gestures. Additionally, users may program avatars to direct their motions and actions automatically. For example, as one virtual avatar approaches another, the corresponding video chats between corresponding users associated with the avatars may be enhanced. For example, video chat volume may increase or picture quality may sharpen. The converse may be true when users’ avatars move away from one another.
  • the eye contact improvements provide more realistic chat experiences for users by providing visual eye contact regions to users, guiding them as to where to focus their eyes in order to maintain proper eye contact with other users. Users may choose to focus on video feeds or particular users in order to have private conversations with them. Eye contact regions may be configured with respect to the location of the user’s webcam or with respect to data collected from user interaction with other users.
  • each user (i.e., a participant in a video chat) is associated with a virtual character (referred to herein as an avatar) in the virtual spatial environment.
  • Proximity between avatars in the virtual spatial environment may affect whether audiovisual (AV) feeds associated with certain users are presented. For example, when an avatar moves towards and/or is within a threshold proximity to other avatars in the virtual spatial environment, the respective users can see and/or hear each other through presentation of live AV feeds. If the avatars move apart and/or are outside the threshold proximity, the AV feeds may not be presented.
  • proximity between avatars can affect how AV feeds are presented.
  • a quality level of the audio and/or video associated with user AV feeds can be adjusted based on proximity between avatars.
  • quality level may include, for example, a transparency level, a resolution, a size of window, a color scheme, brightness, contrast, clarity, etc.
  • quality level may include, for example, volume, accuracy, fidelity, pitch, tempo, etc. Quality levels may decrease in such a manner to mimic decreases in quality with respect to distances in real-world communication.
  • a user may be presenting during an event and may wish to be heard by all attending users, not just those within a predetermined distance.
  • the presenting user (or an administrator) may choose to activate a high-powered mode in order to be heard clearly by all attending users.
  • the quality level of the audio and/or video in associated AV feeds may improve.
  • the video clarity and audio volume may improve to simulate the increased clarity people experience in real life when they get closer together.
  • the quality of the video displayed may range from a best possible quality given network and device constraints (i.e., “100%”) when the avatars are within a specified range, and then as the distance between the avatars increases, the quality of the video may decrease to a lower quality level.
  • the change in quality may vary linearly or otherwise based on distance between avatars. Audio may behave similarly, with volume varying, for example, from a highest volume to a lowest volume.
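As a concrete illustration of the distance-to-quality mapping described above, the sketch below (TypeScript) interpolates a single quality factor linearly between a full-quality radius and a cutoff radius and then derives opacity, volume, and a resolution scale from it. The parameter names such as `fullQualityRadius` and `cutoffRadius` are illustrative assumptions, not values from the disclosure.

```typescript
// Hypothetical sketch: map the distance between two avatars to a 0..1 quality
// factor. Inside fullQualityRadius the feed is presented at 100%; beyond
// cutoffRadius it drops to the floor; in between it falls off linearly.
interface QualityOptions {
  fullQualityRadius: number; // distance at which quality is still 100%
  cutoffRadius: number;      // distance at which quality reaches its floor
  floor: number;             // lowest quality level (e.g., 0 for fully degraded)
}

function qualityForDistance(distance: number, opts: QualityOptions): number {
  const { fullQualityRadius, cutoffRadius, floor } = opts;
  if (distance <= fullQualityRadius) return 1;
  if (distance >= cutoffRadius) return floor;
  const t = (distance - fullQualityRadius) / (cutoffRadius - fullQualityRadius);
  return 1 - t * (1 - floor); // linear interpolation from 1 down to the floor
}

// The same factor can drive several presentation properties at once.
function presentationFor(quality: number) {
  return {
    opacity: 0.25 + 0.75 * quality,          // semi-transparent when far away
    volume: quality,                          // loudest at 1, silent at 0
    resolutionScale: Math.max(0.25, quality), // never below a quarter resolution
  };
}
```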
  • the system may place chat windows at various locations within the virtual spatial environment, responsive to environmental features or user actions.
  • chat windows may be oriented such that a presenter is situated above an audience. Users may choose to focus on other particular users within the environment, moving their chat windows to central locations to focus on them.
  • AV feeds from users associated with nearby avatars may be presented as semi-transparent thumbnails in a peripheral view area, which is located in a corner of the computer screen (See e.g., FIG. 1A and 2).
  • the video feed of each user may be repositioned to be in a more prominent location in the respective views of each user. This more prominent location is referred to herein as the eye-contact region.
  • an eye-contact region of a display screen may be based on a location of an associated camera (e.g., web cam) relative to the display screen.
  • the eye-contact region may be centrally located at or near the top of the screen (e.g., as depicted in FIG. 2).
  • the location of the eye-contact region may vary based on the orientation of the display relative to the device.
  • a display may transition based on an orientation of a tablet device.
  • the eye-contact region may also transition to remain closest to an associated forward-facing camera of the tablet device to retain natural eye-contact between users with established mutual focus.
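A minimal sketch of how the eye-contact region might be positioned relative to the camera, assuming the client knows (or guesses) which screen edge the forward-facing camera sits on. The `Edge` type, the one-third depth, and the orientation mapping are illustrative assumptions, not details from the disclosure.

```typescript
// Hypothetical sketch: place the eye-contact region along the screen edge
// nearest the device's forward-facing camera, so a user looking at a focused
// feed is also looking toward the camera.
type Edge = 'top' | 'bottom' | 'left' | 'right';

interface Rect { x: number; y: number; width: number; height: number }

function eyeContactRegion(screenW: number, screenH: number, cameraEdge: Edge): Rect {
  const depth = Math.round(Math.min(screenW, screenH) / 3); // e.g., the top one-third
  const regions: Record<Edge, Rect> = {
    top:    { x: 0, y: 0, width: screenW, height: depth },
    bottom: { x: 0, y: screenH - depth, width: screenW, height: depth },
    left:   { x: 0, y: 0, width: depth, height: screenH },
    right:  { x: screenW - depth, y: 0, width: depth, height: screenH },
  };
  return regions[cameraEdge];
}

// When a tablet rotates, recompute which edge the camera sits on and move the
// region so established mutual focus stays closest to the camera.
function edgeForOrientation(landscape: boolean): Edge {
  return landscape ? 'left' : 'top'; // assumed camera positions for this device
}
```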
  • both users may move attention to the eye-contact location and may see a visual confirmation that eye-contact is established, just like in real-life.
  • the user in this interface may lift their eyes to view the other video feed directly below their camera and may make “eye-contact” with the other user.
  • Either user may convey removal of eye-contact, or it may timeout and then the feeds will return to peripheral view (through a variety of methods outlined below).
  • establishing eye-contact occurs through a call-and-response of attention between two users.
  • “Mutual Focus” relates to both sides of the user-to-user connection feeds going to the eye-contact/camera location and becoming larger, due to conveyed attention from one or both users.
  • Eye Contact Region refers to a prominent area of the display conducive to eye contact between users (e.g., the top one-third of the screen) reserved for feeds in which eye-contact is established between both users.
  • Peripheral View Region refers to a lower region of the screen presenting the feeds to highlight the largest change in attention from the central eye-contact view. Feeds in this region may have a lower quality (e.g., semi-transparent thumbnails, etc.) unless interacted with by the mouse.
  • “Feed Clarity” refers to the extent to which audio and video quality may be degraded depending on distance between avatars, virtual environmental conditions, or other means.
  • Feed focus may be established between users in a number of different ways.
  • a user may convey attention to another user by way of a mouse (or other cursor) hover or a click of another user’s feed or avatar, by responding verbally to another, and/or by automatic detection of the user’s eye focus with video processing software.
  • the user’s avatar may face the other user’s avatar, and the video screens of the two users may move to locations that facilitate better eye contact between them.
  • if associated avatars are in close proximity, feed focus may occur automatically.
  • Focusing on a feed may cause it to be non-transparent and may move it from a peripheral region into the eye-contact region.
  • both users may have to convey attention in order for feed focus to be established and their respective feeds to move to the eye-contact region or cause non-transparency.
  • Feeds may remain semi-transparent until users hover or click to convey attention or speak in response.
  • the thumbnail may become non-transparent on both sides of the connection, conveying to both sides where attention is being directed. On mobile platforms, this may be established by the user’s finger on the screen.
  • a mouse/cursor may generally declare or establish where a user’s attention is and may be represented by an icon in the display screen.
  • Feed focus may be disestablished between users in a number of different ways as well.
  • either user may click on the other user’s video feed and return it to the peripheral view.
  • Lack of focus: Heightened levels of mouse movement and/or avatar movement by one user may indicate a lack of focus on the other user. Similarly, a sustained silence between users may indicate a lack of focus between users. In such cases, this detected lack of focus may disestablish feed focus and may return respective AV feeds to a peripheral view. In some embodiments, the system may automatically determine where user eye focus is with video processing software and disestablish feed focus accordingly.
  • Timeout: Automatic focus timeout may occur. With a lack of focus on a feed, the feed may auto-timeout after some period of time (e.g., 4-6 seconds). In such cases, an AV feed may slowly transition to a lower quality level (e.g., approximately 20-25% transparency) and then return to a peripheral view. During this time, the user may still hover the mouse over the video feed or click on the feed to stop the timeout process. For example, a quick hover may reset the timeout. As another example, a click may extend that timeout.
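The focus timeout described above could be implemented along the following lines; this is a hedged sketch in which the durations, the fade level, and the `FeedFocusTimeout` class are illustrative rather than prescribed by the disclosure.

```typescript
// Hypothetical sketch of the focus-timeout behavior: with no attention on a
// feed, a timer fires after a few seconds, fades the feed, and returns it to
// the peripheral view; a hover resets the timer and a click extends it.
class FeedFocusTimeout {
  private timer?: ReturnType<typeof setTimeout>;

  constructor(
    private feed: { setOpacity(v: number): void; moveToPeripheral(): void },
    private timeoutMs = 5000,         // e.g., 4-6 seconds of inattention
    private extendedTimeoutMs = 15000 // a click keeps the feed focused longer
  ) {}

  start(ms = this.timeoutMs): void {
    this.cancel();
    this.timer = setTimeout(() => {
      this.feed.setOpacity(0.8);    // ~20-25% transparency while timing out
      this.feed.moveToPeripheral(); // then return to the peripheral view
    }, ms);
  }

  onHover(): void { this.start(); }                       // a quick hover resets
  onClick(): void { this.start(this.extendedTimeoutMs); } // a click extends
  cancel(): void { if (this.timer) clearTimeout(this.timer); }
}
```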
  • more than one focused feed may be placed in the eye-contact region of the display screen.
  • FIG. 3 shows a view of a screen with three focus feeds in an eye contact region near the top of the screen. As depicted in FIG. 3, additional focused feeds may be placed to the left and right of a center feed on the screen.
  • feed windows may be reduced to accommodate the multiple feeds in the eye contact region.
  • feeds may be reduced in width and/or height to only (or primarily) contain the face of each user.
  • Video processing software may be utilized to detect a region of a feed containing a user’s face and to adjust the feed to, for example, digitally zoom in on a user’s face, for better facial focus and eye contact.
  • the zoomed in window may dynamically track the user’s face as the user’s face moves in the space of the full video feed.
  • Second and third established eye-contact feeds may be left-right paired, so guests are looking towards one another.
  • users may have two or more locations for eye-contact.
  • the most active or focused users at any given moment will be put into the eye-contact region, which may be controlled by the attention-algorithm, matching users who are maintaining the most attention with one another at the present moment.
  • a maximum number of feeds in the eye-contact region may be established through testing and/or may be adjustable based on user preferences.
  • the attention-algorithm may take as input how many users are nearby, how close a user gets to another, frequencies of communication between users, indications of attention going elsewhere, or other attention indicators.
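One way such an attention algorithm could be sketched is shown below; the specific weights, the `AttentionSignals` fields, and the default maximum of three feeds are assumptions for illustration only.

```typescript
// Hypothetical sketch of an attention score used to pick which feeds occupy
// the eye-contact region. The disclosure names the kinds of inputs (proximity,
// interaction frequency, signs of attention going elsewhere), not a formula.
interface AttentionSignals {
  distance: number;            // virtual distance between the two avatars
  recentMessages: number;      // frequency of communication in a recent window
  secondsSinceLastFocus: number;
  attentionElsewhere: boolean; // e.g., heavy mouse movement away from the feed
}

function attentionScore(s: AttentionSignals): number {
  let score = 0;
  score += 1 / (1 + s.distance);               // closer avatars score higher
  score += Math.min(s.recentMessages, 10) * 0.2;
  score -= Math.min(s.secondsSinceLastFocus, 60) * 0.01;
  if (s.attentionElsewhere) score -= 1;
  return score;
}

// Keep only the top-scoring users in the eye-contact region, up to a maximum
// that may be tuned through testing or exposed as a user preference.
function selectEyeContactFeeds(users: Map<string, AttentionSignals>, maxFeeds = 3): string[] {
  return [...users.entries()]
    .sort((a, b) => attentionScore(b[1]) - attentionScore(a[1]))
    .slice(0, maxFeeds)
    .map(([id]) => id);
}
```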
  • a user may be talking to a group of users but may wish to provide eye contact to each user in turn, as may occur during a real-world conversation.
  • a user may view the other users providing eye contact to him or her through a semi-transparent eye contact location viewable on the user’s chat window. If the user wishes to convey eye contact to a user out of the group, the user may focus on an eye contact location associated with that user. The focusing may show a non-transparent overlay on the speaking user’s interface as well as convey to the listening user that the speaking user is providing eye contact.
  • Focusing on another user may affect how an avatar is presented in the virtual environment. For example, when establishing eye contact with another user, the head of an avatar of one user may turn to face the avatar of the other user, to help convey where focus is even for other users not involved in the user-to-user connection. If the user is not focused on another, the avatar’s head may be pointed in the direction of the mouse cursor (potentially represented as a hand or eye-ball), to help convey attention.
  • When a user is not involved with the video chat application in full-screen mode, the avatar may show, through mini-animations, that the user is distracted, for example by looking at a phone or staring at the sky. Similarly, if the user selects their own feed, the avatar may look at a mirror. If the user is looking at the avatar’s inventory, the avatar may look inside a bag.
  • FIGS. 1A-1D and 2 illustrate certain aspects of an example spatial interaction feature.
  • FIG. 1A shows a view of a user’s screen during a video chat interaction.
  • various audiovisual (AV) feeds associated with users are presented along with a view of a virtual spatial environment (in this example, a 2D environment) that is occupied by various avatars.
  • An avatar may be associated with a user participating in the video chat.
  • an AV feed of a primary speaker is depicted in an eye contact region (described in more detail later) at the top of the screen, AV feeds of other users are depicted in a peripheral region in the lower left, and an AV feed of the user viewing the screen is depicted in the lower right.
  • FIG. 1A shows an example arrangement of elements and is not to be construed as limiting.
  • FIG. 1B shows an example AV feed associated with one user that illustrates an example lowest (e.g., 0%) level of quality.
  • the AV feed at this lowest level of quality may be semi-transparent and may have a lowest resolution setting.
  • FIG. 1C shows an example AV feed associated with another user that illustrates an example degraded (e.g., somewhere between 100% and 0%) level of quality.
  • the AV feed may still be semi-transparent and may have a degraded resolution, but not as low as the AV feed of FIG. 1B.
  • FIG. 1D shows an example AV feed associated with another user that illustrates an example highest (e.g., 100%) level of quality.
  • the AV feed is non-transparent (i.e., opaque) and video is depicted at a highest resolution, given network and/or device constraints.
  • FIG. 2 shows a view of a user’s screen during a video chat interaction that further illustrates how proximity between avatars may affect the quality of presentation of AV feeds of users associated with the avatars.
  • an avatar associated with a user viewing the screen (the “viewing user”) is presented in the center (i.e., the avatar titled “Tyler”).
  • the 2D region surrounding the avatar is the virtual spatial environment.
  • the thumbnail and active AV feed windows overlay the view of the virtual environment and include live video feeds from other users participating in a video chat with the viewing user.
  • the active feeds are located in the eye-contact region of the screen. Depending on the number of users one is currently actively interacting with, there can be multiple, one, or no active feeds in that region.
  • the other avatars are associated with other users.
  • Avatars within a certain radius represented by the dashed line may be associated with AV feeds that are presented with a highest (i.e., 100%) quality level (e.g., non-transparent, highest resolution, highest volume, etc.) while avatars outside of the radius may be associated with AV feeds that are presented with a degraded quality (e.g., semi-transparent, lower resolution, lower volume, etc.).
  • the quality level may be adjusted using post processing of live stream signals (e.g., using a filter) at a local device associated with a viewing user.
  • a post process can be applied to a received video signal to adjust a resolution and/or opacity of the video presented based on the received signal.
  • quality level may be adjusted by adjusting the live stream signals (e.g., by adjusting bitrate, codec, etc.). The quality may decrease to encourage users to move their avatars closer to avatars of other users they wish to see and hear more clearly. In some situations (e.g., an online board meeting or talk), such quality degradation may not be implemented, as it may cause distractions for participants who need to see and hear a presenter at all times.
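The receiver-side post-processing approach might look like the following sketch, which leaves the incoming stream untouched and applies opacity and a blur filter to the local video element. Using blur as a stand-in for reduced resolution is an assumption; an implementation could instead request a lower bitrate or downscale the decoded frames.

```typescript
// Minimal sketch of local post-processing: the degradation is applied at the
// viewing user's device via CSS, so the underlying stream is unchanged.
function applyLocalDegradation(video: HTMLVideoElement, quality: number): void {
  const clamped = Math.max(0, Math.min(1, quality));
  video.style.opacity = String(0.25 + 0.75 * clamped); // semi-transparent when far away
  const blurPx = Math.round((1 - clamped) * 6);        // crisper as avatars approach
  video.style.filter = blurPx > 0 ? `blur(${blurPx}px)` : 'none';
}

// Audio can be handled the same way on the receiving side.
function applyLocalVolume(audio: HTMLMediaElement, quality: number): void {
  audio.volume = Math.max(0, Math.min(1, quality));
}
```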
  • the specific rate at which video and/or audio quality decrease may be fine-tuned with additional testing.
  • the radius of visibility (e.g., the dotted line in FIG. 2) may be an adjustable value based on a preference of the user, a number of participants, device capabilities, and other virtual features.
  • a virtual object in the virtual environment may also affect the quality level of AV feeds. For example, virtual obstacles such as walls, shrubs, and other objects may affect the quality level of associated AV feeds if they come between avatars in the virtual environment.
  • for example, if a user’s avatar is inside a virtual house, the AV feed associated with that user may be degraded (from the perspective of a viewing user) since the virtual walls of the virtual house are between their respective avatars.
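A simple way to model such occlusion is a 2D line-of-sight test between the two avatars against the wall segments of the environment, as in the hedged sketch below; the flat 50% penalty and the `Wall` representation are illustrative assumptions.

```typescript
// Hypothetical sketch of an occlusion check: if the straight line between two
// avatars crosses a wall segment, the feed quality is reduced by a penalty.
interface Point { x: number; y: number }
interface Wall { a: Point; b: Point }

function segmentsIntersect(p1: Point, p2: Point, p3: Point, p4: Point): boolean {
  const cross = (o: Point, a: Point, b: Point) =>
    (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
  const d1 = cross(p3, p4, p1);
  const d2 = cross(p3, p4, p2);
  const d3 = cross(p1, p2, p3);
  const d4 = cross(p1, p2, p4);
  return ((d1 > 0) !== (d2 > 0)) && ((d3 > 0) !== (d4 > 0));
}

function qualityWithOcclusion(base: number, from: Point, to: Point, walls: Wall[]): number {
  const blocked = walls.some(w => segmentsIntersect(from, to, w.a, w.b));
  return blocked ? base * 0.5 : base; // degrade when a wall is between the avatars
}
```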
  • FIG. 6 illustrates a scenario by which multiple users may have private conversations even while viewing a large public event, according to an embodiment.
  • the system may enable private conversations to occur by fading out or de-emphasizing background objects and/or users, aside from the plurality of users engaged in the private conversation.
  • a user may enter into a private conversation with another user by clicking on the other user’s avatar. This action may shrink or remove a communication feed associated with the event and increase in size a chat module with the clicked user. By clicking oneself, the user may exit the private conversation. This may remove the content associated with the other user and reproduce or enlarge the content associated with the event.
  • FIG. 7 illustrates content collaboration using the system, in an embodiment.
  • Users of the system may share their screens when working on a web or application-based project.
  • the project may be presented in a window which may be shared in a small or large size.
  • Users’ avatars and chat windows may have transparencies applied to them, so they are partially visible when working on the project.
  • FIG. 8 illustrates a scenario wherein users may play an online game together, in various embodiments.
  • the virtual environment may itself be a board game.
  • the virtual environment may have interactive board game pieces as objects within it.
  • the virtual environment may include a virtual card table object or chessboard at particular locations within the environment.
  • FIG. 4 illustrates additional enhancements to promote conveying eye contact and reduce unease in video chat communications.
  • FIG. 4 illustrates an interface for grouping onscreen users into central and peripheral users.
  • Central users may be actively focused on or engaged in conversation.
  • Their video chat windows may be displayed prominently in a central user location (e.g., the center of the screen).
  • Peripheral users may be not actively focused on or engaged with.
  • Views of their video chat windows may be placed along the perimeter of the screen or furthest away from the chat windows of the central users within the user interface (e.g., at the outside corners, on the bottom right of the screen, or on the bottom left of the screen).
  • video quality and audio quality for these peripheral users may be degraded (e.g., their video chats may be smaller or partially transparent, and audio volume may be softer).
  • Clicking or otherwise focusing on a peripheral user may promote the peripheral user to being a central user. This may move the peripheral user’s chat window to the center of the screen and increase the audio and video quality of the video chat.
  • Examples are described in the context of 2D display screens (e.g., associated with tablet computers or mobile phones).
  • the introduced techniques can similarly be applied to augmented reality (AR) and/or virtual reality (VR) headset devices. Further, such devices may be configured to capture a user’s face with a wide-angle fisheye lens and video.
  • Faces captured via video feeds may be mapped onto the avatars in the virtual environment.
  • audio may be broadcast predominantly in an avatar’s oriented direction.
  • Additional cameras may provide additional user views for eye-contact and non-eye-contact regions.
  • Video processing solutions may be used to also detect motion in other parts of a user’s body (e.g., arms) and may be used to move corresponding portions of an associated avatar. In other words, when a user moves their arms, their avatar may move its arms.
  • Shared content such as browser tabs may be placed in a portion of the display screen to facilitate collaboration or shared viewing.
  • the location of the shared content may be in areas not occupied by the eye-contact and/or peripheral view regions (e.g., the top left and right corners of the screens in FIGS. 1A, 2, or 3).
  • FIG. 5 illustrates an algorithm to make audio degrade or decrease in volume in a manner that replicates real-world conditions. As shown in the plots, both the audio and video degradation proceed slowly until a “knee” point in both curves, and then more rapidly following the “knee.”
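A piecewise curve with a knee, as described for FIG. 5, could be sketched as follows; the knee distance, maximum distance, and quality at the knee are illustrative parameters rather than values taken from the figure.

```typescript
// Minimal sketch of the "knee" curve: quality degrades slowly out to a knee
// distance and then falls off quickly beyond it, bottoming out at maxDistance.
function kneeCurve(
  distance: number,
  kneeDistance = 10,   // distance at which degradation accelerates
  maxDistance = 20,    // distance at which quality bottoms out
  qualityAtKnee = 0.8  // only a mild drop before the knee
): number {
  if (distance <= 0) return 1;
  if (distance >= maxDistance) return 0;
  if (distance <= kneeDistance) {
    // gentle, linear drop from 1.0 to qualityAtKnee before the knee
    return 1 - (distance / kneeDistance) * (1 - qualityAtKnee);
  }
  // steeper drop from qualityAtKnee to 0 after the knee
  const t = (distance - kneeDistance) / (maxDistance - kneeDistance);
  return qualityAtKnee * (1 - t);
}
```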
  • a video feed wrapper to manage the feeds may connect all users within a large peer-to-peer room but may only update the feeds associated with avatars within a specified proximity to each other with some buffer (or other condition, such as always-connected users).
  • the video feed wrapper may also mask the feeds (e.g., with pixelation and lower volume) depending on distance between avatars or other conditions generally related to the virtual environment.
  • a server may be utilized to establish peer-to-peer calls but may not be needed to maintain calls.
  • a server may additionally be utilized to record sessions or stabilize video calls.
  • WebRTC may be used as the video chat tool, although others may be utilized within the wrapper.
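The feed-wrapper idea might be sketched as below; the `RemoteFeed` interface stands in for whatever WebRTC or session layer is actually used, and the masking parameters are assumptions for illustration.

```typescript
// Hypothetical sketch of the feed wrapper: every participant in the room stays
// connected, but only feeds whose avatars are within a proximity radius (plus
// a buffer) are actively rendered; the rest are masked with pixelation and
// lowered volume.
interface RemoteFeed {
  userId: string;
  render(): void;                                          // draw the live feed
  mask(opts: { pixelate: number; volume: number }): void;  // cheap placeholder
}

class FeedWrapper {
  constructor(
    private feeds: Map<string, RemoteFeed>,
    private radius: number,
    private buffer: number
  ) {}

  update(distances: Map<string, number>, alwaysConnected: Set<string>): void {
    for (const [userId, feed] of this.feeds) {
      const d = distances.get(userId) ?? Infinity;
      if (alwaysConnected.has(userId) || d <= this.radius + this.buffer) {
        feed.render();
      } else {
        // the connection stays open, but the feed is pixelated and quieted
        feed.mask({ pixelate: Math.min(1, d / (this.radius * 4)), volume: 0 });
      }
    }
  }
}
```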
  • the system may also include a group assistant to perform integration with third-party software applications, such as address books and calendars, to enable users to exchange information and schedule events.
  • the group assistant may also control entertainment for the group.
  • the group assistant may be used to audibly control a user’s avatar (e.g., tell it to go to a specific location, put up an away message, or only allow users from particular groups to interrupt a meeting).
  • the system may be configured to operate with multiple cameras or devices per user to enhance eye contact between users. Using multiple cameras would enable users to interact with other content on other screens (e.g., watching a streaming movie) but use another device to chat with other viewers of the content. Multiple screens and cameras may be used for a single user to enable two different eye contact locations or better fidelity/clarity of eye contact.
  • the system may also include video processing software to capture a user’s gaze direction and body movement.
  • Eye gaze direction may be used to automatically detect where on the screen users are paying attention and may help them navigate the user experience with just their eyes and other gestures. For example, a disabled user may be able to move an avatar to locations on the screen by orienting his or her gaze. A user looking at another user’s avatar may cause the eye contact region to move to that user’s video chat screen. If the system detects that a user is viewing a portion of the screen where a virtual concert is occurring, the system may increase the concert audio being played into the user’s speakers or headphones. Body movements may be conveyed to avatars.
  • the avatars may do jumping jacks at the same time as their corresponding users in real life.
  • the system may capture gestures such as these using computer vision algorithms.
  • a computer vision software system may use a user’s camera to track movements of particular body parts of the user, and translate those gestures into the virtual environment, where they may be pantomimed by the user’s avatar.
  • the computer vision algorithms may map the gestures to the avatars or alias real-world gestures for avatars (e.g., a user cannot run in view of a webcam, but could perform a motion that would make the user’s avatar run).
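One way to express such gesture aliasing is a simple lookup from detected gestures to avatar actions, as in the sketch below; the gesture names and the detector that would produce them are hypothetical.

```typescript
// Hypothetical sketch of gesture aliasing: a computer-vision layer (assumed
// here as a simple event source) reports named gestures detected on camera,
// and a lookup table maps each one to an avatar action, including actions the
// user cannot literally perform in front of a webcam.
type Gesture = 'wave-arms' | 'raise-hand' | 'lean-forward' | 'march-in-place';
type AvatarAction = 'jumping-jacks' | 'raise-hand' | 'step-closer' | 'run';

const gestureAlias: Record<Gesture, AvatarAction> = {
  'wave-arms': 'jumping-jacks',   // example from the disclosure
  'raise-hand': 'raise-hand',     // direct pantomime
  'lean-forward': 'step-closer',
  'march-in-place': 'run',        // aliased: the user cannot run on camera
};

function onGestureDetected(gesture: Gesture, avatar: { perform(a: AvatarAction): void }): void {
  const action = gestureAlias[gesture];
  if (action) avatar.perform(action);
}
```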
  • the system may use virtual backgrounds to green screen two or more users into the same scene.
  • the users’ faces and their corresponding avatars’ bodies may be put in front of almost any scene, controlled with their keyboards.
  • the virtual environment may enable users to select items (e.g., using a mouse click or a touch on a touchscreen device) to direct the user’s virtual avatar to orient itself so that it is facing those items, or locations or users associated with the items, and may additionally move the eye contact region to focus attention on the users or virtual space associated with the items.
  • Users may be able to select other users to form groups.
  • the users’ avatars may include indicators showing that they share a group. For example, the users’ shadows may all change colors to those of a group they may be associated with.
  • the group may be controlled by combined movements. For example, the group may designate a user to control the movement for all users in the group.
  • the system may be configured to be operable with virtual reality (VR) and augmented reality (AR) headsets.
  • the system may be configured to record video and audio from communication sessions and track virtual locations of avatars.
  • the system may be configured to provide sophisticated gaming experiences (e.g., three-dimensional games, old game emulators, and board games).
  • the system may support cross-language translation and closed captioning for real time translations.
  • the system may provide for users to make music performances using in-game sounds (e.g., using sound effects related to dropping/picking up, or hovering over items to create a drum circle dynamic).
  • the system may provide a function by which administrators or super users are able to create events with multiple rooms included. Supervisors may adjust locations of other users.
  • the system may provide e-commerce and delivery services to vendors within the platform.
  • the system may render a three-dimensional image of a person’s face and create more eye contact locations with more other users.
  • Users may be able to mute themselves or one another. In some cases, if a user in a group is muted, the muting may be conveyed to other users in the group. Muting may be canceled when a user selects a muted user to interact with.
  • the system may have a low-bandwidth embodiment in which a user may see the chat windows of one or more users of interest but may only see thumbnails of peripheral users or users not of interest. This may reduce latency costs from streaming large amounts of data.
  • Video screens of peripheral users or users far away may be configured to increase in size if they are activated (e.g., clicked on) by other users (e.g., during a private conversation between users). Conversely, video screens of users who may be close or central may be configured to decrease in size if activated.
  • When the term “at least,” “greater than,” or “greater than or equal to” precedes the first numerical value in a series of two or more numerical values, the term applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.
  • When the term “no more than,” “less than,” or “less than or equal to” precedes the first numerical value in a series of two or more numerical values, the term applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.
  • FIG. 9 shows a computer system 901 that is programmed or otherwise configured to provide a video chat client with spatial interaction and eye contact recognition.
  • the computer system 901 can regulate various aspects of digital communication of the present disclosure, such as, for example, presenting video chat clients, promoting realistic eye contact, and providing a virtual environment.
  • the computer system 901 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device.
  • the electronic device can be a mobile electronic device.
  • the computer system 901 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 905, which can be a single core or multi core processor, or a plurality of processors for parallel processing.
  • the computer system 901 also includes memory or memory location 910 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 915 (e.g., hard disk), communication interface 920 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 925, such as cache, other memory, data storage and/or electronic display adapters.
  • the memory 910, storage unit 915, interface 920 and peripheral devices 925 are in communication with the CPU 905 through a communication bus (solid lines), such as a motherboard.
  • the storage unit 915 can be a data storage unit (or data repository) for storing data.
  • the computer system 901 can be operatively coupled to a computer network (“network”) 930 with the aid of the communication interface 920.
  • the network 930 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet.
  • the network 930 in some cases is a telecommunication and/or data network.
  • the network 930 can include one or more computer servers, which can enable distributed computing, such as cloud computing.
  • the network 930, in some cases with the aid of the computer system 901, can implement a peer-to-peer network, which may enable devices coupled to the computer system 901 to behave as a client or a server.
  • the CPU 905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software.
  • the instructions may be stored in a memory location, such as the memory 910.
  • the instructions can be directed to the CPU 905, which can subsequently program or otherwise configure the CPU 905 to implement methods of the present disclosure. Examples of operations performed by the CPU 905 can include fetch, decode, execute, and writeback.
  • the CPU 905 can be part of a circuit, such as an integrated circuit.
  • One or more other components of the system 901 can be included in the circuit.
  • the circuit is an application specific integrated circuit (ASIC).
  • the storage unit 915 can store files, such as drivers, libraries and saved programs.
  • the storage unit 915 can store user data, e.g., user preferences and user programs.
  • the computer system 901 in some cases can include one or more additional data storage units that are external to the computer system 901, such as located on a remote server that is in communication with the computer system 901 through an intranet or the Internet.
  • the computer system 901 can communicate with one or more remote computer systems through the network 930.
  • the computer system 901 can communicate with a remote computer system of a user (e.g., a mobile device).
  • remote computer systems include personal computers (e.g., portable PCs), slate or tablet PCs (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, smart phones (e.g., Apple® iPhone, Android-enabled devices, Blackberry®), or personal digital assistants.
  • the user can access the computer system 901 via the network 930.
  • Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 901, such as, for example, on the memory 910 or electronic storage unit 915.
  • the machine executable or machine readable code can be provided in the form of software.
  • the code can be executed by the processor 905.
  • the code can be retrieved from the storage unit 915 and stored on the memory 910 for ready access by the processor 905.
  • the electronic storage unit 915 can be precluded, and machine-executable instructions are stored on memory 910.
  • the code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code or can be compiled during runtime.
  • the code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
  • aspects of the systems and methods provided herein can be embodied in programming.
  • Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium.
  • Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk.
  • “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server.
  • another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links.
  • a machine readable medium such as computer-executable code
  • a tangible storage medium such as computer-executable code
  • Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings.
  • Volatile storage media include dynamic memory, such as main memory of such a computer platform.
  • Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system.
  • Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications.
  • Common forms of computer-readable media therefore include, for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data.
  • the computer system 901 can include or be in communication with an electronic display 935 that comprises a user interface (UI) 940 for providing, for example, a video chat interface.
  • Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
  • Methods and systems of the present disclosure can be implemented by way of one or more algorithms.
  • An algorithm can be implemented by way of software upon execution by the central processing unit 905.
  • the algorithm can, for example, adjust one or more parameters of the video chat client in response to changes in the virtual environment.

Abstract

Disclosed herein are a system and method for providing a co-responsive virtual spatial environment and chat interface. Users of the chat interface may have paired avatars within the virtual spatial environment. These avatars may interact with other virtual avatars and in-environment virtual objects. When an avatar interacts with another avatar, aspects of virtual communication, including audio and video quality, may be modified responsive to the virtual interaction. Additionally, aspects of virtual communication may be impacted by interactions with virtual objects in the virtual chat environment.

Description

VIDEO CHAT WITH SPATIAL INTERACTION AND EYE CONTACT RECOGNITION
CROSS-REFERENCE
[0001] This application is related to U.S. Provisional Patent Application No. 63/041,023, filed on June 18, 2020, and U.S. Provisional Patent Application No. 63/121,827, filed on December 4, 2020, both of which are incorporated herein by reference in their entirety.
BACKGROUND
[0002] Large group video chats, and video chats in general, do not reflect how people naturally interact with one another in physical space. Group interactions can be draining when a participant does not know who to talk to, who to look at, or who is looking at them. In real life, larger groups of people (e.g., of six or more) naturally form into multiple smaller conversations (e.g., of two or more) that overlap. Humans are spatial beings and best digest social interactions in the context of space (2D or 3D), using physical distance to control their engagement in conversations with others.
SUMMARY
[0003] Introduced herein are innovations for a video chat service that allows users to interact as they would in real-life. Example features for such a video chat service are described below. In addition to providing a video chat interface enabling users to chat with one another via webcam, the video chat service provides a virtual spatial environment including avatars associated with the users. The virtual spatial environment may enable users to attend virtual events or perform virtual activities with users they communicate with via webcam. Additionally, the video chat service may provide overlays facilitating more realistic communication via video, including features that link actions performed by the users over video chat with actions performed by their virtual avatars.
[0004] Unlike conventional video chat services, the service described herein uses its virtual environment to make virtual communication closer to in-person communication. Virtual avatars may perform actions performed by people in reality, including playing games, moving their bodies, focusing attention on one another, and attending events. Users may control the actions of their avatars using input devices, such as keyboards and mice (e.g., by pressing shortcut keys or hotkeys, typing commands into a terminal, or clicking on avatars or objects). They may also perform gestures which are picked up via webcam and translated into avatar actions (e.g., waving arms on camera may cause an avatar to perform jumping jacks). Such innovations, not available on many web chat programs, broaden the range of virtual activities users may perform when located remotely.
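As one possible illustration of keyboard-driven avatar control in a browser client, the sketch below maps hotkeys to avatar movements and actions; the specific key bindings and the `ControllableAvatar` interface are assumptions, not part of the disclosure.

```typescript
// Minimal sketch, assuming a browser client: keyboard shortcuts are translated
// into avatar movements and actions in the virtual spatial environment.
interface ControllableAvatar {
  move(dx: number, dy: number): void;
  perform(action: string): void;
}

const hotkeys: Record<string, (a: ControllableAvatar) => void> = {
  ArrowUp:    a => a.move(0, -1),
  ArrowDown:  a => a.move(0, 1),
  ArrowLeft:  a => a.move(-1, 0),
  ArrowRight: a => a.move(1, 0),
  j:          a => a.perform('jumping-jacks'), // illustrative hotkey bindings
  w:          a => a.perform('wave'),
};

function bindAvatarControls(avatar: ControllableAvatar): void {
  window.addEventListener('keydown', (e: KeyboardEvent) => {
    const handler = hotkeys[e.key];
    if (handler) handler(avatar);
  });
}
```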
[0005] Further, the video chat service includes eye contact enhancement features to make webcam communication more realistic. For example, the service may move a video chat screen to an eye contact region, enabling users of the service to look directly at one another as if they were communicating in person. Further, the video chat service may orient video chat screens on the periphery of the display to relate them to where the users’ avatars are located within the virtual environment. This feature may map to an in-person communication with multiple users located at different points in space and enable video chat users to direct gazes toward one another in a more realistic way than in currently-used video chat programs.
[0006] In an aspect, a method for enhancing online video communication for a plurality of users is disclosed. The method comprises (a) providing (1) a first instance of a video chat client application for a first user of the plurality of users, (2) a second instance of the video chat client application for a second user of the plurality of users, and (3) a virtual spatial environment for the first user and the second user. The first user has a first avatar in the virtual spatial environment and the second user has a second avatar in the virtual spatial environment. The method further comprises (b) obtaining one or more features associated with virtual spatial information relating to the first avatar and the second avatar in the virtual spatial environment; obtaining one or more features indicative of audiovisual information from the first instance of the video chat client application and from the second instance of the video chat client application; and (c) modifying at least one output of the first instance of the video chat client application, of the second instance of the video chat client application, a state of the first avatar, a state of the second avatar, or a combination thereof, responsive to the one or more features indicative of virtual spatial information, the one or more features indicative of audiovisual information, or a combination thereof.
[0007] In some embodiments, the method further comprises modifying a feature of the virtual spatial environment.
[0008] In some embodiments, (c) comprises modifying a picture quality of the video chat.
[0009] In some embodiments, the picture quality is a resolution, a window size, a color scheme, a brightness, a contrast, a transparency, or a clarity.
[0010] In some embodiments, the resolution is modified according to a distance between the first avatar and the second avatar.
[0011] In some embodiments, the distance is an adjustable value based on a preference of the user, a number of participants, device capabilities, and other virtual features.
[0012] In some embodiments, the resolution decreases linearly.
[0013] In some embodiments, the feature is a gaze direction from the first video chat client instance and wherein (c) comprises modifying the orientation of the second avatar.
[0014] In some embodiments, (c) comprises modifying an audio quality.
[0015] In some embodiments, modifying the audio quality comprises modifying a loudness, a volume, an accuracy, a fidelity, a pitch, or a tempo.
[0016] In some embodiments, modifying the volume comprises varying the volume from a highest volume to a lowest volume.
[0017] In some embodiments, modifying the audio quality is performed using post processing of live stream signals of the online video communication.
[0018] In some embodiments, (c) comprises modifying a location of a chat window.
[0019] In some embodiments, modifying the location of the chat window comprises making the chat window more prominent in a view of the first user and in a view of the second user.
[0020] In some embodiments, making the chat window more prominent comprises moving the chat window to an eye contact region of a screen.
[0021] In some embodiments, the location is based on a location of a camera associated with the screen.
[0022] In some embodiments, the location depends on an orientation of the display.
[0023] In some embodiments, the location depends on a frequency of interaction between the first avatar and the second avatar.
[0024] In some embodiments, the location depends on a focus on a feed.
[0025] In some embodiments, the modifying the audio quality comprises using post processing of live stream signals.
[0026] In some embodiments, the method further comprises, responsive to receiving a click on the first avatar, reversing the modification of the output of the first video chat interface or the state of the first avatar.
[0027] In some embodiments, the method further comprises, upon receiving an action on the second avatar, initiating a private conversation between the first user and the second user.
[0028] In some embodiments, the virtual spatial environment provides access to third party software.
[0029] In some embodiments, the third party software is word processing software.
[0030] In some embodiments, the virtual spatial environment includes gaming objects.
[0031] In some embodiments, the virtual spatial environment is a game board.
[0032] In some embodiments, a feature of the one or more features indicative of virtual spatial information is a virtual distance between the first avatar and the second avatar.
[0033] In some embodiments, the first user or the second user may toggle a high-powered mode to prevent modification of a volume of the first video chat client instance or of the second video chat client instance.
Another aspect of the present disclosure provides a non-transitory computer readable medium comprising machine executable code that, upon execution by one or more computer processors, implements any of the methods above or elsewhere herein.
[0034] Another aspect of the present disclosure provides a system comprising one or more computer processors and computer memory coupled thereto. The computer memory comprises machine executable code that, upon execution by the one or more computer processors, implements any of the methods above or elsewhere herein.
[0035] Additional aspects and advantages of the present disclosure will become readily apparent to those skilled in this art from the following detailed description, wherein only illustrative embodiments of the present disclosure are shown and described. As will be realized, the present disclosure is capable of other and different embodiments, and its several details are capable of modifications in various obvious respects, all without departing from the disclosure. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
INCORPORATION BY REFERENCE
[0036] All publications, patents, and patent applications mentioned in this specification are herein incorporated by reference to the same extent as if each individual publication, patent, or patent application was specifically and individually indicated to be incorporated by reference.
To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.
BRIEF DESCRIPTION OF THE DRAWINGS
[0037] The novel features of the invention are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present invention will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the invention are utilized, and the accompanying drawings (also “Figure” and “FIG.” herein), of which:
[0038] FIG. 1A shows a view of a user’s screen during a video chat interaction.
[0039] FIG. 1B shows an example AV feed associated with one user that illustrates an example lowest (e.g., 0%) level of quality.
[0040] FIG. 1C shows an example AV feed associated with another user that illustrates an example degraded (e.g., somewhere between 100% and 0%) level of quality.
[0041] FIG. 1D shows an example AV feed associated with another user that illustrates an example highest (e.g., 100%) level of quality.
[0042] FIG. 2 shows a view of a user’s screen during a video chat interaction that further illustrates how proximity between avatars may affect the quality of presentation of AV feeds of users associated with the avatars.
[0043] FIG. 3 shows a view of a screen with three focus feeds in an eye contact region near the top of the screen.
[0044] FIG. 4 illustrates additional enhancements to promote conveying eye contact and reduce unease in video chat communications.
[0045] FIG. 5 illustrates an algorithm to make audio degrade or decrease in volume in a manner that replicates real-world conditions.
[0046] FIG. 6 illustrates a scenario by which multiple users may have private conversations even while viewing a large public event, according to an embodiment.
[0047] FIG. 7 illustrates content collaboration using the system, in an embodiment.
[0048] FIG. 8 illustrates a scenario wherein users may play an online game together, in various embodiments.
[0049] FIG. 9 shows a computer system that is programmed or otherwise configured to implement methods provided herein.
DETAILED DESCRIPTION
[0050] While various embodiments of the invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions may occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein may be employed.
[0051] The system disclosed provides an enhanced user experience with respect to online chat applications, adding features to more closely simulate in-person interactions. The system supplies a chat interface combined with a virtual spatial environment, with user experience additions to improve mutual eye contact between video chat participants. The disclosed system may enable users to interact in many kinds of virtual environments, including online conferences, concerts, events, and games.
[0052] In the virtual spatial environment, the system pairs a user of a video chat program with a virtual avatar. The actions of these virtual avatars may affect virtual communications between users. Users may manipulate the motions and actions of their avatars using input devices or using gestures. Additionally, users may program avatars to direct their motions and actions automatically. For example, as one virtual avatar approaches another, the video chats between the users associated with those avatars may be enhanced. For example, video chat volume may increase or picture quality may sharpen. The converse may be true when users’ avatars move away from one another.
[0053] The eye contact improvements provide more realistic chat experiences for users by providing visual eye contact regions to users, guiding them as to where to focus their eyes in order to maintain proper eye contact with other users. Users may choose to focus on video feeds or particular users in order to have private conversations with them. Eye contact regions may be configured with respect to the location of the user’s webcam or with respect to data collected from user interaction with other users.
Spatial Interaction
[0054] In an example embodiment, each user (i.e., participant in a video chat) may select, develop, create, or otherwise be associated with a virtual character (referred to herein as an avatar) which represents that user in a virtual spatial environment (2D or 3D).
[0055] Proximity between avatars in the virtual spatial environment may affect whether audiovisual (AV) feeds associated with certain users are presented. For example, when an avatar moves towards and/or is within a threshold proximity to other avatars in the virtual spatial environment, the respective users can see and/or hear each other through presentation of live AV feeds. If the avatars move apart and/or are outside the threshold proximity, the AV feeds may not be presented.
[0056] Further, proximity between avatars can affect how AV feeds are presented. For example, a quality level of the audio and/or video associated with user AV feeds can be adjusted based on proximity between avatars. For video, quality level may include, for example, a transparency level, a resolution, a size of window, a color scheme, brightness, contrast, clarity, etc. For audio, quality level may include, for example, volume, accuracy, fidelity, pitch, tempo, etc. Quality levels may decrease in such a manner to mimic decreases in quality with respect to distances in real-world communication.
[0057] In some cases, a user may be presenting during an event and may wish for all users attending, not just those within a predetermined distance, to be heard. The presenting user (or an administrator) may choose to activate a high-powered mode in order to be heard clearly by all attending users.
[0058] In an example embodiment, as avatars get closer together or come within a threshold proximity, the quality level of the audio and/or video in associated AV feeds may improve. In an embodiment, the video clarity and audio volume may improve to simulate the increased clarity people experience in real life when they get closer together.
[0059] The quality of the video displayed may range from a best possible quality given network and device constraints (i.e., “100%”) when the avatars are within a specified range, and then as the distance between the avatars increases, the quality of the video may decrease to a lower quality level. The change in quality may vary linearly or otherwise based on distance between avatars. Audio may behave similarly, with volume varying, for example, from a highest volume to a lowest volume.
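By way of illustration only, the following sketch shows one way such a distance-based quality level could be computed; the radii, the linear falloff, and all names (e.g., fullQualityRadius, cutoffRadius, FeedQuality) are assumptions made for this example and are not drawn from the disclosure.

```typescript
interface Vec2 { x: number; y: number; }

interface FeedQuality {
  present: boolean;   // whether the AV feed is presented at all
  opacity: number;    // 1.0 = fully opaque, lower = more transparent
  resolution: number; // fraction of the best resolution available
  volume: number;     // 0.0 (lowest volume) to 1.0 (highest volume)
}

function distance(a: Vec2, b: Vec2): number {
  return Math.hypot(a.x - b.x, a.y - b.y);
}

// Quality is 100% inside fullQualityRadius, falls off linearly to 0% at
// cutoffRadius, and the feed is not presented at all beyond the cutoff.
function qualityForDistance(
  a: Vec2,
  b: Vec2,
  fullQualityRadius = 100,
  cutoffRadius = 400,
): FeedQuality {
  const d = distance(a, b);
  if (d >= cutoffRadius) {
    return { present: false, opacity: 0, resolution: 0, volume: 0 };
  }
  const t =
    d <= fullQualityRadius
      ? 1
      : 1 - (d - fullQualityRadius) / (cutoffRadius - fullQualityRadius);
  return { present: true, opacity: t, resolution: t, volume: t };
}
```

In a sketch of this kind, the same fraction can drive video opacity and resolution as well as audio volume, so that all quality dimensions fall off together as avatars separate.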
[0060] The system may place chat windows at various locations within the virtual spatial environment, responsive to environmental features or user actions. For example, in a conference environment, video chat windows may be oriented such that a presenter is situated above an audience. Users may choose to focus on other particular users within the environment, moving their chat windows to central locations to focus on them.
Eye-Contact
[0061] Another drawback with current video chat platforms is the inability of users to convey eye-contact in groups. Speaking to others without knowing whether others are paying attention may lead to uneasiness. To combat this, speakers may engage in exaggerated gestures or vocal dynamics, which may lead to fatigue. A technique for conveying attention between users that addresses the issues associated with existing platforms is described below.
[0062] In some embodiments, AV feeds from users associated with nearby avatars may be presented as semi-transparent thumbnails in a peripheral view area, which is located in a corner of the computer screen (See e.g., FIG. 1A and 2).
[0063] When mutual focus is established between two users (through a variety of methods outlined below), the video feed of each user may be repositioned to be in a more prominent location in the respective views of each user. This more prominent location is referred to herein as the eye-contact region.
[0064] In some embodiments, an eye-contact region of a display screen may be based on a location of an associated camera (e.g., web cam) relative to the display screen. For example, in devices with a centrally located webcam directly above the display screen (e.g., as in many laptop computing devices), the eye-contact region may be centrally located at or near the top of the screen (e.g., as depicted in FIG. 2). In some embodiments, the location of the eye-contact region may vary based on the orientation of the display relative to the device. For example, a display may transition based on an orientation of a tablet device. In such an example, the eye-contact region may also transition to remain closest to an associated forward-facing camera of the tablet device to retain natural eye-contact between users with established mutual focus.
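As a non-limiting sketch of this placement logic, the example below positions an eye-contact region along the screen edge nearest the camera; the CameraEdge type and the one-third band size are illustrative assumptions, not specified values.

```typescript
type CameraEdge = 'top' | 'bottom' | 'left' | 'right';

interface Region { x: number; y: number; width: number; height: number; }

// Place the eye-contact region along the screen edge nearest the camera,
// so a user looking at a focused feed also appears to look into the lens.
function eyeContactRegion(
  screenWidth: number,
  screenHeight: number,
  cameraEdge: CameraEdge,
): Region {
  const band = Math.round(screenHeight / 3); // top/bottom band height
  const side = Math.round(screenWidth / 3);  // left/right band width
  switch (cameraEdge) {
    case 'top':    return { x: 0, y: 0, width: screenWidth, height: band };
    case 'bottom': return { x: 0, y: screenHeight - band, width: screenWidth, height: band };
    case 'left':   return { x: 0, y: 0, width: side, height: screenHeight };
    case 'right':  return { x: screenWidth - side, y: 0, width: side, height: screenHeight };
  }
}
```

On a rotating device, the same function could simply be re-run with the new camera edge whenever an orientation change is detected.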
[0065] With this action, both users may move attention to the eye-contact location and may see a visual confirmation that eye-contact is established, just like in real-life. The user in this interface may lift their eyes to view the other video feed directly below their camera and may make “eye-contact” with the other user. Either user may convey removal of eye-contact, or it may timeout and then the feeds will return to peripheral view (through a variety of methods outlined below).
Establishing and Disestablishing Eye-Contact
[0066] In some embodiments, establishing eye-contact occurs through a call-and-response of attention between two users. Some useful terminology is described:
[0067] “Mutual Focus” relates to both sides of the user-to-user connection feeds going to the eye-contact/camera location and becoming larger, due to conveyed attention from one or both users.
[0068] “Call-and-response” relates to one user conveying attention to a second user, and the second responding to establish mutual focus between the two users.
[0069] “Eye Contact Region” refers to a prominent area of the display conducive to eye contact between users (e.g., the top one-third of the screen) reserved for feeds in which eye-contact is established between both users.
[0070] “Peripheral View Region” refers to a lower region of the screen presenting the feeds to highlight the largest change in attention from the central eye-contact view. Feeds in this region may have a lower quality (e.g., semi-transparent thumbnails, etc.) unless interacted with by the mouse.
[0071] “Feed Clarity” refers to the extent to which audio and video quality may be degraded depending on distance between avatars, virtual environmental conditions, or other means.
[0072] Feed focus may be established between users in a number of different ways.
[0073] A user may convey attention to another user by way of a mouse (or other cursor) hover or a click on another user’s feed or avatar, by responding verbally to another, and/or by automatic detection of the user’s eye focus with video processing software. When a user conveys attention to another, the user’s avatar may face the other user’s avatar, and the video screens of the two users may move to locations to facilitate better eye contact between them.
[0074] If associated avatars are in close proximity, feed focus may occur automatically.
[0075] Focusing on a feed may cause it to be non-transparent and may move it from a peripheral region into the eye-contact region.
[0076] In some embodiments, both users may have to convey attention in order for feed focus to be established and their respective feeds to move to the eye-contact region or cause non- transparency.
[0077] Feeds may remain semi-transparent until users hover or click to convey attention or speak in response.
[0078] If a user decides to focus on one user’s feed in particular, they may hover the mouse over that user’s thumbnail feed. In response, the thumbnail may become non-transparent on both sides of the connection, conveying to both sides where attention is being directed. On mobile platforms, this may be established by the user’s finger on the screen.
[0079] A mouse/cursor may generally declare or establish where a user’s attention is and may be represented by an icon in the display screen.
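The call-and-response flow described above can be illustrated with a minimal sketch; the MutualFocusTracker class and AttentionSignal values below are hypothetical names used only for this example and are not part of the disclosure.

```typescript
type AttentionSignal = 'hover' | 'click' | 'verbal-response' | 'gaze';

class MutualFocusTracker {
  // Keyed by "caller->target"; records one-sided conveyed attention.
  private conveyed = new Set<string>();

  // Returns true once both sides have conveyed attention (mutual focus),
  // at which point both feeds may move to the eye-contact region.
  convey(from: string, to: string, _signal: AttentionSignal): boolean {
    this.conveyed.add(`${from}->${to}`);
    return this.conveyed.has(`${to}->${from}`);
  }

  // Either user may withdraw attention, returning feeds to peripheral view.
  withdraw(from: string, to: string): void {
    this.conveyed.delete(`${from}->${to}`);
    this.conveyed.delete(`${to}->${from}`);
  }
}

// Example: user A hovers over B's thumbnail, then B clicks A's feed.
const tracker = new MutualFocusTracker();
tracker.convey('A', 'B', 'hover');                // false: only one side so far
const mutual = tracker.convey('B', 'A', 'click'); // true: mutual focus established
```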
[0080] Feed focus may be disestablished between users in a number of different ways as well.
[0081] In some embodiments, either user may click on the other user’s video feed and return it to the peripheral view.
[0082] Lack of focus: Heightened levels of mouse movement and/or avatar movement by one user may indicate a lack of focus on the other user. Similarly, a sustained silence between users may indicate a lack of focus between users. In such cases, this detected lack of focus may disestablish feed focus and may return respective AV feeds to a peripheral view. In some embodiments, the system may automatically determine where user eye focus is with video processing software and disestablish feed focus accordingly.
[0083] Timeout: Automatic focus timeout may occur. With lack of focus on a feed, the feed may auto-timeout after some period of time (e.g., 4-6 seconds). In such cases, an AV feed may slowly transition to a lower quality level (e.g., approximately 20-25% transparency) and then return to a peripheral view. During this time, the user may still hover the mouse over the video feed or click on the feed to stop the timeout process. For example, a quick hover may reset the timeout. As another example, a click may extend that timeout.
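A possible sketch of this timeout behavior follows; the 5-second default and the 25% fade level are illustrative values chosen from within the ranges mentioned above, not prescribed parameters.

```typescript
class FocusTimeout {
  private timer: ReturnType<typeof setTimeout> | null = null;

  constructor(
    private onFade: (transparency: number) => void,
    private onReturnToPeripheral: () => void,
    private timeoutMs = 5000, // somewhere in the 4-6 second range
  ) {}

  // Called when focus on a feed lapses; starts the countdown.
  start(): void {
    this.cancel();
    this.timer = setTimeout(() => {
      this.onFade(0.25);           // ask the UI to fade toward ~20-25% transparency
      this.onReturnToPeripheral(); // then return the feed to the peripheral view
    }, this.timeoutMs);
  }

  // A quick hover resets the countdown; a click could extend it further.
  resetOnHover(): void {
    this.start();
  }

  cancel(): void {
    if (this.timer !== null) clearTimeout(this.timer);
    this.timer = null;
  }
}
```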
Additional Eye-Contact Locations
[0084] In some embodiments, more than one focused feed may be placed in the eye-contact region of the display screen. For example, FIG. 3 shows a view of a screen with three focus feeds in an eye contact region near the top of the screen. As depicted in FIG. 3, additional focused feeds may be placed to the left and right of a central feed on the screen.
[0085] Dimensions of the feed windows may be reduced to accommodate the multiple feeds in the eye contact region. For example, feeds may be reduced in width and/or height to only (or primarily) contain the face of each user. Video processing software may be utilized to detect a region of a feed containing a user’s face and to adjust the feed to, for example, digitally zoom in on a user’s face, for better facial focus and eye contact. The zoomed-in window may dynamically track the user’s face as the user’s face moves in the space of the full video feed.
[0086] Generally, the closer a displayed feed is to the camera location, the better eye contact is established as experienced by the user. Second and third established eye-contact feeds may be left-right paired, so guests are looking towards one another.
[0087] Therefore, users may have two or more locations for eye-contact. The most active or focused users at any given moment will be put into the eye-contact region, which may be controlled by the attention-algorithm, matching users who are maintaining the most attention with one another at the present moment. A maximum number of feeds in the eye-contact region may be established through testing and/or may be adjustable based on user preferences. The attention-algorithm may take as input how many users are nearby, how close a user gets to another, frequencies of communication between users, indications of attention going elsewhere, or other attention indicators.
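One plausible form of such an attention-algorithm is sketched below; the features, weights, and default maximum of three feeds are illustrative assumptions, not values specified in this disclosure.

```typescript
interface AttentionFeatures {
  virtualDistance: number;             // distance between avatars
  messagesPerMinute: number;           // frequency of communication
  secondsSinceLastInteraction: number; // staleness of the connection
  attentionElsewhere: boolean;         // e.g. heavy mouse movement away from the feed
}

function attentionScore(f: AttentionFeatures): number {
  let score = 0;
  score += 10 / (1 + f.virtualDistance);        // closer avatars score higher
  score += 2 * f.messagesPerMinute;             // frequent exchanges score higher
  score -= 0.1 * f.secondsSinceLastInteraction; // stale connections decay
  if (f.attentionElsewhere) score -= 5;         // penalize diverted attention
  return score;
}

// The top-scoring users (up to the configured maximum) occupy the
// eye-contact region; the rest remain in the peripheral view.
function pickEyeContactUsers(
  candidates: Map<string, AttentionFeatures>,
  maxFeeds = 3,
): string[] {
  return [...candidates.entries()]
    .sort(([, a], [, b]) => attentionScore(b) - attentionScore(a))
    .slice(0, maxFeeds)
    .map(([userId]) => userId);
}
```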
[0088] In some scenarios, a user may be talking to a group of users but may wish to provide eye contact to each user in turn, as may occur during a real-world conversation. In such a scenario, a user may view the other users providing eye contact to him or her through a semi-transparent eye contact location viewable on the user’s chat window. If the user wishes to convey eye contact to a user out of the group, the user may focus on an eye contact location associated with that user. The focusing may show a non-transparent overlay on the speaking user’s interface as well as convey to the listening user that the speaking user is providing eye contact.
Avatar Animations
[0089] Focusing on another user (i.e., establishing eye contact) may affect how an avatar is presented in the virtual environment. For example, when establishing eye contact with another user, the head of an avatar of one user may turn to face the avatar of the other user, to help convey where focus is even for other users not involved in the user-to-user connection. If the user is not focused on another, the avatar’s head may be pointed in the direction of where the mouse cursor is (potentially represented as a hand or eye-ball), to help convey attention.
[0090] When a user is not engaged with the video chat application in full-screen mode, the avatar may show, through mini-animations, that the user is distracted, for example by looking at a phone or staring at the sky. Similarly, if the user selects their own feed, the avatar may look at a mirror. If the user is looking at the avatar’s inventory, the avatar may look inside a bag.
Descriptions of the Figures
[0091] FIGS. 1A-1D and 2 illustrate certain aspects of an example spatial interaction feature.
[0092] FIG. 1A shows a view of a user’s screen during a video chat interaction. As shown in FIG. 1A, various audiovisual (AV) feeds associated with users are presented along with a view of a virtual spatial environment (in this example, a 2D environment) that is occupied by various avatars. An avatar may be associated with a user participating in the video chat. In the example view depicted in FIG. 1A, an AV feed of a primary speaker is depicted in an eye contact region (described in more detail later) at the top of the screen, AV feeds of other users are depicted in a peripheral region in the lower left, and an AV feed of the user viewing the screen is depicted in the lower right. FIG. 1A shows an example arrangement of elements and is not to be construed as limiting.
[0093] FIG. 1B shows an example AV feed associated with one user that illustrates an example lowest (e.g., 0%) level of quality. In this example, the AV feed at this lowest level of quality may be semi-transparent and may have a lowest resolution setting.
[0094] FIG. 1C shows an example AV feed associated with another user that illustrates an example degraded (e.g., somewhere between 100% and 0%) level of quality. In this example, the AV feed may still be semi-transparent and may have a degraded resolution but not as low as the AV feed of FIG. 1B.
[0095] FIG. 1D shows an example AV feed associated with another user that illustrates an example highest (e.g., 100%) level of quality. In this example, the AV feed is non-transparent (i.e., opaque) and video is depicted at a highest resolution, given network and/or device constraints.
[0096] FIG. 2 shows a view of a user’s screen during a video chat interaction that further illustrates how proximity between avatars may affect the quality of presentation of AV feeds of users associated with the avatars. In the example depicted in FIG. 2, an avatar associated with a user viewing the screen (the “viewing user”) is presented in the center (i.e., the avatar titled “Tyler”). The 2D region surrounding the avatar is the virtual spatial environment. The thumbnail and active AV feed windows overlay the view of the virtual environment and include live video feeds from other users participating in a video chat with the viewing user. The active feeds are located in the eye-contact region of the screen. Depending on the number of users one is currently actively interacting with, there can be multiple, one or no active feeds in that region. The other avatars are associated with other users.
[0097] Avatars within a certain radius represented by the dashed line may be associated with AV feeds that are presented with a highest (i.e., 100%) quality level (e.g., non-transparent, highest resolution, highest volume, etc.) while avatars outside of the radius may be associated with AV feeds that are presented with a degraded quality (e.g., semi-transparent, lower resolution, lower volume, etc.). As avatars move further from the depicted radius the associated AV feeds may decrease in quality (linearly or otherwise) until a particular distance (not depicted) at which point the AV feeds may not be presented at all.
[0098] In some embodiments, the quality level may be adjusted using post processing of live stream signals (e.g., using a filter) at a local device associated with a viewing user. For example, a post process can be applied to a received video signal to adjust a resolution and/or opacity of the video presented based on the received signal. In other embodiments, quality level may be adjusted by adjusting the live stream signals (e.g., by adjusting bitrate, codec, etc.). The quality may decrease to encourage users to move their avatars closer to avatars of other users they wish to see and hear more clearly. In some situations (e.g., an online board meeting or talk), such quality degradation may not be implemented, as it may cause distractions for participants who need to see and hear a presenter at all times.
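For example, a local post-processing step of this kind might be sketched as follows using standard browser APIs; the element ID and the opacity/blur/volume mapping are assumptions made only for illustration, and transmission-side adjustments such as bitrate or codec changes are not shown.

```typescript
// Adjusts only how a received feed is presented locally, not what is transmitted.
function applyQuality(video: HTMLVideoElement, quality: number): void {
  const q = Math.min(1, Math.max(0, quality));
  video.style.opacity = String(0.3 + 0.7 * q);    // semi-transparent when degraded
  video.style.filter = `blur(${(1 - q) * 6}px)`;  // approximates a lower resolution
  video.volume = q;                                // quieter when farther away
}

// Example: degrade the feed of a user whose avatar has moved away.
// 'remote-feed' is a hypothetical element ID used for this sketch.
const remoteFeed = document.getElementById('remote-feed') as HTMLVideoElement;
applyQuality(remoteFeed, 0.4);
```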
[0099] The specific rate at which video and/or audio quality decrease may be fine-tuned with additional testing. The radius of visibility (e.g., the dotted line in FIG. 2) may be an adjustable value, based on preference of the user, number of participants, device capabilities, and other virtual features. In some embodiments, a virtual object in the virtual environment may also affect the quality level of AV feeds. For example, virtual obstacles such as walls, shrubs, and other objects may affect the quality level of associated AV feeds if they come between avatars in the virtual environment. As an illustrative example, if an avatar associated with another user goes into a virtual structure such as a virtual house, the AV feed associated with that user may be degraded (from the perspective of a viewing user) since the virtual walls of the virtual house are between their respective avatars.
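An obstacle check of this kind might be sketched as follows, assuming walls are modeled as 2D line segments; the per-wall attenuation factor is an illustrative value, not a disclosed parameter.

```typescript
interface Vec2 { x: number; y: number; }
interface Segment { a: Vec2; b: Vec2; }

// Standard 2D segment intersection test via orientation signs.
function segmentsIntersect(p: Segment, q: Segment): boolean {
  const cross = (o: Vec2, a: Vec2, b: Vec2) =>
    (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
  const d1 = cross(q.a, q.b, p.a);
  const d2 = cross(q.a, q.b, p.b);
  const d3 = cross(p.a, p.b, q.a);
  const d4 = cross(p.a, p.b, q.b);
  return d1 * d2 < 0 && d3 * d4 < 0;
}

// Each wall crossing the line of sight between the two avatars multiplies
// the base quality by an attenuation factor (0.5 here is illustrative).
function occludedQuality(base: number, a: Vec2, b: Vec2, walls: Segment[]): number {
  const sight: Segment = { a, b };
  const blocking = walls.filter((w) => segmentsIntersect(sight, w)).length;
  return base * Math.pow(0.5, blocking);
}
```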
[00100] FIG. 6 illustrates a scenario by which multiple users may have private conversations even while viewing a large public event, according to an embodiment. The system may enable private conversations to occur by fading out or de-emphasizing background objects and/or users, aside from the plurality of users engaged in the private conversation. A user may enter into a private conversation with another user by clicking on the other user’s avatar. This action may shrink or remove a communication feed associated with the event and increase in size a chat module with the clicked user. By clicking oneself, the user may exit the private conversation. This may remove the content associated with the other user and reproduce or enlarge the content associated with the event.
[00101] FIG. 7 illustrates content collaboration using the system, in an embodiment. Users of the system may share their screens when working on a web or application-based project. The project may be presented in a window which may be shared in a small or large size. Users’ avatars and chat windows may have transparencies applied to them, so they are partially visible when working on the project.
[00102] FIG. 8 illustrates a scenario wherein users may play an online game together, in various embodiments. In some embodiments, the virtual environment may itself be a board game. In other embodiments, the virtual environment may have interactive board game pieces as objects within them. For example, the virtual environment may include a virtual card table object or chessboard at particular locations within the environment.
[00103] FIG. 4 illustrates additional enhancements to promote conveying eye contact and reduce unease in video chat communications. Specifically, FIG. 4 illustrates an interface for grouping onscreen users into central and peripheral users. Central users may be actively focused on or engaged in conversation. Their video chat windows may be displayed prominently in a central user location (e.g., the center of the screen). Peripheral users may be not actively focused on or engaged with. Views of their video chat windows may be placed along the perimeter of the screen or furthest away from the chat windows of the central users within the user interface (e.g., at the outside corners, on the bottom right of the screen, or on the bottom left of the screen). Additionally, video quality and audio quality for these peripheral users may be degraded (e.g., their video chats may be smaller or partially transparent, and audio volume may be softer). Clicking or otherwise focusing on a peripheral user may promote the peripheral user to being a central user. This may move the peripheral user’s chat window to the center of the screen and increase the audio and video quality of the video chat.
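A minimal sketch of this central/peripheral grouping follows; the ScreenLayout class and its relayout callback are hypothetical names introduced only for illustration.

```typescript
type UserTier = 'central' | 'peripheral';

class ScreenLayout {
  private tiers = new Map<string, UserTier>();

  constructor(private relayout: (central: string[], peripheral: string[]) => void) {}

  add(userId: string, tier: UserTier = 'peripheral'): void {
    this.tiers.set(userId, tier);
    this.apply();
  }

  // Clicking a peripheral user's window promotes them to the central group
  // (larger window, higher audio/video quality); clicking a central user
  // demotes them back to the periphery.
  toggle(userId: string): void {
    const current = this.tiers.get(userId);
    if (!current) return;
    this.tiers.set(userId, current === 'peripheral' ? 'central' : 'peripheral');
    this.apply();
  }

  private apply(): void {
    const central = [...this.tiers].filter(([, t]) => t === 'central').map(([id]) => id);
    const peripheral = [...this.tiers].filter(([, t]) => t === 'peripheral').map(([id]) => id);
    this.relayout(central, peripheral);
  }
}
```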
Mobile Implications
[00104] With smaller screens, similar eye-contact and peripheral regions can be established. Feeds may appear smaller, but as with a larger display, a region nearest the camera may be reserved for eye-contact establishment.
Additional Considerations
[00105] Examples are described in the context of 2D display screens (e.g., associated with tablet computers or mobile phones). The introduced techniques can similarly be applied to augmented reality (AR) and/or virtual reality (VR) headset devices. Further, such devices may be configured to capture a user’s face with a wide-angle fisheye lens and video.
[00106] Faces captured via video feeds may be mapped onto the avatars in the virtual environment.
[00107] In a higher-fidelity directional environment, audio may be broadcast predominantly in the direction an avatar is oriented.
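A simple directional-gain sketch is shown below; the cosine mapping and the minimum rear gain are illustrative assumptions rather than specified behavior.

```typescript
interface Vec2 { x: number; y: number; }

// Audio is loudest in the direction the speaking avatar faces and quieter behind it.
function directionalGain(
  speakerPos: Vec2,
  speakerFacing: Vec2, // unit vector of the avatar's orientation
  listenerPos: Vec2,
  backGain = 0.3,      // minimum gain directly behind the speaker
): number {
  const dx = listenerPos.x - speakerPos.x;
  const dy = listenerPos.y - speakerPos.y;
  const len = Math.hypot(dx, dy) || 1; // avoid division by zero when co-located
  // Cosine of the angle between the facing direction and the direction to the listener.
  const cos = (speakerFacing.x * dx + speakerFacing.y * dy) / len;
  // Map cos in [-1, 1] to a gain in [backGain, 1].
  return backGain + ((cos + 1) / 2) * (1 - backGain);
}
```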
[00108] Additional cameras may provide additional user views for eye-contact and non-eye-contact regions.
[00109] Video processing solutions may be used to also detect motion in other parts of a user’s body (e.g., arms) and may be used to move corresponding portions of an associated avatar. In other words, when a user moves their arms, their avatar may move its arms.
[00110] Shared content such as browser tabs may be placed in a portion of the display screen to facilitate collaboration or shared viewing. The location of the shared content may be in areas not occupied by the eye-contact and/or peripheral view regions (e.g., the top left and right corners of the screens in FIGS. 1A, 2, or 3).
[00111] FIG. 5 illustrates an algorithm to make audio degrade or decrease in volume in a manner that replicates real-world conditions. As shown in the plots, both the audio and video degradation proceed slowly until a “knee” point in both curves, and then more rapidly following the “knee.”
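One way to realize such a “knee” curve is sketched below; the knee and cutoff distances and the 20%/80% split are illustrative values, not parameters taken from FIG. 5.

```typescript
// Quality drops slowly until a knee distance, then much faster until a cutoff.
function kneeFalloff(
  distance: number,
  kneeDistance = 200,
  cutoffDistance = 400,
): number {
  if (distance <= 0) return 1;
  if (distance >= cutoffDistance) return 0;
  if (distance <= kneeDistance) {
    // Gentle slope before the knee: lose at most 20% of quality.
    return 1 - 0.2 * (distance / kneeDistance);
  }
  // Steep slope after the knee: lose the remaining 80% by the cutoff.
  const t = (distance - kneeDistance) / (cutoffDistance - kneeDistance);
  return 0.8 * (1 - t);
}
```

The returned fraction could feed the same presentation parameters (volume, opacity, resolution) as the linear falloff sketched earlier.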
Video Feed Wrapper
[00112] A video feed wrapper to manage the feeds may connect all users within a large peer-to-peer room but may only update the feeds associated with avatars within a specified proximity to each other with some buffer (or other condition such as always connected users). The video feed wrapper may also mask the feeds (e.g., with pixelation and lower volume) depending on distance between avatars or other conditions generally related to the virtual environment. A server may be utilized to establish peer-to-peer calls but may not be needed to maintain calls. A server may additionally be utilized to record sessions or stabilize video calls. In some embodiments, WebRTC may be used as the video chat tool, although others may be utilized within the wrapper.
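A skeletal sketch of such a wrapper is shown below; the Peer interface and the update/mask callbacks are placeholders standing in for the video chat plumbing (e.g., one peer connection per remote user), which is omitted here.

```typescript
interface Peer {
  userId: string;
  avatarDistance: number; // distance to the viewing user's avatar
}

class VideoFeedWrapper {
  constructor(
    private updateFeed: (userId: string) => void,
    private maskFeed: (userId: string, level: number) => void,
    private proximityLimit = 400,
    private buffer = 50,
  ) {}

  // All peers stay connected; only nearby feeds (plus a buffer) are updated,
  // and more distant feeds are masked (e.g., pixelated, lowered volume).
  tick(peers: Peer[]): void {
    for (const peer of peers) {
      if (peer.avatarDistance <= this.proximityLimit + this.buffer) {
        this.updateFeed(peer.userId);
        this.maskFeed(peer.userId, peer.avatarDistance / this.proximityLimit);
      } else {
        this.maskFeed(peer.userId, 1); // fully masked beyond the limit
      }
    }
  }
}
```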
Additional Features
[00113] The system may also include a group assistant to perform integration with third-party software applications, such as address books and calendars, to enable users to exchange information and schedule events. The group assistant may also control entertainment for the group. The group assistant may be used to audibly control a user’s avatar (e.g., tell it to go to a specific location, put up an away message, or only allow users from particular groups to interrupt a meeting).
[00114] The system may be configured to operate with multiple cameras or devices per user to enhance eye contact between users. Using multiple cameras would enable users to interact with other content on other screens (e.g., watching a streaming movie) but use another device to chat with other viewers of the content. Multiple screens and cameras may be used for a single user to enable two different eye contact locations or better fidelity/clarity of eye contact.
[00115] The system may also include video processing software to capture a user’s gaze direction and body movement. Eye gaze direction may be used to automatically detect where on the screen the users are paying attention to and may help them navigate the user experience with just their eyes and other gestures. For example, a disabled user may be able to move an avatar to locations on the screen by orienting his or her gaze. A user looking at another user’s avatar may cause the eye contact region to move to that user’s video chat screen. If the system detects that a user is viewing a portion of the screen where a virtual concert is occurring, the system may increase the concert audio being played into the user’s speakers or headphones. Body movements may be conveyed to avatars. For example, in an exercise class, the avatars may do jumping jacks at the same time as their corresponding users in real life. The system may capture gestures such as these using computer vision algorithms. For example, a computer vision software system may use a user’s camera to track movements of particular body parts of the user, and translate those gestures into the virtual environment, where they may be pantomimed by the user’s avatar. The computer vision algorithms may map the gestures to the avatars or alias real-world gestures for avatars (e.g., a user cannot run in view of a webcam, but could perform a motion that would make the user’s avatar run).
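As an illustrative sketch only (not the disclosed implementation), the example below aliases a detected arm-raise gesture to an avatar action, assuming body keypoints are supplied by some computer vision tracker; the keypoint names and pixel threshold are assumptions.

```typescript
interface Keypoint { name: string; x: number; y: number; }

type AvatarAction = 'jumping-jacks' | 'wave' | 'idle';

// In image coordinates y increases downward, so a smaller y means higher up.
// If both wrists rise above the shoulders, alias the gesture to jumping jacks;
// a large horizontal wrist movement between frames is treated as a wave.
function gestureToAction(prev: Keypoint[], curr: Keypoint[]): AvatarAction {
  const get = (pts: Keypoint[], name: string) => pts.find((p) => p.name === name);
  const lw = get(curr, 'left_wrist');
  const rw = get(curr, 'right_wrist');
  const ls = get(curr, 'left_shoulder');
  const rs = get(curr, 'right_shoulder');
  if (lw && rw && ls && rs && lw.y < ls.y && rw.y < rs.y) {
    return 'jumping-jacks';
  }
  const prevLw = get(prev, 'left_wrist');
  if (lw && prevLw && Math.abs(lw.x - prevLw.x) > 30) {
    return 'wave';
  }
  return 'idle';
}
```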
[00116] The system may use virtual backgrounds to green screen two or more users into the same scene. The users’ faces and their corresponding avatars’ bodies may be put in front of almost any scene, controlled with their keyboards.
[00117] The virtual environment may enable users to select items (e.g., using a mouse click or a touch on a touchscreen device) to direct the user’s virtual avatar to orient itself so that it is facing those items, or locations or users associated with the items, and may additionally move the eye contact region to focus attention on the users or virtual space associated with the items. Users may be able to select other users to form groups. The users’ avatars may include indicators showing that they share a group. For example, the users’ shadows may all change colors to those of a group they may be associated with. The group may be controlled by combined movements. For example, the group may designate a user to control the movement for all users in the group.
[00118] The system may be configured to be operable with virtual reality (VR) and augmented reality (AR) headsets.
[00119] The system may be configured to record video and audio from communication sessions and track virtual locations of avatars.
[00120] The system may be configured to provide sophisticated gaming experiences (e.g., three-dimensional games, old game emulators, and board games).
[00121] The system may support cross-language translation and closed captioning for real time translations.
[00122] The system may provide for users to make music performances using in-game sounds (e.g., using sound effects related to dropping/picking up, or hovering over items to create a drum circle dynamic).
[00123] The system may provide a function by which administrators or super users are able to create events with multiple rooms included. Supervisors may adjust locations of other users.
[00124] The system may provide e-commerce and delivery services to vendors within the platform.
[00125] If two cameras are configured for a particular user, the system may render a three-dimensional image of a person’s face and create more eye contact locations with more other users.
[00126] Users may be able to mute themselves or one another. In some cases, if a user in a group is muted, the muting may be conveyed to other users in the group. Muting may be canceled when a user selects a muted user to interact with.
[00127] The system may have a low-bandwidth embodiment in which a user may see the chat windows of one or more users of interest but may only see thumbnails of users not of interest or peripheral users. This may reduce latency costs from streaming large amounts of data.
[00128] Video screens of peripheral users or users far away may be configured to increase in size if they are activated (e.g., clicked on) by other users (e.g., during a private conversation between users). Conversely, video screens of users who may be close or central may be configured to decrease in size if activated.
[00129] Whenever the term “at least,” “greater than,” or “greater than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “at least,” “greater than” or “greater than or equal to” applies to each of the numerical values in that series of numerical values. For example, greater than or equal to 1, 2, or 3 is equivalent to greater than or equal to 1, greater than or equal to 2, or greater than or equal to 3.
[00130] Whenever the term “no more than,” “less than,” or “less than or equal to” precedes the first numerical value in a series of two or more numerical values, the term “no more than,” “less than,” or “less than or equal to” applies to each of the numerical values in that series of numerical values. For example, less than or equal to 3, 2, or 1 is equivalent to less than or equal to 3, less than or equal to 2, or less than or equal to 1.
Computer Systems
[00131] The present disclosure provides computer systems that are programmed to implement methods of the disclosure. FIG. 9 shows a computer system 901 that is programmed or otherwise configured to provide a video chat client with spatial interaction and eye contact recognition. The computer system 901 can regulate various aspects of digital communication of the present disclosure, such as, for example, presenting video chat clients, promoting realistic eye contact, and providing a virtual environment. The computer system 901 can be an electronic device of a user or a computer system that is remotely located with respect to the electronic device. The electronic device can be a mobile electronic device.
[00132] The computer system 901 includes a central processing unit (CPU, also “processor” and “computer processor” herein) 905, which can be a single core or multi core processor, or a plurality of processors for parallel processing. The computer system 901 also includes memory or memory location 910 (e.g., random-access memory, read-only memory, flash memory), electronic storage unit 915 (e.g., hard disk), communication interface 920 (e.g., network adapter) for communicating with one or more other systems, and peripheral devices 925, such as cache, other memory, data storage and/or electronic display adapters. The memory 910, storage unit 915, interface 920 and peripheral devices 925 are in communication with the CPU 905 through a communication bus (solid lines), such as a motherboard. The storage unit 915 can be a data storage unit (or data repository) for storing data. The computer system 901 can be operatively coupled to a computer network (“network”) 930 with the aid of the communication interface 920. The network 930 can be the Internet, an internet and/or extranet, or an intranet and/or extranet that is in communication with the Internet. The network 930 in some cases is a telecommunication and/or data network. The network 930 can include one or more computer servers, which can enable distributed computing, such as cloud computing. The network 930, in some cases with the aid of the computer system 901, can implement a peer-to-peer network, which may enable devices coupled to the computer system 901 to behave as a client or a server.
[00133] The CPU 905 can execute a sequence of machine-readable instructions, which can be embodied in a program or software. The instructions may be stored in a memory location, such as the memory 910. The instructions can be directed to the CPU 905, which can subsequently program or otherwise configure the CPU 905 to implement methods of the present disclosure. Examples of operations performed by the CPU 905 can include fetch, decode, execute, and writeback.
[00134] The CPU 905 can be part of a circuit, such as an integrated circuit. One or more other components of the system 901 can be included in the circuit. In some cases, the circuit is an application specific integrated circuit (ASIC).
[00135] The storage unit 915 can store files, such as drivers, libraries and saved programs.
The storage unit 915 can store user data, e.g., user preferences and user programs. The computer system 901 in some cases can include one or more additional data storage units that are external to the computer system 901, such as located on a remote server that is in communication with the computer system 901 through an intranet or the Internet.
[00136] The computer system 901 can communicate with one or more remote computer systems through the network 930. For instance, the computer system 901 can communicate with a remote computer system of a user (e.g., a mobile device). Examples of remote computer systems include personal computers (e.g., portable PC), slate or tablet PC’s (e.g., Apple® iPad, Samsung® Galaxy Tab), telephones, Smart phones (e.g., Apple® iPhone, Android-enabled device, Blackberry®), or personal digital assistants. The user can access the computer system 901 via the network 930.
[00137] Methods as described herein can be implemented by way of machine (e.g., computer processor) executable code stored on an electronic storage location of the computer system 901, such as, for example, on the memory 910 or electronic storage unit 915. The machine executable or machine readable code can be provided in the form of software. During use, the code can be executed by the processor 905. In some cases, the code can be retrieved from the storage unit 915 and stored on the memory 910 for ready access by the processor 905. In some situations, the electronic storage unit 915 can be precluded, and machine-executable instructions are stored on memory 910.
[00138] The code can be pre-compiled and configured for use with a machine having a processor adapted to execute the code or can be compiled during runtime. The code can be supplied in a programming language that can be selected to enable the code to execute in a pre-compiled or as-compiled fashion.
[00139] Aspects of the systems and methods provided herein, such as the computer system 901, can be embodied in programming. Various aspects of the technology may be thought of as “products” or “articles of manufacture” typically in the form of machine (or processor) executable code and/or associated data that is carried on or embodied in a type of machine readable medium. Machine-executable code can be stored on an electronic storage unit, such as memory (e.g., read-only memory, random-access memory, flash memory) or a hard disk. “Storage” type media can include any or all of the tangible memory of the computers, processors or the like, or associated modules thereof, such as various semiconductor memories, tape drives, disk drives and the like, which may provide non-transitory storage at any time for the software programming. All or portions of the software may at times be communicated through the Internet or various other telecommunication networks. Such communications, for example, may enable loading of the software from one computer or processor into another, for example, from a management server or host computer into the computer platform of an application server. Thus, another type of media that may bear the software elements includes optical, electrical and electromagnetic waves, such as used across physical interfaces between local devices, through wired and optical landline networks and over various air-links. The physical elements that carry such waves, such as wired or wireless links, optical links or the like, also may be considered as media bearing the software. As used herein, unless restricted to non-transitory, tangible “storage” media, terms such as computer or machine “readable medium” refer to any medium that participates in providing instructions to a processor for execution.
[00140] Hence, a machine readable medium, such as computer-executable code, may take many forms, including but not limited to, a tangible storage medium, a carrier wave medium or physical transmission medium. Non-volatile storage media include, for example, optical or magnetic disks, such as any of the storage devices in any computer(s) or the like, such as may be used to implement the databases, etc. shown in the drawings. Volatile storage media include dynamic memory, such as main memory of such a computer platform. Tangible transmission media include coaxial cables; copper wire and fiber optics, including the wires that comprise a bus within a computer system. Carrier-wave transmission media may take the form of electric or electromagnetic signals, or acoustic or light waves such as those generated during radio frequency (RF) and infrared (IR) data communications. Common forms of computer-readable media therefore include for example: a floppy disk, a flexible disk, hard disk, magnetic tape, any other magnetic medium, a CD-ROM, DVD or DVD-ROM, any other optical medium, punch cards, paper tape, any other physical storage medium with patterns of holes, a RAM, a ROM, a PROM and EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave transporting data or instructions, cables or links transporting such a carrier wave, or any other medium from which a computer may read programming code and/or data. Many of these forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to a processor for execution.
[00141] The computer system 901 can include or be in communication with an electronic display 935 that comprises a user interface (UI) 940 for providing, for example, a video chat interface. Examples of UIs include, without limitation, a graphical user interface (GUI) and web-based user interface.
[00142] Methods and systems of the present disclosure can be implemented by way of one or more algorithms. An algorithm can be implemented by way of software upon execution by the central processing unit 905. The algorithm can, for example, adjust one or more parameters of the video chat client in response to changes in the virtual environment.
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. It is not intended that the invention be limited by the specific examples provided within the specification. While the invention has been described with reference to the aforementioned specification, the descriptions and illustrations of the embodiments herein are not meant to be construed in a limiting sense. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. Furthermore, it shall be understood that all aspects of the invention are not limited to the specific depictions, configurations or relative proportions set forth herein which depend upon a variety of conditions and variables. It should be understood that various alternatives to the embodiments of the invention described herein may be employed in practicing the invention. It is therefore contemplated that the invention shall also cover any such alternatives, modifications, variations or equivalents. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims

WHAT IS CLAIMED IS:
1. A method for enhancing online video communication for a plurality of users, comprising:
(a) providing (1) a first instance of a video chat client application for a first user of said plurality of users, (2) a second instance of said video chat client application for a second user of said plurality of users, and (3) a virtual spatial environment for said first user and said second user, wherein said first user has a first avatar in said virtual spatial environment and said second user has a second avatar in said virtual spatial environment;
(b) obtaining one or more features associated with virtual spatial information relating to said first avatar and said second avatar in said virtual spatial environment; obtaining one or more features indicative of audiovisual information from said first instance of said video chat client application and from said second instance of said video chat client application; and
(c) modifying at least one output of said first instance of said video chat client application, of said second instance of said video chat client application, a state of said first avatar, a state of said second avatar, or a combination thereof, responsive to said one or more features indicative of virtual spatial information, said one or more features indicative of audiovisual information, or a combination thereof.
2. The method of claim 1, further comprising modifying a feature of said virtual spatial environment.
3. The method of claim 1, wherein (c) comprises modifying a picture quality of said video chat.
4. The method of claim 3, wherein said picture quality is a resolution, a window size, a color scheme, a brightness, a contrast, a transparency, or a clarity.
5. The method of claim 4, wherein said resolution is modified according to a distance between said first avatar and said second avatar.
6. The method of claim 5, wherein said distance is an adjustable value based on a preference of the user, a number of participants, device capabilities, and other virtual features.
7. The method of claim 5, wherein said resolution decreases linearly.
8. The method of claim 3, wherein said feature is a gaze direction from said first video chat client instance and wherein (c) comprises modifying said orientation of said second avatar.
9. The method of claim 1, wherein (c) comprises modifying an audio quality.
10. The method of claim 9, wherein modifying said audio quality comprises modifying a loudness, a volume, an accuracy, a fidelity, a pitch, or a tempo.
11. The method of claim 10, wherein modifying said volume comprises varying said volume from a highest volume to a lowest volume.
12. The method of claim 10, wherein modifying said audio quality is performed using post processing of live stream signals of said online video communication.
13. The method of claim 1, wherein (c) comprises modifying a location of a chat window.
14. The method of claim 13, wherein modifying said location of said chat window comprises making said chat window more prominent in a view of said first user and in a view of said second user.
15. The method of claim 13, wherein said making said chat window more prominent is moving said chat window to an eye contact region of a screen.
16. The method of claim 15, wherein said location is based on a location of an associated camera with said screen.
17. The method of claim 15, wherein said location depends on an orientation of said display.
18. The method of claim 15, wherein said location depends on a frequency of interaction between said first avatar and said second avatar.
19. The method of claim 15, wherein said location depends on a focus on a feed.
20. The method of claim 9, wherein said modifying said audio quality comprises using post processing of live stream signals.
21. The method of claim 1, further comprising, responsive to receiving a click on said first avatar, reversing said modification of said output of said first video chat interface or said state of said first avatar.
22. The method of claim 1, further comprising, upon receiving an action on said second avatar, initiating a private conversation between said first user and said second user.
23. The method of claim 1, wherein said virtual spatial environment provides access to third party software.
24. The method of claim 23, wherein said third party software is word processing software.
25. The method of claim 1, wherein said virtual spatial environment includes gaming objects.
26. The method of claim 25, wherein said virtual spatial environment is a game board.
27. The method of claim 1, wherein a feature of said one or more features indicative of virtual spatial information is a virtual distance between said first avatar and said second avatar.
28. The method of claim 1, wherein said first user or said second user may toggle a high-powered mode to prevent modification of a volume of said first video chat client instance or of said second video chat client instance.
PCT/US2021/037887 2020-06-18 2021-06-17 Video chat with spatial interaction and eye contact recognition WO2021257868A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US202063041023P 2020-06-18 2020-06-18
US63/041,023 2020-06-18
US202063121827P 2020-12-04 2020-12-04
US63/121,827 2020-12-04

Publications (1)

Publication Number Publication Date
WO2021257868A1 true WO2021257868A1 (en) 2021-12-23

Family

ID=79268443

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/037887 WO2021257868A1 (en) 2020-06-18 2021-06-17 Video chat with spatial interaction and eye contact recognition

Country Status (1)

Country Link
WO (1) WO2021257868A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023229748A1 (en) * 2022-05-27 2023-11-30 Microsoft Technology Licensing, Llc Automation of audio and viewing perspectives for bringing focus to relevant activity of a communication session
WO2023231598A1 (en) * 2022-05-31 2023-12-07 腾讯科技(深圳)有限公司 Call interaction method and apparatus, computer device, and storage medium
WO2024072576A1 (en) * 2022-09-26 2024-04-04 Microsoft Technology Licensing, Llc Eye contact optimization

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7346654B1 (en) * 1999-04-16 2008-03-18 Mitel Networks Corporation Virtual meeting rooms with spatial audio
US6806898B1 (en) * 2000-03-20 2004-10-19 Microsoft Corp. System and method for automatically adjusting gaze and head orientation for video conferencing
US20090193327A1 (en) * 2008-01-30 2009-07-30 Microsoft Corporation High-fidelity scalable annotations
US20140135124A1 (en) * 2008-06-03 2014-05-15 Tweedletech, Llc Multi-dimensional game comprising interactive physical and virtual components
US20120242776A1 (en) * 2008-09-12 2012-09-27 Embarq Holdings Company, Llc System and method for setting resolution utilized for video conferencing through a streaming device
US20100156781A1 (en) * 2008-12-19 2010-06-24 Samsung Electronics Co., Ltd. Eye gaze control during avatar-based communication
US20120182381A1 (en) * 2010-10-14 2012-07-19 Umberto Abate Auto Focus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21824813

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21824813

Country of ref document: EP

Kind code of ref document: A1