WO2023052187A1 - Methods and apparatuses for teleconferencing systems - Google Patents

Methods and apparatuses for teleconferencing systems

Info

Publication number
WO2023052187A1
Authority
WO
WIPO (PCT)
Prior art keywords
participant
state
data stream
modification
video data
Prior art date
Application number
PCT/EP2022/076031
Other languages
French (fr)
Inventor
Gabriella NORDQUIST
Athanasios KARAPANTELAKIS
Alexandros NIKOU
Lackis ELEFTHERIADIS
Xiajing LI
Original Assignee
Telefonaktiebolaget Lm Ericsson (Publ)
Priority date
Filing date
Publication date
Application filed by Telefonaktiebolaget Lm Ericsson (Publ) filed Critical Telefonaktiebolaget Lm Ericsson (Publ)
Publication of WO2023052187A1 publication Critical patent/WO2023052187A1/en


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/14 Systems for two-way working
    • H04N7/15 Conference systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/1066 Session management
    • H04L65/1083 In-session procedures
    • H04L65/1089 In-session procedures by adding media; by removing media
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/40 Support for services or applications
    • H04L65/403 Arrangements for multi-party communication, e.g. for conferences
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L65/00 Network arrangements, protocols or services for supporting real-time applications in data packet communication
    • H04L65/60 Network streaming of media packets
    • H04L65/75 Media network packet handling
    • H04L65/765 Media network packet handling intermediate

Definitions

  • Embodiments of the disclosure generally relate to methods and apparatuses for teleconferencing systems.
  • Several approaches have been proposed that use psychology research and deep neural networks to detect meeting participant emotions, such as the approach shown in US9576190B2. Specifically, in US9576190B2, the system is configured to convey the detected emotion to third parties and, in addition, to allow a third party to enter and review the video conference. In US9576190B2 only negative emotions are captured.
  • One of the objects of the disclosure is to provide an improved solution for teleconferencing, in particular to provide improved usability of a teleconference system.
  • a computer implemented method for a teleconferencing system comprises: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the method comprises determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the method comprises, where it is determined that a modification is to be performed, performing the determined modification.
  • the determined state may be processed by a trained neural network, wherein if the state is an unwanted state the trained neural network may determine a modification to be performed.
  • the state may be further based on at least one of: elapsed time; elapsed time relative to a predetermined length of the teleconference; background context of the participant; a change in at least one of: background context of the participant, the emotional state of the participant; content shared amongst participants; complexity of content shared amongst participants; previous content shared amongst participants; future content to be shared amongst participants.
  • the processing to determine a state may further comprise processing at least one of: an audio data stream of the at least one participant; content shared amongst participants.
  • Content shared amongst participants may be, for example, a presentation (such as comprising slides) that is shared between participants.
  • the content may be a screen (such as a desktop screen) of a presenter that is shared between participants.
  • the modification may comprise at least one of: modifying the playback speed of an audio data stream of the participant for a time period; reducing the playback speed of an audio data stream of the participant; turning off the video data stream; reducing the playback speed of the video data stream; modifying the background of frames of the video data stream.
  • the modification of the background of frames of the video data stream of a participant may comprise at least one of: modifying the background to a neutral setting; modifying hue; modifying saturation; modifying contrast; using image segmentation to modify the background.
  • An unwanted state may be a state that has been predetermined to be likely to negatively impact a participant.
  • the determined modification may further comprise determining a modification of the environment of the participant.
  • the modification of the environment of the participant may comprise at least one of: turning on noise cancelling in a device of the participant; sending at least one of a notification and an actuation to a remote device; sending at least one of a notification and an actuation to a device comprised in the teleconference system.
  • a computer implemented method for training a neural network based on reinforcement learning comprises processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant.
  • the state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the method comprises determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the method comprises, if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification.
  • the method comprises processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action.
  • the method comprises training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • the method may further comprise determining the reward relating to the action responsive to receiving input from the participant relating to the modification.
  • the reward may indicate the benefit of the modification to the participant.
  • the state may comprise information on at least one of: the emotional state of the participant; elapsed time of the teleconference; elapsed time relative to a predetermined length of the teleconference; content shared between participants; the state of the environment of the participant; a time at which the state is observed.
  • An action may comprise processing instructions indicating how the at least one of the audio data stream and the video data stream are to be modified or an indication that no modification is to be performed.
  • the action may be that the video data stream and/or audio data stream are to continue without a modification.
  • the processing to determine a state relating to the participant may further comprise processing at least one of: an audio data stream of the at least one participant; content shared amongst participants.
  • the method may comprise selecting a random action to perform for an unwanted state.
  • the neural network may select an action to perform for the unwanted state.
  • a plurality of iterations of the method may be performed up to the determining of the reward, wherein the method may further comprise storing a plurality of rewards, and the neural network may be trained using the plurality of rewards.
  • a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the methods described herein.
  • a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the methods described herein.
  • a data processing system configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the data processing system is configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the data processing system is configured to, where it is determined that a modification is to be performed, perform the determined modification.
  • a data processing system configured to train a neural network based on reinforcement learning.
  • the system being configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant.
  • the state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the data processing system is configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the data processing system is configured to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; process the video data stream of the participant to determine a new state relating to the participant.
  • the data processing system is configured to determine a reward relating to the action.
  • the data processing system is configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • the systems may be configured to perform any of the methods described herein.
  • a data processing system comprising a processor and a memory.
  • the memory contains instructions executable by said processor.
  • the system is operative to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the system is operative to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the system is operative to, where it is determined that a modification is to be performed, perform the determined modification.
  • a data processing system for training a neural network based on reinforcement learning.
  • the data processing system comprises a processor and a memory, said memory containing instructions executable by said processor.
  • the data processing system is operative to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the data processing system is operative to determine an action that pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the data processing system is operative to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification.
  • the data processing system is operative to process the video data stream of the participant to determine a new state relating to the participant.
  • the data processing system is operative to determine a reward relating to the action.
  • the data processing system is operative to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • a data processing system comprising a state detector configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the data processing system comprises an agent configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the data processing system comprises a data stream processor configured to, where it is determined that a modification is to be performed, perform the determined modification.
  • a data processing system for training a neural network.
  • the data processing system comprises a state detector configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the data processing system comprises an agent configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the data processing system comprises a data stream processor configured to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification.
  • the state detector is further configured to process the video data stream of the participant to determine a new state relating to the participant and determine a reward relating to the action.
  • the agent is further configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • FIG. 1 is a diagram illustrating a method according to an example
  • FIG. 2 is a block diagram illustrating a system according to an example
  • FIG. 3 is a diagram illustrating emotions as combinations of valence and arousal
  • FIG. 4 is a diagram illustrating an example of a knowledge graph
  • FIG. 5 is a diagram illustrating a method according to an example
  • FIG. 6 is a block diagram illustrating a system according to an example
  • FIG. 7 illustrates a user interface according to an example
  • FIG. 8 illustrates a block diagram illustrating the processes performed by a system for training a neural network according to an example
  • FIG. 9 illustrates a system according to an example
  • FIG. 10 illustrates a system according to an example.
  • solutions are being developed to analyze electromagnetic signals from the brain.
  • the signals can be detected by electroencephalography, EEG, a non-invasive method where no surgical procedure is required. It is therefore possible to collect data from the brain signals.
  • Supervised learning techniques can be used to train models to interpret the state of a human based on electromagnetic brain signals, for example, mental fatigue or a broader range of emotional states.
  • GSR galvanic skin response
  • RR respiration rate analysis
  • EOG electrooculography
  • a common feature of the aforementioned approaches is the requirement for special types of sensors (albeit non-invasive in some cases) that need to measure qualities of the subject in order to identify their emotional state.
  • Another area of interest which does not require specialized sensors is that of using visual cues to identify the emotional state of a subject, for example by analyzing facial expressions, body posture and movements. For example, there are available systems able to detect real-time stress levels and arousal, or deal with a broader range of emotions ranging from happiness to sadness, anger and disgust.
  • a computer implemented method for a teleconferencing system comprising, in a first step, processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • a third step where it is determined that a modification is to be performed, performing the determined modification.
  • FIG. 1 shows a method according to an example.
  • this Figure illustrates the first step 102 of processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, the second step 104 of determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, and a third step 106 of, where it is determined that a modification is to be performed, performing the determined modification.
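  • As a minimal illustrative sketch, the three steps 102, 104 and 106 might be composed as in the following Python fragment; the state_detector, agent and stream_processor objects and their method names are assumptions introduced purely for the example:

      def teleconference_step(video_frame, audio_chunk, state_detector, agent, stream_processor):
          # Step 102: determine a state relating to the participant.
          state = state_detector.detect(video_frame, audio_chunk)
          # Step 104: determine a modification if the state is an unwanted state.
          modification = agent.decide(state)   # may be None, i.e. "no modification"
          # Step 106: where a modification is to be performed, perform it.
          if modification is not None:
              video_frame, audio_chunk = stream_processor.apply(modification, video_frame, audio_chunk)
          return video_frame, audio_chunk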
  • the distractions may be predetermined (e.g. preset), where the method detects a predetermined distraction and acts to mitigate the distraction, for example by turning off the video for the participant causing the distraction.
  • this may result in online meetings being more adapted to a presenter’s needs.
  • there may be some drawbacks from using teleconferencing systems, including psychological and physiological side effects in the users, such as increased stress.
  • the benefits of the methods described herein may include reducing negative emotions such as stress or anxiety related to the session and facilitating breaks.
  • An unwanted state may be a state that has been predetermined to be likely to negatively impact a participant.
  • unwanted states include an audience member speaking, a person walking into the video stream of a participant (e.g., who is not the participant), a participant moving excessively, a participant moving out of the frame of the video data stream, background noise (such as a baby crying, talking or shouting from persons other than participants).
  • Further examples include the conveying of negative emotions, for example from participants. For example, shaking of a head of a participant, or frowning, may be distracting for a presenter and therefore negatively impact them.
  • the methods described herein may improve the presenter’s experience by reducing the impact of unwanted events, such as the presenter drinking water or attending to other business during a virtual event, without generating any disturbance for the meeting audience.
  • Unwanted events may comprise, for example, a child or adult walking into a room, an animal causing a distraction, an anxiety or panic attack, disorders that cause movement or sounds, etc.
  • the unwanted states may be states that are unwanted to any participant, to a presenter, or to an audience.
  • the unwanted state may be a state that would negatively impact a presenter, for example, by causing them emotional distress or distracting them.
  • a teleconferencing system may comprise a plurality of remote devices that are connected through a network, such as a wireless network, radio network, telecommunications network, via an internet connection, etc.
  • a remote device may be, or may be comprised in, a client device. Each participant may have access to a client device.
  • the client device may comprise at least one of a camera, a microphone, a system for sharing content (e.g. a presentation), a network interface for communicating video, audio, and/or content (e.g. cellular connectivity), and a display for displaying the teleconference (so that the user can see other participants).
  • the teleconferencing system may comprise client devices configured to display the teleconference, where the teleconference may be displayed differently at each client device.
  • a client device may be, for example, a computer, laptop, mobile phone.
  • the teleconferencing system may comprise at least one camera and at least one microphone which are configured to record video data streams and audio data streams respectively.
  • a participant may be recorded via a camera and/or a microphone.
  • the data streams resulting from these recordings may be played to other participants.
  • the video data streams may be displayed using an application, such as a web application, mobile application, desktop application.
  • Content of a display device may be shared as shared content in the teleconferencing system, such as the desktop of a user, slides of a presentation etc.
  • the determined state may be processed by a trained neural network, wherein if the state is an unwanted state the trained neural network determines a modification to be performed.
  • a trained neural network may have been trained in order to determine the optimal action relating to a modification to take for a particular state of a user. This may be choosing to take no action, make no modification, where the state is not an unwanted state (e.g. a typical state where no predetermined distractions are occurring) or choosing the optimal modification where the state is an unwanted state. Therefore, the most appropriate modification may be selected for a given state.
  • one of the participants of a teleconference is a “presenter” (for example is primarily speaking or presenting information to other participants), and other of the participants are the “audience” (for example generally do not speak or present information). It may be assumed in some examples that there are no interactions from the audience during the presentation.
  • a “presenter” for example is primarily speaking or presenting information to other participants
  • the “audience” for example generally do not speak or present information
  • Fig. 2 illustrates a system 200 configured to perform the method for a teleconferencing system according to an example.
  • the system comprises a state detector 214, an agent 216, and a data stream processor 218.
  • the system may also comprise, or be connected to, a data stream provider 220.
  • the data stream provider 220 may provide at least one of an audio data stream, video data stream, shared content of at least one participant to the state detector.
  • the data stream provider may comprise data streams of a plurality of participants, such as audience members and a presenter.
  • the “State Detector” (SD) may be configured to use reinforcement learning to learn the state relating to a participant, such as the presenter, and predict future states.
  • the “Data stream Processor” (DP) may decide what actions should be carried out (e.g. modifications to perform) in order to improve and facilitate the presentation by actuating control on the audio and video data streams.
  • the DP may use reinforcement learning in order to predict the best actions.
  • the state detector 214 is configured to receive data 222 from a data stream provider 220 (e.g., the presenter’s mobile application).
  • This data may include any of a video data stream and an audio data stream as well as content that is being shared between participants (for example, a presentation).
  • the SD processes these data streams and outputs a user (participant) state description 224 which contains a set of information regarding the user’s current state.
  • a state relating to the participant may be determined. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the state relating to the participant may comprise information on elapsed time (of the teleconference).
  • the state may comprise information on elapsed time relative to a predetermined length of the teleconference. For example, with an increase in the elapsed time, participants may begin to experience fatigue or lose focus.
  • the state may comprise information on background context of the participant.
  • the background context may indicate a potential distraction in the frame of a participant.
  • the state may comprise information on a change in background context of the participant.
  • the background context of the participant may indicate that a person has entered the room in which the participant is participating in the teleconference, and is likely to cause an interruption.
  • the state may comprise information on a change in the emotional state of the participant.
  • the state may comprise information on content shared amongst participants.
  • the state may comprise information on complexity of content shared amongst participants.
  • the state may comprise information on previous content shared amongst participants.
  • the state may comprise information on future content to be shared amongst participants.
  • a state may comprise the information that is indicative of the current user status, and may be used to inform the agent of the current status of the user.
  • this user state description is the output of SD, which uses raw data stream information from the presenter and the audience to create a semantic description of the state.
  • the state of a presenter at a given iteration (episode) i can be described as:
  • S_i = {p_1, ..., p_k, a_1, ..., a_x}, where p_1 to p_k are observations concerning the presenter, and a_1 to a_x are observations concerning the audience members of the meeting. Then,
  • for all p_m in S_i: p_m = {e_m, s_m, c_m, b_m, t_m}, where:
  • e_m is the emotional state of the user as a product of sentiment analysis at the SD of video and audio data streams.
  • the video data stream of a participant may be processed using sentiment analysis.
  • the video data stream may be processed by a convolutional neural network classifier, trained to detect valence and arousal emotions from labeled video data frames.
  • a user state description may be output based on the video data stream.
  • Fig. 3 illustrates an example of various emotions determined as combinations of valence and arousal.
  • the valence may range from unpleasant (-1) to pleasant (1)
  • the arousal may range from deactivated (-1) to activated (1).
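  • A minimal sketch of such a classifier is given below, assuming PyTorch; the toy architecture and the use of a tanh output to bound valence and arousal to [-1, 1] are illustrative assumptions, not the network described in the disclosure:

      import torch
      import torch.nn as nn

      class ValenceArousalNet(nn.Module):
          """Toy CNN mapping a video frame to (valence, arousal), each in [-1, 1]."""
          def __init__(self):
              super().__init__()
              self.features = nn.Sequential(
                  nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
                  nn.AdaptiveAvgPool2d(1),
              )
              self.head = nn.Linear(32, 2)  # two outputs: valence and arousal

          def forward(self, frame):         # frame: (batch, 3, H, W), values in [0, 1]
              x = self.features(frame).flatten(1)
              return torch.tanh(self.head(x))  # tanh bounds both outputs to [-1, 1]

      # Example: score one (dummy) frame.
      model = ValenceArousalNet()
      valence, arousal = model(torch.rand(1, 3, 64, 64))[0]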
  • the detection of an unpleasant emotional state in a participant may result in a modification, where the specific emotional state detected may be usable to determine an appropriate action to take. For example, where the detected emotion is “stressed”, it may be desirable to turn off the video feed of the participant who is expressing stress in order that the presenter is not distracted.
  • the timeline of the teleconference is defined as s_m, which may indicate a timeline of the presentation.
  • a timeline can be expressed, for example, as a percentage of the time elapsed if the time of the meeting is known in advance, in which case s_m can be normalized (s_m ∈ [0,1], where 0 indicates the start of the meeting and 1 the end of the meeting).
  • s_m may indicate the time elapsed, e.g., the number of seconds elapsed.
  • s_m can indicate the slide number (which can be used in conjunction with other metrics such as time elapsed, in which case s_m is a 2-tuple, or in isolation).
  • a further optional parameter that may be used is c_m; in this example, the content being shared is a set of slides.
  • the optional parameter indicates the complexity of the content being presented and about to be presented (i.e., in the subsequent x number of slides).
  • a higher c_m may indicate that a break may be needed for both the presenter and the audience.
  • the optional parameter may indicate that slower playback is required, in order for people to have enough time for both presenting clearly and understanding the presentation.
  • the complexity of the content may be dependent on a number of factors.
  • the volume of content (for example, in slides), which may include the total number of words n_words, pictures n_pics, and the number of animations n_anim in the slides of the presentation (N_words, N_pics, N_anim).
  • c_content is normalized to range between [0,1] as:
  • c_content_norm = (c_content - c_content_min) / (c_content_max - c_content_min).
  • readability metrics e.g. Flesch-Kincaid score, Dale-Chall readability
  • the range of c_readability will be between [0, 1] where 0 is text that is easily understood and 1 is text that is very difficult to understand.
  • c_m_slides = [c_m_1, c_m_2, ..., c_m_n], where c_m_n is the value of c_m on slide number n.
  • the number x of slides to take into account may depend on the implementation (e.g., it could be 1 - i.e., current slide or 3, i.e., current and next 2 slides, etc.).
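  • The following sketch illustrates one possible way to compute per-slide complexity values c_m from word, picture and animation counts together with a readability score; the exact combination and the helper names are assumptions, since the disclosure only states which factors contribute:

      def content_complexity(slides, readability):
          """Toy per-slide complexity values c_m in [0, 1].

          slides: list of dicts with 'words', 'pics' and 'anims' counts per slide.
          readability: per-slide readability already scaled to [0, 1]
                       (0 = easily understood, 1 = very difficult), e.g. from Flesch-Kincaid.
          """
          totals = {k: max(sum(s[k] for s in slides), 1) for k in ("words", "pics", "anims")}
          # Raw volume score per slide: its share of the presentation's words/pictures/animations.
          raw = [sum(s[k] / totals[k] for k in totals) for s in slides]
          lo, hi = min(raw), max(raw)
          # Min-max normalisation of the volume score (c_content) to [0, 1].
          c_content = [(r - lo) / (hi - lo) if hi > lo else 0.0 for r in raw]
          # Combine normalised volume and readability into c_m for each slide.
          return [(c + r) / 2 for c, r in zip(c_content, readability)]

      # c_m_slides = content_complexity([{"words": 120, "pics": 2, "anims": 0}], [0.4])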
  • the background context, or environment, of the user may also indicate events that could distract, or negatively impact, the user.
  • the background context may be indicated by b_m, which is an assessment of the background context of the user. The assessment may be based on analysis of both the audio stream and the video stream, or on analysis of just the video stream or the audio stream.
  • FFT fast Fourier transform
  • audio fingerprinting can be used to identify the background audio streams semantically.
  • a graph knowledge base can be used to associate the identified background audio streams with a probability for user distraction.
  • FIG. 4 shows an ontology of entities and their relations.
  • This graph shows an example of a knowledge graph that correlates fingerprinted background audio with a probability of distraction (POD), a scalar that shows how probable it is that the user will be distracted from the background audio.
  • POD probability of distraction
  • a baby crying, yelling, and talking are determined as having PODs of 0.7, 0.6, and 0.4 respectively, and are also classed as human audio, which is further classed as sound.
  • a phone ringing, a doorbell ringing, an alarm clock ringing have POD of 0.4, 0.3 and 0.2 respectively, and are classed as electronic devices, which is also classed as sound. Sound may be classed as a “thing” (where a “thing” is a superclass that all classes inherit from).
  • a similar graph to that shown in Fig. 4 can be used for visual observations, or the graph may additionally incorporate information from visual observation. For example, if a yelling person or crying baby is in the frame, then the POD will be increased.
  • the background context b_m may therefore indicate the sum of PODs from audio and/or visual analysis of the audio data stream and/or the video data stream.
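  • A minimal sketch of such a lookup is shown below; the sound-class names and POD values are taken from the example of Fig. 4, while the dictionary representation of the knowledge graph and the optional visual boost are illustrative assumptions:

      # Toy knowledge graph: fingerprinted sound class -> (superclass, POD), values from Fig. 4.
      POD_GRAPH = {
          "baby_crying":   ("human_audio", 0.7),
          "yelling":       ("human_audio", 0.6),
          "talking":       ("human_audio", 0.4),
          "phone_ringing": ("electronic_device", 0.4),
          "doorbell":      ("electronic_device", 0.3),
          "alarm_clock":   ("electronic_device", 0.2),
      }

      def background_context(detected_sounds, visual_boost=0.0):
          """b_m as the sum of PODs of fingerprinted background sounds, optionally
          increased when the source (e.g. a crying baby) is also visible in the frame."""
          pod_sum = sum(POD_GRAPH[s][1] for s in detected_sounds if s in POD_GRAPH)
          return pod_sum + visual_boost

      # background_context(["baby_crying", "doorbell"], visual_boost=0.1)  # -> 1.1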
  • a state descriptor may further comprise a time at which a state occurs.
  • t_m may be the timestamp of the observation, indicating the duration through which e_m, c_m, s_m, and b_m were observed.
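  • Collecting the constituents described above together with the audience scalars, a presenter observation and the overall state might be represented as simple data structures, for example as in the following sketch (field names and types are illustrative assumptions):

      from dataclasses import dataclass
      from typing import List, Tuple

      @dataclass
      class PresenterObservation:
          """One presenter observation p_m = {e_m, s_m, c_m, b_m, t_m}."""
          e_m: Tuple[float, float]  # emotional state, e.g. (valence, arousal) in [-1, 1]
          s_m: float                # timeline, e.g. fraction of the meeting elapsed
          c_m: List[float]          # complexity of the current and upcoming slides
          b_m: float                # background context, e.g. sum of PODs
          t_m: float                # timestamp / duration of the observation

      @dataclass
      class State:
          """S_i = {p_1, ..., p_k, a_1, ..., a_x}."""
          presenter: List[PresenterObservation]
          audience: List[float]     # one movement scalar per audience member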
  • a user state description may include multiple observations (for example, those prefixed by “p” as described above), where a change in state over time may be used to determine if an unwanted state is occurring. (As will be described later, the neural network may use the time at which a state occurs to learn trends in how the different constituents change over time, in order to be able to predict those constituents into the future).
  • the analysis of SD for these observations may only include a visual assessment as audience members are typically muted while another participant is presenting.
  • the visual assessment in this case may indicate a degree of frame change for every audience member - this could be performed as an image processing task where the pixels of one frame are compared with the pixels of the previous frame (e.g., in all three RGB channels in case of a colour video feed) and if the difference in pixels exceeds a certain threshold, this could be interpreted as one or more of the audience members moving excessively.
  • the calculation may augment all video streams from all audience members to a single scalar represented by a_1, ..., a_x as described above.
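  • A sketch of such a pixel-difference check is given below, assuming NumPy and RGB frames; the threshold value and the use of the changed-pixel fraction as the audience scalar are assumptions for the example:

      import numpy as np

      def audience_movement(prev_frame, curr_frame, pixel_threshold=25):
          """Fraction of pixels whose value changed noticeably between two frames.

          prev_frame / curr_frame: uint8 arrays of shape (H, W, 3).
          A value near 1.0 suggests an audience member is moving excessively.
          """
          diff = np.abs(curr_frame.astype(np.int16) - prev_frame.astype(np.int16))
          changed = (diff > pixel_threshold).any(axis=2)  # changed in any RGB channel
          return float(changed.mean())

      # a_i = audience_movement(prev, curr)  # one scalar per audience member's video feed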
  • the agent 216 may be an intelligent agent, which is configured to determine an optimal action to take on the data stream (comprising at least one of the audio data stream and the video data stream) based on the user state description 224.
  • the agent may output an action 226 based on an input user state description 224.
  • the agent may comprise a trained neural network 228 which is configured to determine an action indicative of a modification to perform on a data stream based on the detection of an unwanted state.
  • the trained neural network may determine whether the detected state of the user is an unwanted state (for example, a state that has been predetermined as likely to cause a distraction to a participant, be it the participant themself or another participant such as the presenter, or a state that has been predetermined to be likely to negatively impact a participant) and determine an appropriate modification to perform if the state is an unwanted state.
  • an unwanted state for example, a state that has been predetermined as likely to cause a distraction to a participant, be it the participant themself or another participant such as the presenter, or a state that has been predetermined to be likely to negatively impact a participant
  • An action may be defined as processing instructions which are sent to the DP, where the DP is configured to modify incoming audio and video streams from the presenter and/or the audience members.
  • the action may be the delaying of video and/or audio, and/or the turning off of the video or audio, or employing noise cancelling, as described above.
  • the action space may be defined, for example, as the set: {prolong{10s, 20s, 30s, 40s}, presenter_video_state, audience_video_state, noise_cancellation}.
  • the first member, prolong{10s, 20s, 30s, 40s}, of the set indicates by how much playback of the audio data should be delayed.
  • playback of the audio stream slows down.
  • the video data stream may be turned off at this point, or the playback of the video data stream may slow down as well (in case presenter_video_state is set to 1).
  • This example comprises a 4-tuple, wherein if the first element of the 4-tuple is set to 1 then audiovisual playback is prolonged for 10 seconds, if second element is set to 1 then playback is prolonged for 20 seconds and so on.
  • the values described herein are exemplary and any appropriate values could be used to prolong the playback and/or increase granularity of values.
  • the second member “presenter_video_state” can be set to 0, in which case the presenter video is transmitted to audience members, or 1, in which case the video is turned off.
  • the third member “audience_video_state” can be set to 0, in which case all video from audience members is not forwarded to presenter’s meeting application for rendering, or 1, in which case the video is forwarded.
  • the fourth member “noise_cancellation” is set to 0 if noise cancellation for the user needs to be inactivated (and vice versa).
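  • Putting the four members together, the action might be represented as in the following sketch; the field names, the one-hot encoding of the prolong member and the helper function are illustrative assumptions consistent with the description above:

      from dataclasses import dataclass
      from typing import Tuple

      PROLONG_SECONDS = (10, 20, 30, 40)

      @dataclass
      class Action:
          prolong: Tuple[int, int, int, int]  # one-hot over (10s, 20s, 30s, 40s); all zeros = no delay
          presenter_video_state: int          # 0 = transmit presenter video, 1 = turn it off
          audience_video_state: int           # 0 = do not forward audience video to the presenter, 1 = forward it
          noise_cancellation: int             # 0 = inactivate noise cancellation, 1 = activate it

      def prolong_duration(action):
          """Seconds by which audio playback should be delayed (0 if no element is set)."""
          return sum(sec for sec, flag in zip(PROLONG_SECONDS, action.prolong) if flag)

      # a = Action(prolong=(0, 1, 0, 0), presenter_video_state=1,
      #            audience_video_state=0, noise_cancellation=1)
      # prolong_duration(a)  # -> 20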
  • the modification indicated by the action may be a modification of the speed of the data stream for a given amount of time.
  • the speed of the audio data stream is reduced while video is turned off. Sharing of content may continue as normal.
  • the modification may be modification of background colors of audience members to a neutral setting (i.e., by adjusting hue/saturation/contrast and/or using image segmentation techniques) or turning off video stream of audience members (for example, if it is determined that reactions of audience members may cause anxiety and/or agitation in the presenter).
  • the modification may comprise modifying the playback speed of an audio data stream of the participant for a time period.
  • the modification may comprise reducing the playback speed of an audio data stream of the participant.
  • the modification may comprise turning off the video data stream.
  • the modification may comprise reducing the playback speed of the video data stream.
  • the modification may comprise modifying the background of frames of the video data stream.
  • the modification of the background of frames of the video data stream of a participant may comprise modifying the background to a neutral setting.
  • the modification of the background of frames of the video data stream of a participant may comprise modifying hue.
  • the modification of the background of frames of the video data stream of a participant may comprise modifying saturation.
  • the modification of the background of frames of the video data stream of a participant may comprise modifying contrast.
  • the modification of the background of frames of the video data stream of a participant may comprise using image segmentation to modify the background.
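  • A minimal sketch of background neutralization is shown below, assuming NumPy, an externally supplied segmentation mask and a simple grey replacement; a real system might instead adjust hue, saturation or contrast as described above:

      import numpy as np

      def neutralise_background(frame, person_mask, neutral_value=128):
          """Replace background pixels with a neutral grey, keeping the person untouched.

          frame: uint8 array (H, W, 3); person_mask: boolean array (H, W), True where
          image segmentation has detected the participant.
          """
          out = frame.copy()
          out[~person_mask] = neutral_value  # flat, neutral background
          return out

      def reduce_saturation(frame, amount=0.5):
          """Blend each pixel toward its grey value to lower saturation by `amount`."""
          grey = frame.mean(axis=2, keepdims=True)
          return (frame * (1 - amount) + grey * amount).astype(np.uint8)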
  • the modification provides the presenter with a time period in which they are no longer required to present while the other participants still receive a continuous presentation.
  • the method may cause the video data stream of the participant to be turned off while the playback speed of the audio data stream is reduced (while the presenter continues to present). After a time period, the presenter will have presented sufficient material that they can stop presenting while the slowed audio data stream is still presented to the other participants. This may provide the presenter with an opportunity to deal with the distraction.
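  • The time gained by slowing playback could, for example, be obtained by stretching buffered audio as in the following sketch; linear interpolation is used here purely for illustration, and a deployed system would likely use a pitch-preserving time-stretch:

      import numpy as np

      def slow_down(audio, factor=1.5):
          """Stretch an audio buffer so that it plays back `factor` times slower.

          audio: 1-D array of samples.
          """
          n_out = int(len(audio) * factor)
          old_idx = np.linspace(0, len(audio) - 1, num=len(audio))
          new_idx = np.linspace(0, len(audio) - 1, num=n_out)
          return np.interp(new_idx, old_idx, audio)

      # Buffering the stretched audio gives the presenter roughly
      # (factor - 1) * segment_duration extra seconds before live playback catches up.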
  • the determined modification further comprises determining a modification of the environment of the participant.
  • the modification of the environment of the participant may comprise at least one of: turning on noise cancelling in a device of the participant; sending at least one of a notification and an actuation to a remote device; sending at least one of a notification and an actuation to a device comprised in the teleconference system.
  • the modification may be an interaction with a wearable device of the user, for example in order to turn off or on noise cancellation on the user’s headset or initiate a notification on their smartwatch.
  • the modification may comprise sending a notification/actuation to a remote device that is remote from the teleconference system.
  • the modification may comprise a notification sent to a smart watch, or mobile phone, of a participant.
  • the modification may comprise actuation of lights, windows, curtains and so on. This may alert the participant that they are causing a distraction.
  • a notification may be sent to a device comprised in the teleconference system such as a laptop which is being used by a participant as a part of the teleconference system.
  • the data stream processor 218 may implement the modification chosen by the intelligent agent.
  • the DP may intercept any of the video, audio and content sharing data stream and modify the content of any one of these based on the determined modification, and output at least one processed data stream 230 that has been processed based on the action or modification determined by the agent.
  • the processed data streams 230 may be transmitted to user interfaces 232 (for example, at least to the user interface used by the presenter).
  • the trained neural network may be developed during a training phase prior to use, in which the agent learns to choose the best modification given a user state description by training its neural network using user feedback formalized as a reward. Then, the trained neural network may be used as a part of the agent, wherein the agent executes its neural network given a user state description to determine an action to take (a modification to perform).
  • the trained neural network may be trained using a computer implemented method for training a neural network based on reinforcement learning, the method comprising: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification; processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action; and training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • this method comprises a first step 508 of processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, a second step 510 of determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, and a third step 512 of, if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification; processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action; and training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • a training phase may first occur prior to use of the trained neural network.
  • the training may be reinforcement learning.
  • the agent may operate in a deep reinforcement learning (RL) loop with its environment.
  • RL deep reinforcement learning
  • MDP Markov Decision Process
  • an intelligent agent interacts with the environment.
  • Federated learning can be used to generalize training across multiple distributed workers (i.e., users). Such generalization may help the model cope with the variance it would experience when deployed in reality, and may mitigate bias.
  • Fig. 6 illustrates the block components of a system for training a neural network based on reinforcement learning.
  • the system comprises a state detector 614, an agent 616, and a data stream processor 618.
  • the system may also comprise, or be connected to, a data provider 620.
  • the system of Fig. 6 may be configured in substantially the same way as the system of Fig. 2 (the corresponding description is not repeated here for brevity, however, it will be appreciated that the description corresponding to Fig. 2 applies herewith, where reference numerals of Fig. 6 correspond to like reference numerals of Fig. 2), however, in this example, the agent is further configured to train the neural network, where Fig. 2 illustrates a system with the trained neural network.
  • the state may be detected/determined as described in Fig. 2, but in this case the state may also be used for training of the neural network.
  • the method of training the neural network may also comprise determining an action to take based on the state.
  • the method may further comprise determining the reward relating to the action responsive to receiving input from the participant relating to the modification (e.g. via a user interface).
  • the system may further comprise a presenter feedback interface 634 configured to receive user input (for example, regarding the modification), i.e., user feedback on the action 646.
  • the reward indicates the benefit of the modification to the participant.
  • the user may indicate the benefit of the modification.
  • the agent 616 may further comprise an agent experience storage 636 configured to store information regarding the state of the user along with associated actions and rewards, and an action selection policy module 638, where the action selection policy module is configured to select either a random (exploration) action for a given user state description, or determine an action to take using a neural network 628, where the neural network is trained using the stored agent experience.
  • the determined action 626 is then sent to the data stream processor 618 as is described in relation to Fig. 2.
  • the data stream processor 618 modifies the data stream, and the state detector determines a new state occurring after the modification has been performed.
  • the agent may observe a reward and the new state.
  • the reward may be scalar and quantify the effectiveness of the selected action.
  • a 4-tuple <old_state, action, reward, new_state> may be stored as an experience in the agent’s buffer (the agent experience storage 636).
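  • Such a buffer might look like the following sketch (the capacity and class name are assumptions):

      import random
      from collections import deque

      class ReplayBuffer:
          """Stores <old_state, action, reward, new_state> experiences for the agent."""
          def __init__(self, capacity=10000):
              self.buffer = deque(maxlen=capacity)

          def store(self, old_state, action, reward, new_state):
              self.buffer.append((old_state, action, reward, new_state))

          def sample(self, batch_size):
              # Random minibatch used when training the deep neural network.
              return random.sample(self.buffer, min(batch_size, len(self.buffer)))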
  • the state may comprise information on at least one of: the emotional state of the participant; elapsed time of the teleconference; elapsed time relative to a predetermined length of the teleconference; content shared between participants; the state of the environment of the participant; a time at which the state is observed.
  • An action may comprise processing instructions indicating how the at least one of the audio data stream and the video data stream are to be modified or an indication that no modification is to be performed.
  • the agent may train its deep neural network using experiences from the buffer.
  • the deep neural network here may try to approximate the optimal policy. In order to do so, an agent learns to select the best action over a series of repetitions-iterations known as “episodes”. In every episode, the agent observes the current state and takes an action using a selection policy (an action may be “no modification”, for example, where the state is not an unwanted state). For example, an ε-greedy policy favors exploration (i.e., choice of a random action) over exploitation (i.e., execution of the neural network), which may change in later iterations to favoring exploitation over exploration.
  • optionally, in a first number of iterations, a random action to perform for a state may be selected, whereas optionally in a second number of iterations, an action to perform for a state may be selected by the neural network.
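  • An ε-greedy selection of this kind might be sketched as follows, where q_network is assumed to be a callable that scores state-action pairs, and the annealing schedule in the comment is an example rather than the disclosure’s schedule:

      import random

      def select_action(state, q_network, action_space, epsilon):
          """epsilon-greedy selection: explore with probability epsilon, otherwise exploit."""
          if random.random() < epsilon:
              return random.choice(action_space)                       # exploration
          return max(action_space, key=lambda a: q_network(state, a))  # exploitation

      # epsilon is typically annealed over episodes, e.g. epsilon = max(0.05, 1.0 - episode / 500),
      # so early iterations favour exploration and later iterations favour exploitation.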
  • the agent 616 may learn the optimal policy, i.e., the action that yields the highest short- and long-term reward for any user state (for example, the action chosen in one state will also affect the future state of the environment, in addition to the immediate reward being returned. Therefore, an action also affects future rewards that will be returned, because future states are a derivative of the chosen action.
  • the future rewards in RL are typically discounted using a discount factor to give greater importance to the immediate reward).
  • the neural network may additionally use the time at which a state occurs to learn trends in how the different constituents change over time, in order to be able to predict those constituents into the future.
  • Any appropriate algorithm may be used for training.
  • value-learning algorithms such as Deep-Q Learning approaches use deep neural networks to learn the best action for every state (i.e., the action that produces the highest value).
  • Policy learning algorithms such as actor-critic approaches learn a policy that maximizes the reward. Regardless of the approach, in order to formalize the MDP mathematically, the state, reward and action may be defined.
  • Training may start from random weights (where the agent does not know anything about the environment, a priori), or alternatively there may be a baseline (pretrained) model.
  • the baseline model can be trained using supervised learning, and on a dataset collected in a controlled environment, reflecting the “majority view”, i.e., the type of actions and states presenters may typically find themselves in.
  • the use of a baseline model may reduce training time. Further training using a reinforcement learning algorithm may help the neural network learn corner cases and adapt to the emotional state and environment of specific presenters.
  • the reward may be determined using a feedback interface during training, for example an interface that is part of the teleconferencing system such as a computer of a user.
  • a feedback interface for example an interface that is part of the teleconferencing system such as a computer of a user.
  • an action chosen by the agent is presented as an option to the user. If the user confirms that the action is to be performed, then the Data stream Processor performs the modification corresponding to the action.
  • the agent is rewarded negatively or positively. For example, if the user accepts the modification, the agent may be positively rewarded. However, if the user does not accept the modification, the agent may be negatively rewarded.
  • An example interface is illustrated in Fig. 7.
  • FIG. 7 illustrates an exemplary user interface showing the video stream of the presenter 742 in addition to a graphical user interface with a suggested action 744 of the agent.
  • the options given in this example for a suggested action “We have detected that your baby is crying. We can slow down your speech and give you 20 seconds. Is that OK?” are “reject”, “accept”, or “I need more time” (which provides alternative time delays).
  • the user is given a time window in which they can confirm the action. If the user misses the time window, then the reward is not transmitted and therefore the system is required to wait for the next suggestion. If the user rejects the action, then the reward is zero. If the user confirms the action, then the reward is 1.
  • a time delay is suggested, and in addition to the option to accept or reject the suggested time delay, an option for a different time delay is also provided.
  • the reward may therefore be between 0 and 1 if the system correctly detects that more time is needed, but the user indicates that they require more time than the time that is suggested (e.g., where 20 seconds is suggested as is shown in Fig. 7, but a user selects a different time delay of 40s from the drop-down option).
  • the reward may be determined as follows:
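  • As a hedged sketch based on the description of Fig. 7 and the cases above (the exact formula is not reproduced in the text, and the partial-credit ratio for a longer selected delay is an assumption), the reward might be computed as:

      def reward(accepted, rejected, suggested_seconds=20, selected_seconds=None):
          """Reward sketch for the interface of Fig. 7.

          - user rejects the suggestion            -> 0
          - user misses the time window            -> None (no reward transmitted)
          - user confirms the suggested delay      -> 1
          - user selects a different, longer delay -> partial credit in (0, 1)
          """
          if rejected:
              return 0.0
          if not accepted:
              return None
          if selected_seconds is None or selected_seconds == suggested_seconds:
              return 1.0
          return suggested_seconds / selected_seconds  # e.g. 20s suggested, 40s chosen -> 0.5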
  • This reward only takes into account the fact that the user can set the time they need.
  • a more complex interface may alternatively be used where the user/presenter can also indicate if they agree to a modification such as stopping/starting their video or video of the audience and/or turning on/off the noise cancellation.
  • the responses to these suggestions would be captured as indicating that the modification is correct or incorrect by the binary accept/select action, but it is possible to have a more fine-grained approach, where the time window in which a user can make a choice may also be increased with granularity.
  • the feedback interface from the user is not needed, as the agent makes automatic decisions on the user’s behalf.
  • the user state description is fed directly to the agent by the SD, which inputs the user state description to its trained neural network (e.g. a deep recurrent Q network, DRQN), which in turn outputs an action, which is interpreted by the DP.
  • the DP then processes incoming data streams to the presenter’s rendering application.
  • Fig. 8 illustrates a block diagram illustrating the processes performed by a system for training a neural network according to an example.
  • user feedback on an action 846 is sent from a presenter feedback interface 834 to a state detector 814.
  • Data streams (comprising, for example, audio and video data streams and content shared between participants) are sent from data stream providers 820 to the state detector 814.
  • An agent 816 initializes an experience replay memory 863.
  • the state detector 814 sends a selected action 864 (selected using a selection policy) to the agent 816.
  • the state detector also sends an observed reward and a new state 866 to the agent.
  • the agent stores the action, observed reward, new state 868 (<s(j+1), s(j), a, r(j)>) (<current state as a result of the action, old state, action taken in the old state, reward for the action>) in the memory.
  • new state 868 <s(j+1), s(j), a, r(j)>
  • the state detector sends the data stream to the data stream processor 818, and the data stream processor sends the processed data streams 830 to the data stream playback for the presenter 832.
  • the method may be executed as outlined below:
  • State_Detector -> Agent: Select action a using selection policy (e.g., ε-greedy)
  • Agent -> Agent: Select random minibatch of experiences <s(j+1), s(j), a, r(j)> from B
  • Agent -> Agent: Perform gradient descent step on (y(j) - Q(s(j), a(j); dqn))^2; end
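  • A minimal PyTorch sketch of the gradient-descent step in the listing above is given below, assuming that the state has already been encoded as a fixed-size tensor and that actions are indexed into a discrete set; the use of the same network for the target value follows the listing, although a separate target network is common in practice:

      import torch
      import torch.nn as nn

      def dqn_update(dqn, optimiser, minibatch, gamma=0.99):
          """One gradient-descent step on (y(j) - Q(s(j), a(j); dqn))^2.

          dqn: module mapping a batch of encoded states to per-action Q-values.
          minibatch: list of (new_state, old_state, action_index, reward) experiences,
          mirroring the <s(j+1), s(j), a, r(j)> tuples stored in B.
          """
          new_states, old_states, actions, rewards = zip(*minibatch)
          new_states = torch.stack(new_states)
          old_states = torch.stack(old_states)
          actions = torch.tensor(actions)
          rewards = torch.tensor(rewards, dtype=torch.float32)

          with torch.no_grad():
              # Target y(j) = r(j) + gamma * max_a' Q(s(j+1), a'; dqn).
              # (A separate, periodically updated target network is common in practice.)
              y = rewards + gamma * dqn(new_states).max(dim=1).values

          q = dqn(old_states).gather(1, actions.unsqueeze(1)).squeeze(1)
          loss = nn.functional.mse_loss(q, y)

          optimiser.zero_grad()
          loss.backward()
          optimiser.step()
          return loss.item()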
  • the implementation of the method during a teleconference may be triggered on demand, e.g., when user starts the presentation, or when a user indicates they would like the method to be performed. Alternatively, the method may be performed automatically.
  • the system may not be in operation for the full duration of the meeting, i.e., it can stop before the interactive question-and-answer session that usually follows a presentation.
  • the training of the neural network may occur on demand, for example, by an indication by a user. Training may occur each time a new user uses the teleconference system.
  • the (data processing) system 900 comprises processing circuitry (or logic) 948.
  • the processing circuitry 948 controls the operation of the system 900 and can implement the method described herein in respect of the system 900.
  • the processing circuitry 948 can be configured or programmed to control the system 900 in the manner described herein.
  • the processing circuitry 948 can comprise one or more hardware components, such as one or more processors, one or more processing units, one or more multi-core processors and/or one or more modules.
  • each of the one or more hardware components can be configured to perform, or is for performing, individual or multiple steps of the method described herein in respect of the system 900.
  • the processing circuitry 948 can be configured to run software to perform the method described herein in respect of the system 900.
  • the software may be containerised according to some embodiments.
  • the processing circuitry 948 may be configured to run a container to perform the method described herein in respect of the system 900.
  • the processing circuitry 948 of the system 900 may be configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant.
  • the processing circuitry 948 is further configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state.
  • the processing circuitry 948 is further configured to where it is determined that a modification is to be performed, perform the determined modification.
  • the processing circuitry 948 of the system 900 may be configured to train a neural network based on reinforcement learning.
  • the processing circuitry 948 may be further configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification, processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action, and training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • the system 900 may optionally comprise a memory 950.
  • the memory 950 of the system 900 can comprise a volatile memory or a non-volatile memory.
  • the memory 950 of the system 900 may comprise a non-transitory media. Examples of the memory 950 of the system 900 include, but are not limited to, a random access memory (RAM), a read only memory (ROM), a mass storage media such as a hard disk, a removable storage media such as a compact disk (CD) or a digital video disk (DVD), and/or any other memory.
  • the processing circuitry 948 of the system 900 can be connected to the memory 950 of the system 900.
  • the memory 950 of the system 900 may be for storing program code or instructions which, when executed by the processing circuitry 948 of the system 900, cause the system 900 to operate in the manner described herein in respect of the system 900.
  • the memory 950 of the system 900 may be configured to store program code or instructions that can be executed by the processing circuitry 948 of the system 900 to cause the system 900 to operate in accordance with the method described herein in respect of the system 900.
  • the memory 950 of the system 900 can be configured to store any information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the processing circuitry 948 of the system 900 may be configured to control the memory 950 of the system 900 to store information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the system 900 may optionally comprise a communications interface 952.
  • the communications interface 952 of the system 900 can be connected to the processing circuitry 948 of the system 900 and/or the memory 950 of system 900.
  • the communications interface 952 of the system 900 may be operable to allow the processing circuitry 948 of the system 900 to communicate with the memory 950 of the system 900 and/or vice versa.
  • the communications interface 952 of the system 900 can be configured to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • the processing circuitry 948 of the system 900 may be configured to control the communications interface 952 of the system 900 to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
  • although the system 900 is illustrated in Fig. 9 as comprising a single memory 950, it will be appreciated that the system 900 may comprise at least one memory (i.e. a single memory or a plurality of memories) that operates in the manner described herein.
  • although the system 900 is illustrated in Fig. 9 as comprising a single communications interface 952, it will be appreciated that the system 900 may comprise at least one communications interface (i.e. a single communications interface or a plurality of communications interfaces) that operates in the manner described herein.
  • Fig. 9 only shows the components required to illustrate an embodiment of the system 900 and, in practical implementations, the system 900 may comprise additional or alternative components to those shown.
  • Fig. 10 illustrates a system 1000 according to an embodiment.
  • the system 1000 may comprise a state detector 1054 configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, an agent 1056 configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, and a data stream processor 1058 configured to, where it is determined that a modification is to be performed, perform the determined modification.
  • the system 1000 may comprise a state detector 1054 configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, an agent 1056 configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, and a data stream processor 1058 configured to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification, the state detector being further configured to process the video data stream of the participant to determine a new state relating to the participant and determine a reward relating to the action; and the agent being further configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
  • the various exemplary embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof.
  • some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto.
  • While various aspects of the exemplary embodiments of this disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
  • the exemplary embodiments of the disclosure may be practiced in various components such as integrated circuit chips and modules. It should thus be appreciated that the exemplary embodiments of this disclosure may be realized in an apparatus that is embodied as an integrated circuit, where the integrated circuit may comprise circuitry (as well as possibly firmware) for embodying at least one or more of a data processor, a digital signal processor, baseband circuitry and radio frequency circuitry that are configurable so as to operate in accordance with the exemplary embodiments of this disclosure.
  • exemplary embodiments of the disclosure may be embodied in computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices.
  • program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device.
  • the computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid state memory, RAM, etc.
  • the function of the program modules may be combined or distributed as desired in various embodiments.
  • the function may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like.
  • a first element could be termed a second element, and similarly, a second element could be termed a first element, without departing from the scope of the disclosure.
  • the term “and/or” includes any and all combinations of one or more of the associated listed terms.

Abstract

Methods and apparatuses for teleconferencing systems are disclosed. According to an embodiment, there is provided a computer implemented method for a teleconferencing system, the method comprising: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and where it is determined that a modification is to be performed, performing the determined modification. There is also provided a computer implemented method of training the neural network.

Description

METHODS AND APPARATUSES FOR TELECONFERENCING SYSTEMS
Technical Field
[0001] Embodiments of the disclosure generally relate to methods and apparatuses for teleconferencing systems.
Background
[0002] This section introduces aspects that may facilitate better understanding of the present disclosure. Accordingly, the statements of this section are to be read in this light and are not to be understood as admissions about what is in the prior art or what is not in the prior art.
[0003] In recent years there has been an increase in the use of telecommunications equipment for teleconferencing in order to perform meetings for users who are located remotely from one another, or who are interfacing through different devices even if they are in the same location, for example in different offices in the same building. In order to replicate a face-to-face meeting, teleconferencing often incorporates video as well as audio of participants including audiences and presenters. In such meetings, it is commonplace that attendees see each other and can interact with one another by talking to one another. However, it is more difficult to determine emotional cues in a teleconferencing system where participants are represented in two dimensions (2D).
[0004] Several approaches have been proposed to use psychology research and deep neural networks to be able to detect meeting participant emotions such as those shown in US9576190B2. Specifically, in US9576190B2, the system is configured to convey the detected emotion to third parties and in addition, to allow a third party to enter and review the video conference. In US9576190B2 only negative emotions are captured.
Summary
[0005] This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
[0006] While teleconferencing attempts to replicate a face-to-face meeting, there may be disadvantages of teleconferencing over a real word meeting. For example, where a teleconference system reduces video to a two-dimensional (2D) representation of participants, it becomes harder to use body language, facial expressions, posture and movement as a means of communication. Furthermore, while an advantage of a teleconference is that it can be performed by participants in any appropriate location, there may be issues with the environment that the teleconference takes place in, such as from external environmental distractions, which may be disruptive to other participants as well as the participant who is directly affected by the environment. It may also be difficult for a presenter to distinguish between reactions of participants to their environment with reactions of participants to the material being conveyed in the teleconference.
[0007] One of the objects of the disclosure is to provide an improved solution for teleconferencing. In particular, to provide improved usability of a teleconference system.
[0008] According to a first aspect of the disclosure, there is provided a computer implemented method for a teleconferencing system. The method comprises: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The method comprises determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The method comprises, where it is determined that a modification is to be performed, performing the determined modification.
[0009] The determined state may be processed by a trained neural network, wherein if the state is an unwanted state the trained neural network may determine a modification to be performed.
[0010] The state may be further based on at least one of: elapsed time; elapsed time relative to a predetermined length of the teleconference; background context of the participant; a change in at least one of: background context of the participant, the emotional state of the participant; content shared amongst participants; complexity of content shared amongst participants; previous content shared amongst participants; future content to be shared amongst participants.
[0011] The processing to determine a state may further comprise processing at least one of: an audio data stream of the at least one participant; content shared amongst participants.
[0012] Content shared amongst participants may be, for example, a presentation (such as comprising slides) that is shared between participants. The content may be a screen (such as a desktop screen) of a presenter that is shared between participants.
[0013] The modification may comprise at least one of: modifying the playback speed of an audio data stream of the participant for a time period; reducing the playback speed of an audio data stream of the participant; turning off the video data stream; reducing the playback speed of the video data stream; modifying the background of frames of the video data stream.
[0014] The modification of the background of frames of the video data stream of a participant may comprise at least one of: modifying the background to a neutral setting; modifying hue; modifying saturation; modifying contrast; using image segmentation to modify the background.
[0015] An unwanted state may be a state that has been predetermined to be likely to negatively impact a participant.
[0016] The determined modification may further comprise determining a modification of the environment of the participant. The modification of the environment of the participant may comprise at least one of: turning on noise cancelling in a device of the participant; sending at least one of a notification and an actuation to a remote device; sending at least one of a notification and an actuation to a device comprised in the teleconference system.
[0017] According to an aspect there is provided a computer implemented method for training a neural network based on reinforcement learning. The method comprises processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The method comprises determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The method comprises, if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification. The method comprises processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action. The method comprises training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[0018] The method may further comprise determining the reward relating to the action responsive to receiving input from the participant relating to the modification.
[0019] The reward may indicate the benefit of the modification to the participant.
[0020] The state may comprise information on at least one of: the emotional state of the participant; elapsed time of the teleconference; elapsed time relative to a predetermined length of the teleconference; content shared between participants; the state of the environment of the participant; a time at which the state is observed.
[0021] An action may comprise processing instructions indicating how the at least one of the audio data stream and the video data stream are to be modified or an indication that no modification is to be performed. For example, the action may be that the video data stream and/or audio data stream are to continue without a modification.
[0022] The processing to determine a state relating to the participant may further comprise processing at least one of: an audio data stream of the at least one participant; content shared amongst participants.
[0023] In a first number of iterations of the method, the method may comprise selecting a random action to perform for an unwanted state. In a second number of iterations of the method, the neural network may select an action to perform for the unwanted state.
[0024] A plurality of iterations of the method may be performed up to the determining of the reward, wherein the method may further comprise storing a plurality of rewards, and the neural network may be trained using the plurality of rewards.
[0025] According to an aspect there is provided a computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out any of the methods described herein.
[0026] According to an aspect there is provided a computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out any of the methods described herein.
[0027] According to an aspect there is provided a data processing system. The data processing system is configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The data processing system is configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The data processing system is configured to, where it is determined that a modification is to be performed, perform the determined modification.
[0028] According to an aspect there is provided a data processing system configured to train a neural network based on reinforcement learning. The system being configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The data processing system is configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The data processing system is configured to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; process the video data stream of the participant to determine a new state relating to the participant. The data processing system is configured to determine a reward relating to the action. The data processing system is configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[0029] The systems may be configured to perform any of the methods described herein.
[0030] According to an aspect there is provided a data processing system comprising a processor and a memory. The memory contains instructions executable by said processor. The system is operative to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The system is operative to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The system is operative to, where it is determined that a modification is to be performed, perform the determined modification.
[0031] According to an aspect there is provided a data processing system for training a neural network based on reinforcement learning. The data processing system comprises a processor and a memory, said memory containing instructions executable by said processor. The data processing system is operative to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The data processing system is operative to determine an action that pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The data processing system is operative to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification. The data processing system is operative to process the video data stream of the participant to determine a new state relating to the participant. The data processing system is operative to determine a reward relating to the action. The data processing system is operative to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[0032] According to an aspect there is provided a data processing system comprising a state detector configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The data processing system comprises an agent configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The data processing system comprises a data stream processor configured to, where it is determined that a modification is to be performed, perform the determined modification.
[0033] According to an aspect there is provided a data processing system for training a neural network. The data processing system comprises a state detector configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The data processing system comprises an agent configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The data processing system comprises a data stream processor configured to, if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification. The state detector is further configured to process the video data stream of the participant to determine a new state relating to the participant and determine a reward relating to the action. The agent further configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
Brief Description of the Drawings
[0034] These and other objects, features and advantages of the disclosure will become apparent from the following detailed description of illustrative embodiments thereof, which are to be read in connection with the accompanying drawings.
[0035] FIG. 1 is a diagram illustrating a method according to an example;
[0036] FIG. 2 is a block diagram illustrating a system according to an example;
[0037] FIG. 3 is a diagram illustrating emotions as combinations of valence and arousal;
[0038] FIG. 4 is a diagram illustrating an example of a knowledge graph;
[0039] FIG. 5 is a diagram illustrating a method according to an example;
[0040] FIG. 6 is a block diagram illustrating a system according to an example;
[0041] FIG. 7 illustrates a user interface according to an example;
[0042] FIG. 8 illustrates a block diagram illustrating the processes performed by a system for training a neural network according to an example;
[0043] FIG. 9 illustrates a system according to an example; and
[0044] FIG. 10 illustrates a system according to an example.
Detailed Description
[0045] For the purpose of explanation, details are set forth in the following description in order to provide a thorough understanding of the embodiments disclosed. It is apparent, however, to those skilled in the art that the embodiments may be implemented without these specific details or with an equivalent arrangement.
[0046] While teleconferencing has facilitated communication between remote users, as this form of communication becomes more common, people who struggle with technology, have pre-existing anxiety disorders and also have other household tasks and/or distractions to attend to while in the meeting, may suffer from stress and anxiety. It is desirable to be able to detect the emotional state of a user, or events occurring in their surroundings, in order to mitigate the effect on other users, or to convey the information to the other users. In doing so, a more effective teleconferencing system may be provided.
[0047] There are various solutions that may be used to monitor reactions of users. For example, solutions are being developed to analyze electromagnetic signals from the brain. The signals can be detected by electroencephalography, EEG, a non-invasive method where no surgical procedure is required. It is therefore possible to collect data from the brain signals.
[0048] Supervised learning techniques can be used to train models to interpret the state of a human based on electromagnetic brain signals, for example, mental fatigue or a broader range of emotional states.
[0049] Other methods for detecting emotional states are based on galvanic skin response (GSR), which measures electrical parameters of human skin, respiration rate analysis (RR), which measures respiration velocity and depth (that vary with human emotion), and electrooculography (EOG), which measures the corneo-retinal standing potentials between the front and back of the human eye.
[0050] A common feature of the aforementioned approaches is the requirement for special types of sensors (albeit non-invasive in some cases) that need to measure qualities of the subject in order to identify their emotional state.
[0051] Another area of interest which does not require specialized sensors is that of using visual cues to identify the emotional state of a subject, for example by analyzing facial expressions, body posture and movements. For example, there are available systems able to detect real-time stress levels and arousal, or deal with a broader range of emotions ranging from happiness to sadness, anger and disgust.
[0052] In current systems which perform emotion recognition during online meetings, approaches that detect emotional condition of participants look to either introduce third parties to arbitrate or moderate the session or convey the emotion of a participant to the rest of the party. In contrast, the methods and systems described herein reprocess the audio and video data streams in order to normalize the online session, thereby providing an improved teleconference system. Such a system may also mitigate negative emotions and facilitate presenting.
[0053] Furthermore, current systems are limited to a focus on particular (typically negative) emotions, rather than on emotions such as apprehension which may be due to an interruption which is about to occur (e.g. the need to step out of a meeting). In contrast, the methods and systems herein adapt the functioning of a telecommunication system by processing a video data stream of a participant (and in some cases the audio data stream as well) to provide an improved telecommunication system.
[0054] According to an example there is provided a computer implemented method for a teleconferencing system, the method comprising, in a first step, processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant. In a second step, determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. In a third step, where it is determined that a modification is to be performed, performing the determined modification.
[0055] This method is illustrated in Fig. 1, which shows a method according to an example. In particular, this Figure illustrates the first step 102 of processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, the second step 104 of determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, and a third step 106 of, where it is determined that a modification is to be performed, performing the determined modification.
[0056] In this way, modifications may be performed to reduce unwanted states, for example, distractions for participants. Thus, usability of a teleconference system is improved. The distractions may be predetermined (e.g. preset), where the method detects a predetermined distraction and acts to mitigate the distraction, for example by turning off the video for the participant causing the distraction. Advantageously, this may result in online meetings being more adapted to a presenter's needs. For example, there may be some drawbacks from using teleconferencing systems, including psychological and physiological side effects in the users, such as increased stress. The benefits of the methods described herein may include reducing negative emotions such as stress or anxiety related to the session and facilitating breaks. By reducing unwanted states, a presenter may be more able to focus on their presentation, rather than being required to focus on what is happening around them in their environment or distractions caused by audience members.
[0057] An unwanted state may be a state that has been predetermined to be likely to negatively impact a participant. Examples of unwanted states include an audience member speaking, a person walking into the video stream of a participant (e.g., who is not the participant), a participant moving excessively, a participant moving out of the frame of the video data stream, and background noise (such as a baby crying, or talking or shouting from persons other than participants). Further examples include the conveying of negative emotions, for example from participants. For example, shaking of a head of a participant, or frowning, may be distracting for a presenter and therefore negatively impact them.
[0058] The methods described herein may improve the presenter's experience during unwanted events, for example allowing the presenter to drink water or attend to other business during a virtual event without generating any disturbance for the meeting audience. Unwanted events may comprise, for example, a child or adult walking into a room, an animal causing a distraction, an anxiety or panic attack, disorders that cause movement or sounds, etc. The unwanted states may be states that are unwanted to any participant, to a presenter, or to an audience. In particular, the unwanted state may be a state that would negatively impact a presenter, for example, by causing them emotional distress or distracting them.
[0059] A teleconferencing system may comprise a plurality of remote devices that are connected through a network, such as a wireless network, radio network, telecommunications network, via an internet connection, etc. A remote device may be, or may be comprised in, a client device. Each participant may have access to a client device. The client device may comprise at least one of a camera, a microphone, a system for sharing content (e.g. a presentation), a network interface for communicating video, audio, and/or content (e.g. cellular connectivity), and a display for displaying the teleconference (so that the user can see other participants). The teleconferencing system may comprise client devices configured to display the teleconference, where the teleconference may be displayed differently at each client device. A client device may be, for example, a computer, laptop, mobile phone. The teleconferencing system may comprise at least one camera and at least one microphone which are configured to record video data streams and audio data streams respectively. A participant may be recorded via a camera and/or a microphone. The data streams resulting from these recordings may be played to other participants. The video data streams may be displayed using an application, such as a web application, mobile application, desktop application. Content of a display device may be shared as shared content in the teleconferencing system, such as the desktop of a user, slides of a presentation etc.
[0060] The determined state may be processed by a trained neural network, wherein if the state is an unwanted state the trained neural network determines a modification to be performed.
[0061] In this way, a trained neural network may have been trained in order to determine the optimal action relating to a modification to take for a particular state of a user. This may be choosing to take no action, make no modification, where the state is not an unwanted state (e.g. a typical state where no predetermined distractions are occurring) or choosing the optimal modification where the state is an unwanted state. Therefore, the most appropriate modification may be selected for a given state.
[0062] In one example, one of the participants of a teleconference is a “presenter” (for example, is primarily speaking or presenting information to other participants), and the other participants are the “audience” (for example, generally do not speak or present information). It may be assumed in some examples that there are no interactions from the audience during the presentation. Herein, the presence of one “presenter” who is presenting to a virtual “audience” of potentially multiple members is assumed. The roles of the “presenter” and the “audience” can be changed during the presentation, between the attendees.
[0063] Fig. 2 illustrates a system 200 configured to perform the method for a teleconferencing system according to an example. In particular, the system comprises a state detector 214, an agent 216, and a data stream processor 218. The system may also comprise, or be connected to, a data stream provider 220. The data stream provider 220 may provide at least one of an audio data stream, a video data stream, and shared content of at least one participant to the state detector. For example, the data stream provider may comprise data streams of a plurality of participants, such as audience members and a presenter. The “State Detector” (SD) may be configured to use reinforcement learning to learn the state relating to a participant, such as the presenter, and predict future states. The “Data stream Processor” (DP) may decide what actions should be carried out (e.g. modifications to perform) in order to improve and facilitate the presentation by actuating control on the audio and video data streams. The DP may use reinforcement learning in order to predict the best actions.
[0064] The state detector 214 (SD) is configured to receive data 222 from a data stream provider 220 (e.g., the presenter’s mobile application). This data may include any of a video data stream and an audio data stream as well as content that is being shared between participants (for example, a presentation). The SD processes these data streams and outputs a user (participant) state description 224 which contains a set of information regarding the user’s current state. Thus, a state relating to the participant may be determined. The state is based on at least one of an emotional state of the participant and a state of an environment of the participant. In particular, the state relating to the participant may comprise information elapsed time (of the teleconference). The state may comprise information on elapsed time relative to a predetermined length of the teleconference. For example, with an increase in the elapsed time, participants may begin to experience fatigue or lose focus. The state may comprise information on background context of the participant. For example, the background context may indicate a potential distraction in the frame of a participant. The state may comprise information on a change in background context of the participant. For example, the background context of the participant may indicate that a person has entered the room in which the participant is participating in the teleconference, and is likely to cause an interruption. The state may comprise information on a change in the emotional state of the participant. The state may comprise information on content shared amongst participants. The state may comprise information on complexity of content shared amongst participants. The state may comprise information on previous content shared amongst participants. The state may comprise information on future content to be shared amongst participants.
[0065] For example, a state may comprise the information that is indicative of the current user status, and may be used to inform the agent of the current status of the user. Looking at Fig. 2, this user state description is the output of SD, which uses raw data stream information from the presenter and the audience to create a semantic description of the state. The state of a presenter at a given iteration (episode) i can be described as:
S_i = {p_1, ..., p_k, a_1, ..., a_x}, where p_1 to p_k are observations concerning the presenter, and a_1 to a_x are observations concerning the audience members of the meeting. Then,
∀ p_m ∈ S_i: p_m = {e_m, s_m, c_m, b_m, t_m}, where e_m is the emotional state of the user as a product of sentiment analysis at the SD of the video and audio data streams.
[0066] An example of sentiment analysis is presented in J. A. Russell, “A circumplex model of affect,” J. Pers. Soc. Psychol., vol. 39, no. 6, pp. 1161-1178, 1980, wherein each emotion can be understood as a linear combination of valence and arousal. Therefore, e_m can be considered to be a 2-tuple <val_m, ar_m>, where both val_m and ar_m are normalized values: val_m, ar_m ∈ [-1, 1].
[0067] Thus, the video data stream of a participant may be processed using sentiment analysis. In particular, the video data stream may be processed by a convolutional neural network classifier, trained to detect valence and arousal emotions from labeled video data frames. Thus, a user state description may be output based on the video data stream.
[0068] Fig. 3 illustrates an example of various emotions determined as combinations of valence and arousal. In particular, the valence may range from unpleasant (-1) to pleasant (1), and the arousal may range from deactivated (-1) to activated (1). Generally, the detection of an unpleasant emotional state in a participant may result in a modification, where the specific emotional state detected may be usable to determine an appropriate action to take. For example, where the detected emotion is “stressed”, it may be desirable to turn off the video feed of the participant who is expressing stress in order that the presenter is not distracted.
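By way of a non-limiting illustration, the per-presenter observation p_m = {e_m, s_m, c_m, b_m, t_m} and the mapping of a (valence, arousal) pair onto a quadrant of the circumplex model could be sketched as follows. This is a minimal sketch: the classifier producing the valence and arousal values is not specified, and the field names and placeholder values are assumptions made only for the purpose of the example.

```python
# Minimal sketch of the presenter observation p_m = {e_m, s_m, c_m, b_m, t_m}.
# Any model returning normalised valence/arousal in [-1, 1] per frame could
# supply the e_m components; the values below are placeholders.
from dataclasses import dataclass
import time

@dataclass
class PresenterObservation:
    valence: float          # first component of e_m, in [-1, 1]
    arousal: float          # second component of e_m, in [-1, 1]
    timeline: float         # s_m, e.g. fraction of the meeting elapsed
    complexity: float       # c_m, complexity of current/upcoming content
    background_pod: float   # b_m, summed probability of distraction
    timestamp: float        # t_m, when the observation was made

def coarse_emotion(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair onto one quadrant of the circumplex model."""
    if valence >= 0:
        return "pleasant-activated" if arousal >= 0 else "pleasant-deactivated"
    return "unpleasant-activated" if arousal >= 0 else "unpleasant-deactivated"

# Placeholder values for a presenter showing signs of stress
obs = PresenterObservation(valence=-0.6, arousal=0.7, timeline=0.4,
                           complexity=0.8, background_pod=0.7,
                           timestamp=time.time())
print(coarse_emotion(obs.valence, obs.arousal))  # unpleasant-activated
```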
[0069] The timeline of the teleconference is defined as s_m, which may indicate a timeline of the presentation. A timeline can be expressed, for example, as a percentage of the time elapsed if the time of the meeting is known in advance, in which case s_m can be normalized (s_m ∈ [0, 1], where 0 indicates the start of the meeting and 1 the end of the meeting). Alternatively, if the length of the meeting is not known in advance, then s_m may indicate the time elapsed, e.g., the number of seconds elapsed. Alternatively or additionally, in a case where the content being shared between participants is slides (for example, of a presentation), then s_m can indicate the slide number, which can be used either in conjunction with other metrics such as time elapsed (in this case s_m is also a 2-tuple) or in isolation.
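As a minimal sketch of the timeline constituent s_m (the function name and the choice of seconds as the time unit are assumptions), the two cases above could be computed as:

```python
# Sketch of s_m: a normalised fraction when the meeting length is known,
# otherwise the raw number of seconds elapsed.
from typing import Optional

def timeline_fraction(elapsed_s: float, planned_length_s: Optional[float]) -> float:
    """s_m: fraction of the meeting elapsed if its length is known in advance,
    otherwise simply the number of seconds elapsed."""
    if planned_length_s:
        return min(max(elapsed_s / planned_length_s, 0.0), 1.0)
    return elapsed_s

print(timeline_fraction(900, 3600))  # 0.25 - a quarter of the way through
print(timeline_fraction(900, None))  # 900 - raw seconds, length unknown
```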
[0070] A further optional parameter that may be used is c_m; in this example, the content being shared is slides. The optional parameter indicates the complexity of the content being presented and about to be presented (i.e., in the subsequent x number of slides). A higher c_m may indicate that a break may be needed for both the presenter and the audience. Additionally, or alternatively, the optional parameter may indicate that slower playback is required, in order for people to have enough time for both presenting clearly and understanding the presentation.
[0071] The complexity of the content may be dependent on a number of factors. Firstly, the volume of content (for example, in slides), which may include the number of words n_words, pictures n_pics and animations n_anim in a slide relative to the totals for the presentation (N_words, N_pics, N_anim), as:
c_content = (n_words / N_words) + (n_pics / N_pics) + (n_anim / N_anim)
c_content is normalized to range between [0, 1] as:
c_content_norm = (c_content - c_content_min) / (c_content_max - c_content_min).
[0072] Secondly, readability metrics (e.g. Flesch-Kincaid score, Dale-Chall readability), which infer the difficulty of understanding and the lexical sophistication of the text, may also be considered as contributing to the complexity of the content. The range of c_readability will be between [0, 1], where 0 is text that is easily understood and 1 is text that is very difficult to understand.
[0073] Then, an exemplary formula for calculating c_m could be:
c_m = c_content + c_readability
[0074] The c_m value is then stored in a vector c_m_slides where each element is the complexity of one slide. For example, c_m_slides = [c_m_1, c_m_2, ..., c_m_n], where c_m_n is the value of c_m on slide number n. This makes it possible to see where in the presentation the slides with the highest complexity occur. The number x of slides to take into account may depend on the implementation (e.g., it could be 1, i.e., the current slide, or 3, i.e., the current and next 2 slides, etc.).
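A minimal sketch of the slide complexity computation described above is given below. The helper names and the three-slide example deck are hypothetical, and the normalization follows the [0, 1] min-max form given in paragraph [0071].

```python
# Sketch of per-slide complexity c_m = c_content_norm + c_readability.
def content_complexity(n_words, n_pics, n_anim, N_words, N_pics, N_anim):
    """c_content: volume of one slide relative to the whole deck."""
    return n_words / N_words + n_pics / N_pics + n_anim / N_anim

def min_max_normalise(value, lo, hi):
    """Rescale c_content into [0, 1] across the deck (c_content_norm)."""
    return (value - lo) / (hi - lo) if hi > lo else 0.0

# Hypothetical three-slide deck: (words, pictures, animations, readability)
slides = [(120, 1, 0, 0.3), (300, 4, 2, 0.8), (60, 0, 0, 0.2)]
N_words = sum(s[0] for s in slides)
N_pics = max(sum(s[1] for s in slides), 1)   # avoid division by zero
N_anim = max(sum(s[2] for s in slides), 1)

raw = [content_complexity(w, p, a, N_words, N_pics, N_anim) for w, p, a, _ in slides]
lo, hi = min(raw), max(raw)

# c_m_slides: one complexity value per slide
c_m_slides = [min_max_normalise(c, lo, hi) + r
              for c, (_, _, _, r) in zip(raw, slides)]
print(c_m_slides)  # the largest entry flags the slide most likely to need a break
```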
[0075] The background context, or environment, of the user may also indicate events that could distract, or negatively impact, the user. The background context may be indicated by b_m, which is an assessment of the background context of the user. The assessment may be based on analysis of both the audio stream and the video stream, or on analysis of just the video stream or just the audio stream. For analysis of the audio stream, a fast Fourier transform (FFT) can be used to distinguish the user's voice from the background audio streams, and then audio fingerprinting can be used to identify the background audio streams semantically. Subsequently, a graph knowledge base can be used to associate the identified background audio streams with a probability of user distraction.
[0076] An example graph is shown in Fig. 4, which shows an ontology of entities and their relations. This graph shows an example of a knowledge graph that correlates fingerprinted background audio with a probability of distraction (POD), a scalar that shows how probable it is that the user will be distracted by the background audio. As is illustrated in this example, it is possible to use the knowledge base to extract a probability of distraction (POD) value from the graph. As is illustrated in this Figure, a baby crying, yelling, and talking are determined as having PODs of 0.7, 0.6, and 0.4 respectively, and are also classed as human audio, which is further classed as sound. Similarly, a phone ringing, a doorbell ringing, and an alarm clock ringing have PODs of 0.4, 0.3 and 0.2 respectively, and are classed as electronic devices, which is also classed as sound. Sound may be classed as a “thing” (where a “thing” is a superclass that all classes inherit from).
[0077] It will be appreciated that a similar graph to that shown in Fig. 4 can be used for visual observations, or the graph may additionally incorporate information from visual observation. For example, if a yelling person or crying baby is in the frame, then the POD will be increased.
[0078] The background context b_m may therefore indicate the sum of PODs from audio and/or visual analysis of the audio data stream and/or the video data stream.
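As a minimal sketch (the event labels and the lookup table are assumptions that mirror the example knowledge graph of Fig. 4), b_m could be derived from the fingerprinted background events as follows:

```python
# Sketch of b_m: sum of probability-of-distraction (POD) values for all
# background events detected in the audio and/or video streams.
PROBABILITY_OF_DISTRACTION = {
    "baby_crying": 0.7,
    "yelling": 0.6,
    "talking": 0.4,
    "phone_ringing": 0.4,
    "doorbell": 0.3,
    "alarm_clock": 0.2,
}

def background_context(detected_events):
    """b_m: sum of PODs for all detected background events; unknown events count 0."""
    return sum(PROBABILITY_OF_DISTRACTION.get(event, 0.0) for event in detected_events)

print(background_context(["talking", "doorbell"]))  # 0.7
```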
[0079] A state descriptor may further comprise a time at which a state occurs. For example, t_m may be the timestamp of observation indicating the duration through which e_m, c_m, s_m, and b_m were observed. Thus, a user state description may include multiple observations (for example, those prefixed by “p” as described above), where a change in state over time may be used to determine if an unwanted state is occurring. (As will be described later, the neural network may use the time at which a state occurs to learn trends in how the different constituents change over time, in order to be able to predict those constituents into the future).
[0080] Considering observations from audience members (for example, those prefixed by “a” described above), the analysis of the SD for these observations may only include a visual assessment, as audience members are typically muted while another participant is presenting. The visual assessment in this case may indicate a degree of frame change for every audience member - this could be performed as an image processing task where the pixels of one frame are compared with the pixels of the previous frame (e.g., in all three RGB channels in the case of a colour video feed) and, if the difference in pixels exceeds a certain threshold, this could be interpreted as one or more of the audience members moving excessively. For reasons of simplicity (although not exclusively), the calculation may aggregate all video streams from all audience members into a single scalar represented by a_1, ..., a_x as described above.
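A minimal sketch of such a frame-change check is given below; the frame representation, the aggregation into a maximum over audience members, and the threshold value are assumptions made only for illustration.

```python
# Sketch of the audience movement check: compare consecutive RGB frames and
# flag excessive change. Frames are assumed to be numpy arrays of shape
# (height, width, 3); the threshold is an arbitrary placeholder.
import numpy as np

def frame_change(prev_frame: np.ndarray, frame: np.ndarray) -> float:
    """Mean absolute pixel difference over all three RGB channels, in [0, 1]."""
    diff = np.abs(frame.astype(np.float32) - prev_frame.astype(np.float32))
    return float(diff.mean() / 255.0)

def audience_scalar(prev_frames, frames, threshold=0.1):
    """Aggregate all audience feeds into a single movement indicator."""
    changes = [frame_change(p, f) for p, f in zip(prev_frames, frames)]
    return (max(changes) if changes else 0.0), any(c > threshold for c in changes)

# Two synthetic "audience members": one static feed, one with large changes
prev = [np.zeros((4, 4, 3), dtype=np.uint8), np.zeros((4, 4, 3), dtype=np.uint8)]
curr = [np.zeros((4, 4, 3), dtype=np.uint8), np.full((4, 4, 3), 200, dtype=np.uint8)]
print(audience_scalar(prev, curr))  # (~0.78, True) -> excessive movement detected
```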
[0081] Referring back to Fig. 2, the agent 216 may be an intelligent agent, which is configured to determine an optimal action to take on the data stream (comprising at least one of the audio data stream and the video data stream) based on the user state description 224. Thus, the agent may output an action 226 based on an input user state description 224. In particular, the agent may comprise a trained neural network 228 which is configured to determine an action indicative of a modification to perform on a data stream based on the detection of an unwanted state. For example, the trained neural network may determine whether the detected state of the user is an unwanted state (for example, a state that has been predetermined as likely to cause a distraction to a participant, be it the participant themself or another participant such as the presenter, or a state that has been predetermined to be likely to negatively impact a participant) and determine an appropriate modification to perform if the state is an unwanted state.
[0082] An action may be defined as processing instructions which are sent to the DP, where the DP is configured to modify incoming audio and video streams from the presenter and/or the audience members.
[0083] In an example, the action may be the delaying of video and/or audio, and/or the turning off of the video or audio, or employing noise cancelling, as described above. In this example the action space is defined as:
A = {a_1, ..., a_k}: a_m = {prolong{10s, 20s, 30s, 40s}, presenter_video_state, audience_video_state, noise_cancellation} ∀ a_m ∈ A
In the above formalization, the first member prolong{10s, 20s, 30s, 40s} of the set indicates by how much playback of the audio data should be delayed. In this case, playback of the audio stream slows down. The video data stream may be turned off at this point, or the playback of the video data stream may slow down as well (in case presenter_video_state is set to 1). This member comprises a 4-tuple, wherein if the first element of the 4-tuple is set to 1 then audiovisual playback is prolonged for 10 seconds, if the second element is set to 1 then playback is prolonged for 20 seconds, and so on. The values described herein are exemplary and any appropriate values could be used to prolong the playback and/or increase the granularity of values. In this example of a 4-tuple, only one of the elements can have a value of “1”, where the other values are “0”. If all values are 0 then playback is not slowed down. In this example the second member “presenter_video_state” can be set to 0, in which case the presenter video is transmitted to audience members, or 1, in which case the video is turned off. The third member “audience_video_state” can be set to 0, in which case all video from audience members is not forwarded to the presenter's meeting application for rendering, or 1, in which case the video is forwarded. The fourth member “noise_cancellation” is set to 0 if noise cancellation for the user needs to be inactivated (and vice versa).
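A minimal sketch of this action encoding is given below; the class, the validation rule and the default values are assumptions based on the formalization above and are not the only possible implementation.

```python
# Sketch of the action 4-tuple {prolong, presenter_video_state,
# audience_video_state, noise_cancellation} with a one-hot prolong field.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    prolong: List[int] = field(default_factory=lambda: [0, 0, 0, 0])  # 10s/20s/30s/40s one-hot
    presenter_video_state: int = 0   # 0 = presenter video transmitted, 1 = turned off
    audience_video_state: int = 1    # 0 = audience video not forwarded to presenter, 1 = forwarded
    noise_cancellation: int = 1      # 0 = noise cancellation inactivated

    def prolong_seconds(self) -> int:
        if sum(self.prolong) > 1:
            raise ValueError("at most one prolong element may be set to 1")
        for slot, seconds in zip(self.prolong, (10, 20, 30, 40)):
            if slot:
                return seconds
        return 0  # all elements zero: playback is not slowed down

# Example: slow playback for 20 seconds and turn the presenter's video off
a = Action(prolong=[0, 1, 0, 0], presenter_video_state=1)
print(a.prolong_seconds())  # 20
```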
[0084] Thus, the modification indicated by the action may be a modification of the speed of the data stream for a given amount of time. In one example, the speed of the audio data stream is reduced while video is turned off. Sharing of content may continue as normal. The modification may be modification of background colors of audience members to a neutral setting (i.e., by adjusting hue/saturation/contrast and/or using image segmentation techniques) or turning off the video stream of audience members (for example, if it is determined that reactions of audience members may cause anxiety and/or agitation in the presenter). Thus, the modification may comprise modifying the playback speed of an audio data stream of the participant for a time period. The modification may comprise reducing the playback speed of an audio data stream of the participant. The modification may comprise turning off the video data stream. The modification may comprise reducing the playback speed of the video data stream. The modification may comprise modifying the background of frames of the video data stream. The modification of the background of frames of the video data stream of a participant may comprise modifying the background to a neutral setting. The modification of the background of frames of the video data stream of a participant may comprise modifying hue. The modification of the background of frames of the video data stream of a participant may comprise modifying saturation. The modification of the background of frames of the video data stream of a participant may comprise modifying contrast. The modification of the background of frames of the video data stream of a participant may comprise using image segmentation to modify the background.
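As a minimal sketch of one such background modification (assuming a person/background segmentation mask is already available from some segmentation model, which is not specified here), the saturation and brightness of background pixels could be adjusted as follows:

```python
# Sketch of neutralising the background of a video frame via hue/saturation/
# value adjustment, leaving the participant's pixels untouched.
import cv2
import numpy as np

def neutralise_background(frame_bgr: np.ndarray, person_mask: np.ndarray,
                          saturation_scale: float = 0.2) -> np.ndarray:
    """Desaturate and flatten the brightness of background pixels."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    background = person_mask == 0
    hsv[background, 1] *= saturation_scale               # reduce saturation
    hsv[background, 2] = 0.5 * hsv[background, 2] + 64   # flatten brightness
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)

# Synthetic 8x8 frame and mask (1 = participant, 0 = background)
frame = np.random.randint(0, 255, (8, 8, 3), dtype=np.uint8)
mask = np.zeros((8, 8), dtype=np.uint8)
mask[2:6, 2:6] = 1
print(neutralise_background(frame, mask).shape)  # (8, 8, 3)
```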
[0085] In this way, the modification provides the presenter with a time period in which they are no longer required to present while the other participants still receive a continuous presentation. For example, the presenter method may cause the video data stream of the participant to be turned off while the playback speed of the audio data stream is reduced (while the presenter continues to present). After a time period, the presenter will have presented sufficient material that they can stop presenting while the slowed audio data stream is still presented to the other participants. This may provide the presenter with an opportunity to deal with the distraction.
[0086] In a further example, the determined modification further comprises determining a modification of the environment of the participant. For example, the modification of the environment of the participant may comprise at least one of: turning on noise cancelling in a device of the participant; sending at least one of a notification and an actuation to a remote device; sending at least one of a notification and an actuation to a device comprised in the teleconference system.
[0087] Thus, the modification may be an interaction with a wearable device of the user, for example in order to turn off or on noise cancellation on the user's headset or initiate a notification on their smartwatch. Thus, the modification may comprise sending a notification/actuation to a remote device that is remote from the teleconference system. For example, the modification may comprise a notification sent to a smart watch, or mobile phone, of a participant. The modification may comprise actuation of lights, windows, curtains and so on. This may alert the participant that they are causing a distraction. A notification may be sent to a device comprised in the teleconference system such as a laptop which is being used by a participant as a part of the teleconference system.
[0088] The data stream processor 218 (DP) may implement the modification chosen by the intelligent agent. For example, the DP may intercept any of the video, audio and content sharing data stream and modify the content of any one of these based on the determined modification, and output at least one processed data stream 230 that has been processed based on the action or modification determined by the agent. The processed data streams 230 may be transmitted to user interfaces 232 (for example, at least to the user interface used by the presenter).
[0089] The trained neural network may be developed during a training phase prior to use, in which the agent learns to choose the best modification given a user state description by training its neural network using user feedback formalized as a reward. Then, the trained neural network may be used as a part of the agent, wherein the agent executes its neural network given a user state description to determine an action to take (a modification to perform).
[0090] The trained neural network may be trained using a computer implemented method for training a neural network based on reinforcement learning, the method comprising: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification; processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action; and training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[0091] This method is illustrated in Fig. 5, which shows a method according to an example. In particular, this method comprises a first step 508 of processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant, a second step 510 of determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state, and a third step 512 of, if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification; processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action; and training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[0092] Thus, a training phase may first occur prior to use of the trained neural network. (It will be further appreciated, however, that additional training of the neural network may be performed at any time.) The training may be reinforcement learning. For example, the agent may operate in a deep reinforcement learning (RL) loop with its environment. As per RL theory, a single agent case may be considered, which solves a Markov Decision Process (MDP) with an unknown transition probability model. According to the MDP formulation, an intelligent agent interacts with the environment. Federated learning can be used to generalize training for multiple distributed workers (i.e., users). Such generalization may help the model cope with the variance it would experience when deployed in reality, and may mitigate bias.
[0093] Fig. 6 illustrates the block components of a system for training a neural network based on reinforcement learning. In particular, the system comprises a state detector 614, an agent 616, and a data stream processor 618. The system may also comprise, or be connected to, a data provider 620.
[0094] The system of Fig. 6 may be configured in substantially the same way as the system of Fig. 2 (the corresponding description is not repeated here for brevity; however, it will be appreciated that the description corresponding to Fig. 2 applies herewith, where reference numerals of Fig. 6 correspond to like reference numerals of Fig. 2). However, in this example, the agent is further configured to train the neural network, whereas Fig. 2 illustrates a system with the trained neural network. The state may be detected/determined as described in relation to Fig. 2, but in this case the state may also be used for training of the neural network. The method of training the neural network may also comprise determining an action to take based on the state. The method may further comprise determining the reward relating to the action responsive to receiving input from the participant relating to the modification (e.g. via a user interface). Thus, as is shown in Fig. 6, the system may further comprise a presenter feedback interface 634 configured to receive user input (for example, regarding the modification), i.e. user feedback on the action 646. The reward indicates the benefit of the modification to the participant. The user may indicate the benefit of the modification.
[0095] The agent 616 may further comprise an agent experience storage 636 configured to store information regarding the state of the user along with associated actions and rewards, and an action selection policy module 638, where the action selection policy module is configured to select either a random (exploration) action for a given user state description, or determine an action to take using a neural network 628, where the neural network is trained using the stored agent experience. The determined action 626 is then sent to the data stream processor 618 as is described in relation to Fig. 2. The data stream processor 618 modifies the data stream, and the state detector determines a new state occurring after the modification has been performed.
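As a non-limiting illustration, a single training interaction between the components of Fig. 6 might be glued together as sketched below; the component interfaces (detect, select_action, ask, apply, storage.store) are assumptions introduced for this sketch only:

def training_step(state_detector, agent, stream_processor, feedback_ui, streams):
    state = state_detector.detect(streams)                  # user state description
    action = agent.select_action(state)                     # exploration or neural-network policy
    reward = feedback_ui.ask(action)                        # presenter accepts/rejects the suggestion
    if reward is not None and reward > 0:
        stream_processor.apply(action, streams)             # perform the accepted modification
    new_state = state_detector.detect(streams)              # observe the effect of the action
    agent.storage.store(state, action, reward, new_state)   # buffer the experience for training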
[0096] After the action execution, the agent may observe a reward and the new state. The reward may be scalar and quantify the effectiveness of the selected action. A 4-tuple <old_state, action, reward, new_state> may be stored as an experience in the agent's buffer (the agent experience storage 636). As is described above, the state may comprise information on at least one of: the emotional state of the participant; elapsed time of the teleconference; elapsed time relative to a predetermined length of the teleconference; content shared between participants; the state of the environment of the participant; a time at which the state is observed. An action may comprise processing instructions indicating how the at least one of the audio data stream and the video data stream are to be modified, or an indication that no modification is to be performed.
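A minimal sketch of such an experience buffer is given below; the fixed capacity and the eviction of the oldest experiences are assumptions made for illustration only:

import random
from collections import deque, namedtuple

Experience = namedtuple("Experience", ["old_state", "action", "reward", "new_state"])

class ExperienceBuffer:
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences are evicted when full

    def store(self, old_state, action, reward, new_state) -> None:
        self.buffer.append(Experience(old_state, action, reward, new_state))

    def sample(self, batch_size: int) -> list:
        # Random minibatch of stored 4-tuples for training
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))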
[0097] After several iterations of determining a state, performing an action and receiving a reward have elapsed, the agent may train its deep neural network using experiences from the buffer. The deep neural network here may try to approximate the optimal policy. In order to do so, an agent learns to select the best action over a series of repetitions (iterations) known as "episodes". In every episode, the agent observes the current state and takes an action using a selection policy (an action may be "no modification", for example, where the state is not an unwanted state). For example, an ε-greedy policy initially favors exploration (i.e., choice of a random action) over exploitation (i.e., execution of the neural network), which may change in later iterations to favoring exploitation over exploration. In particular, in a first number of iterations, a random action to perform for a state may be selected, whereas optionally in a second number of iterations, an action to perform for a state may be selected by the neural network. Over time, the agent 616 may learn the optimal policy, i.e., the action that yields the highest short- and long-term reward for any user state. (For example, the action chosen in one state will also affect the future state of the environment, in addition to the immediate reward being returned. Therefore, an action also affects future rewards that will be returned, because future states are a derivative of the chosen action. The future rewards in RL are typically discounted using a discount factor to give greater importance to the immediate reward.) The neural network may additionally use the time at which a state occurs to learn trends in how the different constituents change over time, in order to be able to predict those constituents into the future.
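A minimal sketch of an ε-greedy selection policy with decaying exploration is given below; the decay schedule, and the assumption that the network returns an indexable sequence of Q-values, are made for illustration only:

import random

def epsilon_greedy_action(state, q_network, actions, episode: int,
                          eps_start: float = 1.0, eps_end: float = 0.05,
                          decay: float = 0.99) -> int:
    """Return the index of the selected action for the given user state description."""
    epsilon = max(eps_end, eps_start * (decay ** episode))  # early episodes favour exploration
    if random.random() < epsilon:
        return random.randrange(len(actions))                # exploration: random action
    q_values = q_network(state)                              # exploitation: query the network
    return max(range(len(actions)), key=lambda a: q_values[a])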
[0098] Any appropriate algorithm may be used for training. For example, value-learning algorithms such as Deep-Q learning approaches use deep neural networks to learn the best action for every state (i.e., the action that produces the highest value). Policy-learning algorithms such as actor-critic approaches learn a policy that maximizes the reward. Regardless of the approach, in order to formalize the MDP mathematically, the state, reward and action may be defined.
[0099] Training may start from random weights (where the agent does not know anything about the environment a priori), or alternatively there may be a baseline (pretrained) model. In the latter case, the baseline model can be trained using supervised learning, on a dataset collected in a controlled environment, reflecting the "majority view", i.e., the type of actions and states presenters may typically find themselves in. The use of a baseline model may reduce training time. Further training using a reinforcement learning algorithm may help the neural network learn corner cases and adapt to the emotional state and environment of specific presenters.
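As a non-limiting illustration, initialising the agent's network from such a supervised baseline (when one is available) might look as sketched below; the use of PyTorch and of a saved state dict are assumptions made for illustration only:

from typing import Optional
import torch

def initialise_agent_network(q_network: torch.nn.Module,
                             baseline_path: Optional[str] = None) -> torch.nn.Module:
    if baseline_path is not None:
        # Baseline trained with supervised learning on a controlled "majority view" dataset
        q_network.load_state_dict(torch.load(baseline_path))
    # Otherwise keep the random initialisation and let reinforcement learning start from scratch
    return q_network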
[00100] The reward may be determined using a feedback interface during training, for example an interface that is part of the teleconferencing system, such as a computer of a user. During training of the agent, an action chosen by the agent is presented as an option to the user. If the user confirms that the action is to be performed, then the data stream processor performs the modification corresponding to the action. Depending on the user's input, the agent is rewarded negatively or positively. For example, if the user accepts the modification, the agent may be positively rewarded. However, if the user does not accept the modification, the agent may be negatively rewarded.

[00101] An example interface is illustrated in Fig. 7. Fig. 7 illustrates an exemplary user interface showing the video stream of the presenter 742 in addition to a graphical user interface with a suggested action 744 of the agent. The options given in this example for the suggested action "We have detected that your baby is crying. We can slow down your speech and give you 20 seconds. Is that OK?" are "reject", "accept", or "I need more time" (which provides alternative time delays). The user is given a time window in which they can confirm the action. If the user misses the time window, then the reward is not transmitted and the system is therefore required to wait for the next suggestion. If the user rejects the action, then the reward is zero. If the user confirms the action, then the reward is 1.
[00102] In this example, a time delay is suggested, and in addition to the option to accept or reject the suggested time delay, an option for a different time delay is also provided. The reward may therefore be between 0 and 1 if the system correctly detects that more time is needed, but the user indicates that they require more time than the time that is suggested (e.g., where 20 seconds is suggested as is shown in Fig. 7, but the user selects a different time delay of 40 s from the drop-down option). In this case, assuming the user makes a choice within the allowed time, the reward may be described as follows: R = 1 - abs(time_suggested - time_wanted). Thus, in a case where a reject option, an accept option, and an alternative modification are presented to a user, the reward may be determined as follows:
R = 1 if the user accepts the suggested modification; R = 0 if the user rejects it; R = 1 - abs(time_suggested - time_wanted) if the user selects an alternative time delay; and no reward is transmitted if the user does not respond within the allowed time window.
[00103] This reward only takes into account the fact that the user can set the time they need. A more complex interface may alternatively be used where the user/presenter can also indicate whether they agree to a modification such as stopping/starting their video or the video of the audience and/or turning the noise cancellation on or off. In the above example, the responses to these suggestions would be captured as indicating that the modification is correct or incorrect by the binary accept/reject action, but it is possible to have a more fine-grained approach, where the time window in which a user can make a choice may also be increased to accommodate the finer granularity.
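A minimal sketch of the reward computation described above is given below; normalising the time deviation so that the reward stays within [0, 1] is an assumption introduced here, since the text states R = 1 - abs(time_suggested - time_wanted) without specifying units or scaling:

from typing import Optional

def compute_reward(response: str,
                   time_suggested: float = 20.0,
                   time_wanted: Optional[float] = None) -> Optional[float]:
    if response == "timeout":
        return None                      # reward not transmitted; wait for the next suggestion
    if response == "reject":
        return 0.0
    if response == "accept":
        return 1.0
    if response == "alternative" and time_wanted is not None:
        # Assumption: express the deviation as a fraction of the larger delay so R stays in [0, 1]
        deviation = abs(time_suggested - time_wanted) / max(time_suggested, time_wanted)
        return 1.0 - deviation
    raise ValueError(f"Unexpected response: {response}")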
[00104] It is noted that in the operation phase (using the trained neural network), the feedback interface from the user is not needed, as the agent makes automatic decisions on the user's behalf. As such, the user state description is fed directly to the agent by the SD, which inputs the user state description to its trained neural network (e.g. a deep recurrent Q network, DRQN), which in turn outputs an action, which is interpreted by the DP. The DP then processes incoming data streams to the presenter's rendering application.
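As a non-limiting illustration, a single operation-phase step might look as sketched below; the tensor encoding of the user state description and the PyTorch usage are assumptions made for illustration only:

import torch

@torch.no_grad()
def run_inference_step(state_vector, trained_q_network, actions, stream_processor, streams):
    """Feed the user state description to the trained network and apply the chosen action."""
    q_values = trained_q_network(torch.as_tensor(state_vector, dtype=torch.float32))
    best_action = actions[int(torch.argmax(q_values))]    # no user confirmation needed in operation
    stream_processor.apply(best_action, streams)          # the DP performs the modification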
[00105] Fig. 8 illustrates a block diagram illustrating the processes performed by a system for training a neural network according to an example. In particular, in a first loop 860, user feedback on an action 846 is sent from a presenter feedback interface 834 to a state detector 814. Data streams (comprising, for example, audio and video data streams and content shared between participants) are sent from data stream providers 820 to the state detector 814. An agent 816 initializes an experience replay memory 863. In a second loop 862 within the first loop 860, the second loop being performed for episodes 1...K, the state detector 814 sends a selected action 864 (selected using a selection policy) to the agent 816. The state detector also sends an observed reward and a new state 866 to the agent. The agent stores the action, observed reward and new state 868 (<s(j+1), s(j), a, r(j)>, i.e. <current state resulting from the action, old state, action taken in the old state, reward for the action>) in the memory. In a third loop 864 within the second loop 862, the third loop being performed for i = 1...T, the agent selects a random minibatch of experiences from the memory 870, sets y(j) = r(j) + γ·max Q(s(j+1), a(j+1), tn) 872 (the calculation of the target value y(j) using the target Q network), and performs a gradient descent step on tn using (y(j) - Q(s(j), a(j), dqn))^2 874 (the mean squared error calculation for gradient descent (training) of the Deep-Q network, i.e. the squared difference between the target value above and the value predicted by the Deep-Q network). In the second loop 862, the state detector sends the data stream to the data stream processor 818, and the data stream processor sends the processed data streams 830 to the data stream playback for the presenter 832. For example, the method may be executed as outlined below:
@startuml
participant Agent
participant State_Detector
participant Datastream_Processor
participant Presenter_Feedback_Interface
participant Datastream_Providers
participant Datastream_Playback_Presenter
loop
Presenter_Feedback_Interface -> State_Detector : User feedback on action
Datastream_Providers -> State_Detector : Audio, Video, Content being shared
Agent -> Agent : Initialize experience replay memory B
loop for episode = 1...K
State_Detector -> Agent : Select action a using selection policy (e.g., epsilon-greedy)
State_Detector -> Agent : Observe reward r(i), new state s(i+1)
Agent -> Agent : Store <s(i+1), s(i), a, r(i)> in B
loop for i = 1...T
Agent -> Agent : Select random minibatch of experiences <s(j+1), s(j), a, r(j)> from B
Agent -> Agent : Set y(j) = r(j) + gamma * max Q(s(j+1), a(j+1), tn)
Agent -> Agent : Perform gradient descent step on tn: (y(j) - Q(s(j), a(j), dqn))^2
end
Agent -> Datastream_Processor : Action
end
State_Detector -> Datastream_Processor : Datastream FW
Datastream_Processor -> Datastream_Playback_Presenter : Processed Datastreams
end
@enduml
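As a non-limiting illustration of the Deep-Q update performed in the inner loop above, a sketch is given below, assuming PyTorch, an online network dqn, a target network target_net, and batched tensors (states as float tensors of shape [batch, state_dim], actions as int64 indices, rewards as floats); these are assumptions made for illustration only:

import torch
import torch.nn.functional as F

def dqn_update(old_states, actions, rewards, new_states,
               dqn, target_net, optimiser, gamma: float = 0.99) -> float:
    # Target value y(j) = r(j) + gamma * max_a Q_target(s(j+1), a)
    with torch.no_grad():
        y = rewards + gamma * target_net(new_states).max(dim=1).values
    # Predicted value Q(s(j), a(j)) from the online Deep-Q network
    q = dqn(old_states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)              # squared error (y(j) - Q(s(j), a(j)))^2
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()                     # gradient descent step on the network weights
    return loss.item()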
[00106] The implementation of the method during a teleconference may be triggered on demand, e.g., when a user starts the presentation, or when a user indicates they would like the method to be performed. Alternatively, the method may be performed automatically. The system may not be in operation for the full duration of the meeting, i.e., it can stop before the interactive question-and-answer session that usually follows a presentation. Furthermore, the training of the neural network may occur on demand, for example upon an indication by a user. Training may occur each time a new user uses the teleconference system.
[00107] As illustrated in Fig. 9, in aspects of embodiments the (data processing) system 900 comprises processing circuitry (or logic) 948. The processing circuitry 948 controls the operation of the system 900 and can implement the method described herein in respect of the system 900. The processing circuitry 948 can be configured or programmed to control the system 900 in the manner described herein. The processing circuitry 948 can comprise one or more hardware components, such as one or more processors, one or more processing units, one or more multi-core processors and/or one or more modules. In particular implementations, each of the one or more hardware components can be configured to perform, or is for performing, individual or multiple steps of the method described herein in respect of the system 900. In some embodiments, the processing circuitry 948 can be configured to run software to perform the method described herein in respect of the system 900. The software may be containerised according to some embodiments. Thus, in some embodiments, the processing circuitry 948 may be configured to run a container to perform the method described herein in respect of the system 900.
[00108] Briefly, the processing circuitry 948 of the system 900 may be configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant. The processing circuitry 948 is further configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state. The processing circuitry 948 is further configured to, where it is determined that a modification is to be performed, perform the determined modification.
[00109] In another example, the processing circuitry 948 of the system 900 may be configured to train a neural network based on reinforcement learning. The processing circuitry 948 may be further configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; process the video data stream of the participant to determine a new state relating to the participant; determine a reward relating to the action; and train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[00110] As illustrated in Fig. 9, in some embodiments, the system 900 may optionally comprise a memory 950. The memory 950 of the system 900 can comprise a volatile memory or a non-volatile memory. In some embodiments, the memory 950 of the system 900 may comprise a non-transitory medium. Examples of the memory 950 of the system 900 include, but are not limited to, a random access memory (RAM), a read only memory (ROM), a mass storage media such as a hard disk, a removable storage media such as a compact disk (CD) or a digital video disk (DVD), and/or any other memory.
[00111] The processing circuitry 948 of the system 900 can be connected to the memory 950 of the system 900. In some embodiments, the memory 950 of the system 900 may be for storing program code or instructions which, when executed by the processing circuitry 948 of the system 900, cause the system 900 to operate in the manner described herein in respect of the system 900. For example, in some embodiments, the memory 950 of the system 900 may be configured to store program code or instructions that can be executed by the processing circuitry 948 of the system 900 to cause the system 900 to operate in accordance with the method described herein in respect of the system 900. Alternatively or in addition, the memory 950 of the system 900 can be configured to store any information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein. The processing circuitry 948 of the system 900 may be configured to control the memory 950 of the system 900 to store information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
[00112] In some embodiments, as illustrated in Fig. 9, the system 900 may optionally comprise a communications interface 952. The communications interface 952 of the system 900 can be connected to the processing circuitry 948 of the system 900 and/or the memory 950 of system 900. The communications interface 952 of the system 900 may be operable to allow the processing circuitry 948 of the system 900 to communicate with the memory 950 of the system 900 and/or vice versa. The communications interface 952 of the system 900 can be configured to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein. In some embodiments, the processing circuitry 948 of the system 900 may be configured to control the communications interface 952 of the system 900 to transmit and/or receive information, data, messages, requests, responses, indications, notifications, signals, or similar, that are described herein.
[00113] Although the system 900 is illustrated in Fig. 9 as comprising a single memory 950, it will be appreciated that the system 900 may comprise at least one memory (i.e. a single memory or a plurality of memories) that operates in the manner described herein. Similarly, although the system 900 is illustrated in Fig. 9 as comprising a single communications interface 952, it will be appreciated that the system 900 may comprise at least one communications interface (i.e. a single communications interface or a plurality of communications interfaces) that operates in the manner described herein. It will also be appreciated that Fig. 9 only shows the components required to illustrate an embodiment of the system 900 and, in practical implementations, the system 900 may comprise additional or alternative components to those shown.
[00114] Fig. 10 illustrates a system 1000 according to an embodiment. The system 1000 may comprise a state detector 1054 configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; an agent 1056 configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and a data stream processor 1058 configured to, where it is determined that a modification is to be performed, perform the determined modification.
[00115] The system 1000 may comprise a state detector 1054 configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; an agent 1056 configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and a data stream processor 1058 configured to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; the state detector being further configured to process the video data stream of the participant to determine a new state relating to the participant and determine a reward relating to the action; and the agent being further configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
[00116] In general, the various exemplary embodiments may be implemented in hardware or special purpose circuits, software, logic or any combination thereof. For example, some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device, although the disclosure is not limited thereto. While various aspects of the exemplary embodiments of this disclosure may be illustrated and described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that these blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
[00117] As such, it should be appreciated that at least some aspects of the exemplary embodiments of the disclosure may be practiced in various components such as integrated circuit chips and modules. It should thus be appreciated that the exemplary embodiments of this disclosure may be realized in an apparatus that is embodied as an integrated circuit, where the integrated circuit may comprise circuitry (as well as possibly firmware) for embodying at least one or more of a data processor, a digital signal processor, baseband circuitry and radio frequency circuitry that are configurable so as to operate in accordance with the exemplary embodiments of this disclosure.
[00118] It should be appreciated that at least some aspects of the exemplary embodiments of the disclosure may be embodied in computer-executable instructions, such as in one or more program modules, executed by one or more computers or other devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types when executed by a processor in a computer or other device. The computer executable instructions may be stored on a computer readable medium such as a hard disk, optical disk, removable storage media, solid state memory, RAM, etc. As will be appreciated by one of skill in the art, the function of the program modules may be combined or distributed as desired in various embodiments. In addition, the function may be embodied in whole or in part in firmware or hardware equivalents such as integrated circuits, field programmable gate arrays (FPGA), and the like.
[00119] References in the present disclosure to "one embodiment", "an embodiment" and so on, indicate that the embodiment described may include a particular feature, structure, or characteristic, but it is not necessary that every embodiment includes the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to implement such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.

[00120] It should be understood that, although the terms "first", "second" and so on may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another. For example, a first element could be termed a second element, and similarly, a second element could be termed a first element, without departing from the scope of the disclosure. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed terms.
[00121] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises", "comprising", "has", "having", "includes" and/or "including", when used herein, specify the presence of stated features, elements, and/or components, but do not preclude the presence or addition of one or more other features, elements, components and/or combinations thereof. The terms "connect", "connects", "connecting" and/or "connected" used herein cover the direct and/or indirect connection between two elements.
[00122] The present disclosure includes any novel feature or combination of features disclosed herein either explicitly or any generalization thereof. Various modifications and adaptations to the foregoing exemplary embodiments of this disclosure may become apparent to those skilled in the relevant arts in view of the foregoing description, when read in conjunction with the accompanying drawings. However, any and all modifications will still fall within the scope of the non-limiting and exemplary embodiments of this disclosure.

Claims
1. A computer implemented method for a teleconferencing system, the method comprising: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determining a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and where it is determined that a modification is to be performed, performing the determined modification.
2. The computer implemented method as claimed in claim 1, wherein the determined state is processed by a trained neural network, wherein if the state is an unwanted state the trained neural network determines a modification to be performed.
3. The computer implemented method as claimed in any preceding claim, wherein the state is further based on at least one of: elapsed time; elapsed time relative to a predetermined length of the teleconference; background context of the participant; a change in at least one of: background context of the participant, the emotional state of the participant; content shared amongst participants; complexity of content shared amongst participants; previous content shared amongst participants; future content to be shared amongst participants.
4. The computer implemented method as claimed in any preceding claim, wherein the processing to determine a state further comprises processing at least one of: an audio data stream of the at least one participant; content shared amongst participants.
5. The computer implemented method as claimed in any preceding claim, wherein the modification comprises at least one of: modifying the playback speed of an audio data stream of the participant for a time period; reducing the playback speed of an audio data stream of the participant; turning off the video data stream; reducing the playback speed of the video data stream; modifying the background of frames of the video data stream.
6. The computer implemented method as claimed in claim 5, wherein the modification of the background of frames of the video data stream of a participant comprises at least one of: modifying the background to a neutral setting; modifying hue; modifying saturation; modifying contrast; using image segmentation to modify the background.
7. The computer implemented method as claimed in any preceding claim, wherein an unwanted state is a state that has been predetermined to be likely to negatively impact a participant.
8. The computer implemented method as claimed in any preceding claim, wherein the determined modification further comprises determining a modification of the environment of the participant.
9. The computer implemented method as claimed in claim 8, wherein the modification of the environment of the participant comprises at least one of: turning on noise cancelling in a device of the participant; sending at least one of a notification and an actuation to a remote device; sending at least one of a notification and an actuation to a device comprised in the teleconference system.
10. A computer implemented method for training a neural network based on reinforcement learning, the method comprising: processing a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determining an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; if a modification is to be performed, modifying the at least one of the audio data stream and the video data stream by performing the determined modification; processing the video data stream of the participant to determine a new state relating to the participant; determining a reward relating to the action; and training a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
11. The computer implemented method as claimed in claim 10, wherein the method further comprises determining the reward relating to the action responsive to receiving input from the participant relating to the modification.
12. The computer implemented method as claimed in claim 10 or 11, wherein the reward indicates the benefit of the modification to the participant.
13. The computer implemented method as claimed in any of claims 10 to 12, wherein the state comprises information on at least one of: the emotional state of the participant; elapsed time of the teleconference; elapsed time relative to a predetermined length of the teleconference; content shared between participants; the state of the environment of the participant; a time at which the state is observed.
14. The computer implemented method as claimed in any of claims 10 to 13, wherein an action comprises processing instructions indicating how the at least one of the audio data stream and the video data stream are to be modified or an indication that no modification is to be performed.
15. The computer implemented method as claimed in any of claims 10 to 14, wherein the processing to determine a state relating to the participant further comprises processing at least one of: an audio data stream of the at least one participant; content shared amongst participants.
16. The computer implemented method as claimed in any of claims 10 to 15, wherein in a first number of iterations of the method, the method comprises selecting a random action to perform for an unwanted state.
17. The computer implemented method as claimed in claim 16, wherein in a second number of iterations of the method, the neural network selects an action to perform for the unwanted state.
18. The computer implemented method as claimed in any of claims 10 to 17, wherein a plurality of iterations of the method are performed up to the determining of the reward, wherein the method further comprises storing a plurality of rewards, and the neural network is trained using the plurality of rewards.
19. A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 1 to 18.
20. A computer-readable storage medium comprising instructions which, when executed by a computer, cause the computer to carry out the method of any of claims 1 to 18.
21. A data processing system configured to: process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and where it is determined that a modification is to be performed, perform the determined modification.
22. The data processing system as claimed in claim 21, wherein the system is configured to perform the method of any of claims 1 to 9.
23. A data processing system configured to train a neural network based on reinforcement learning, the system being configured to: process a video data stream of at least one participant of a teleconference to detect a state relating to a participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; process the video data stream of the participant to determine a new state relating to the participant; determine a reward relating to the action; train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
24. The data processing system as claimed in claim 23, wherein the system is configured to perform the method of any of claims 10 to 18.
25. A data processing system comprising a processor and a memory, said memory containing instructions executable by said processor, whereby said system is operative to: process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and where it is determined that a modification is to be performed, perform the determined modification.
26. A data processing system for training a neural network based on reinforcement learning, the data processing system comprising a processor and a memory, said memory containing instructions executable by said processor, whereby said system is operative to: process a video data stream of at least one participant of a teleconference to detect a state relating to a participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; determine an action that pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; process the video data stream of the participant to determine a new state relating to the participant; determine a reward relating to the action; and train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
27. A data processing system comprising: a state detector configured to process a video data stream of at least one participant of a teleconference to determine a state relating to the participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; an agent configured to determine a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; and a datastream processor configured to, where it is determined that a modification is to be performed, perform the determined modification.
28. A data processing system for training a neural network, the data processing system comprising: a state detector configured to process a video data stream of at least one participant of a teleconference to detect a state relating to a participant, wherein the state is based on at least one of an emotional state of the participant and a state of an environment of the participant; an agent configured to determine an action which pertains to a modification of at least one of the video data stream and an audio data stream of the at least one participant to be performed if the state is an unwanted state; a datastream processor configured to, if a modification is to be performed, modify the at least one of the audio data stream and the video data stream by performing the determined modification; the state detector further configured to process the video data stream of the participant to determine a new state relating to the participant and determine a reward relating to the action; and the agent further configured to train a neural network based on: the state relating to the participant, the new state relating to the participant, the reward relating to the action, and the action.
PCT/EP2022/076031 2021-10-01 2022-09-20 Methods and apparatuses for teleconferencing systems WO2023052187A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GR20210100665 2021-10-01
GR20210100665 2021-10-01

Publications (1)

Publication Number Publication Date
WO2023052187A1 true WO2023052187A1 (en) 2023-04-06

Family

ID=83898091

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2022/076031 WO2023052187A1 (en) 2021-10-01 2022-09-20 Methods and apparatuses for teleconferencing systems

Country Status (1)

Country Link
WO (1) WO2023052187A1 (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090033737A1 (en) * 2007-08-02 2009-02-05 Stuart Goose Method and System for Video Conferencing in a Virtual Environment
US20160148043A1 (en) * 2013-06-20 2016-05-26 Elwha Llc Systems and methods for enhancement of facial expressions
US9576190B2 (en) 2015-03-18 2017-02-21 Snap Inc. Emotion recognition in video conferencing
US20210185276A1 (en) * 2017-09-11 2021-06-17 Michael H. Peters Architecture for scalable video conference management
US20190289258A1 (en) * 2018-03-16 2019-09-19 Lenovo (Singapore) Pte. Ltd. Appropriate modification of video call images
EP3772850A1 (en) * 2019-08-08 2021-02-10 Avaya Inc. Optimizing interaction results using ai-guided manipulated video
US20210099672A1 (en) * 2019-10-01 2021-04-01 Hyperconnect, Inc. Terminal and operating method thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
J. A. RUSSELL: "A circumplex model of affect", J. PERS. SOC. PSYCHOL., vol. 39, no. 6, 1980, pages 1161 - 1178, XP055648211, DOI: 10.1037/h0077714

Similar Documents

Publication Publication Date Title
US10861483B2 (en) Processing video and audio data to produce a probability distribution of mismatch-based emotional states of a person
Macdonald et al. Gaze in a real-world social interaction: A dual eye-tracking study
US10956831B2 (en) Detecting interaction during meetings
US9711056B1 (en) Apparatus, method, and system of building and processing personal emotion-based computer readable cognitive sensory memory and cognitive insights for enhancing memorization and decision making skills
Ba et al. Multiperson visual focus of attention from head pose and meeting contextual cues
US8243116B2 (en) Method and system for modifying non-verbal behavior for social appropriateness in video conferencing and other computer mediated communications
US20160042648A1 (en) Emotion feedback based training and personalization system for aiding user performance in interactive presentations
US10834456B2 (en) Intelligent masking of non-verbal cues during a video communication
US11546182B2 (en) Methods and systems for managing meeting notes
JP2021044001A (en) Information processing system, control method, and program
Ogawa et al. Favorite video classification based on multimodal bidirectional LSTM
EP3693847B1 (en) Facilitating awareness and conversation throughput in an augmentative and alternative communication system
WO2017062163A1 (en) Proxies for speech generating devices
US11632258B1 (en) Recognizing and mitigating displays of unacceptable and unhealthy behavior by participants of online video meetings
US20170301037A1 (en) Group discourse architecture
Byun et al. Honest signals in video conferencing
Bilac et al. Gaze and filled pause detection for smooth human-robot conversations
Zeng et al. Emotion recognition based on multimodal information
WO2023052187A1 (en) Methods and apparatuses for teleconferencing systems
US20230402191A1 (en) Conveying aggregate psychological states of multiple individuals
Shan et al. Speech-in-noise comprehension is improved when viewing a deep-neural-network-generated talking face
WO2022269801A1 (en) Video analysis system
WO2022201270A1 (en) Video analysis program
WO2022269802A1 (en) Video analysis system
KR20190030549A (en) Method, system and non-transitory computer-readable recording medium for controlling flow of advertising contents based on video chat

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22790300

Country of ref document: EP

Kind code of ref document: A1