WO2010125488A2

WO2010125488A2 - Prompting communication between remote users

Info

Publication number: WO2010125488A2
Application number: PCT/IB2010/051605
Authority: WO
Inventors: Pavankumar M. Dadlani Mahtani; Aki S. Harma; Marten J. Pijl; Steven L. J. D. E. Van De Par; Boris E. R. De Ruyter; Mauro Barbieri; Caifeng Shan
Original assignee: Koninklijke Philips Electronics N.V.
Priority date: 2009-04-29
Filing date: 2010-04-14
Publication date: 2010-11-04

Description

Prompting communication between remote users

FIELD OF THE INVENTION

The invention relates to a method of communicating information relating to remote users, a system for communicating information relating to remote users, and a computer program.

BACKGROUND OF THE INVENTION

US 2007/0300174 Al discloses a monitoring system and method that involves monitoring user activity in order to facilitate managing and optimizing the utilization of various system resources. A monitoring component can monitor and collect activity data from one or more users on a continuous basis, when prompted, or when certain activities are detected. Activity data can include, but is not limited to the following: the application name or type, document name or type, activity template name or type, start/end date, completion date, category, priority level for document or matter, document owner, stage or phase of document or matter, time spent, time remaining until completion and/or error occurrence. User data about the user who is engaged in such activity can be collected as well. This can include the user's name, title or level, certifications, group memberships, department memberships, experience with current activity or activities related thereto, current physiological and emotional state and/or current projects. The monitoring component continues to monitor other user activity in addition to a target activity. As a result, the system can locate other users who are currently working on or involved with an activity similar to the target activity. A different exemplary monitoring system includes an aggregation component that aggregates activity data and the corresponding user data from local and/or remote users. An analysis component can process this data and then group it according to which users appear to be working on the same project or are working on similar tasks. A problem of the known system and method is that it only recognizes user activities defined by the devices and software applications made available to the various users. SUMMARY OF THE INVENTION

It is an object of the invention to provide a method, system and computer program of the types referred to above for supporting dispersed families and friends in staying connected in a wide variety of everyday situations. This object is achieved by the method according to the invention, which includes: obtaining data derived by analyzing and extracting descriptive features from data including at least one of audio and video data obtained at remote locations; detecting when matches indicative of particular contextual similarities occur, based on the data derived by analyzing and extracting descriptive features, the contexts including at least one of: activities of users at the remote locations, environments of users at the remote locations and moods of users at the remote locations; and - making at least one remote user at a remote location aware of synchronous matching contexts.

A match indicative of a particular similarity can be a match indicating that the contexts are different in some way. Thus, a particular similarity means no more than that there is a pre-determined relationship between at least certain aspects of the contexts being compared.

Because the method uses at least one of audio and video data, it is not reliant on specially modified devices operated by a user that are configured to facilitate the recognition of pre-determined operations using those devices. By obtaining data derived by analyzing and extracting descriptive features from at least one of audio and video data obtained at two locations, the method becomes suitable for putting remote parties in touch. It is not necessary to communicate all the audio and video data to a central server or the other of the two parties. This also makes the method less intrusive. The descriptive features are numerical in one embodiment. In another, they additionally or alternatively include binary or enumerated features. The extracted descriptive features can be those suitable for a supervised or unsupervised learning method directed at determining contexts. Potentially, they can be used in a direct comparison to determine similarity of contexts. Because the method includes making at least one remote user aware of synchronous contexts, it is potentially able to prompt communication at the right moment, namely when the similarity of context is actually occurring. There are an increasing number of dispersed family members and generations living apart. They are often struggling to stay connected and losing touch. Due to the distance that separates them it is hard to share moments together, and people often miss this. Furthermore, it is difficult to know when it is a good moment to talk. Often calls are cut short between parents and children. Studies show that many parents and children want to know when it is a good time for both parties to have a good phone conversation, instead of constantly 'polling' each other (mostly parents polling children) to see if it is a good moment to talk. The present method provides a solution to this.

In an embodiment, the at least one remote user is made aware of synchronous matching contexts by causing information to be rendered at at least one of the remote locations.

An effect is that this user or the users can still decide whether to open a communication link, e.g. make a phone call or open an audiovisual communication link. In a variant of this embodiment, an object related to at least one of a matching activity and task is augmented by an output device for rendering the information.

In the light of the description, "synchronous" can at least mean to an accuracy of one minute. The step of "making aware" may or may not be carried out in real-time. There can be a time lag between determining a match and making users aware of matching contexts. In a variant, the step of causing information to be rendered is carried out on the basis of user preferences for rendering mechanisms.

This is suitable for providing users with information on the type of similarity detected, or to provide information only when a certain type of similarity is determined to occur. Thus, the user is provided with more information for taking a decision on whether to open a communication link. In a variant of this embodiment, one of a number of available rendering modalities is selected based on the user preferences.

An embodiment of the method includes detecting when particular contextual similarities occur by determining similarities in the audio-visual feature data directly.

This embodiment has the effect of enabling a system carrying out the method to determine similarities in a wide range of contexts (activities, environments, etc.) that are not necessarily tied to particular devices a user can operate. The effect is achieved without making use of supervised learning (where a user indicates what each event is). The unsupervised classification is used to detect the similarity in activities at two locations and thereafter it is used to trigger communication. To determine the similarity in context between two locations or people, visual scene interpretation, recognition of sound events and final data fusion and reasoning steps are left out. Instead, the focus is on determining the similarities in the raw audio -visual feature data. In this way, there is no need for training of the system for the local environment. Audio visual features are processed to determine the similarity of context at remote locations in real time without explicit training. The proposed approach does not require extensive on-site training of an activity recognition system implementing the method.

If a customer purchases an activity recognition system, it is undesirable to require that the customer spends several hours training the system with a huge technical handbook of the product and later correcting the recognition errors in the trained classifier. In addition, it would be unrealistic to have a global database of activity patterns that would generalize well to different home environments and activities. Thus, any supervised pattern recognizer would have to be trained locally by the user.

In unsupervised pattern recognition, no off-line training is needed and the system learns regularly occurring events during the operation of the system like a child learns. However, the problem then is that although the system can learn all kinds of regular activity classes just by listening, nobody really knows what real activities these classes will represent. This is a fundamental problem in unsupervised classification and the reason why it is difficult to make it work in a meaningful way. The method outlined herein is actually a superb application for unsupervised pattern recognition. It is not fundamentally necessary to know in this application what the activities are. For example, a system implementing the method doesn't need to know if a person is preparing a meal or doing some physical exercises. It is enough to know that the activities performed by users at two ends of a communication link have some similarity. This can be performed using the present method, and it is possible to do without any training of the system in users' homes or need for some external pattern recognition service provider. It is simply a peer-to-peer similarity check.

An embodiment of the method includes receiving from the users an indication of agreement that specific activities should be determined to match in future.

This allows the users to establish a connection when only a certain degree or type of similarity is detected. A system carrying out the method can also notify users when a certain combination of activities (not necessarily the same) is taking place. A system implementing the method may propose to users that they agree on new specific activities or rituals that would make the system activate, which is possible due to the dynamic unsupervised training model. In this way they could even create their own (secret) language to communicate through the system.

The same system, in an embodiment, uses the same component to implement other applications of the unsupervised bilateral feature-pattern similarity analysis. This includes an application referred to as Call Shared Activity, for when two users want to get connected to do something together (watch a movie, engage in physical exercise, have a cup of coffee). The first time they do this; they open the call and indicate to the system that these activities should be matched. The system learns the activities by comparing feature vectors obtained by analyzing and extracting descriptive features from data including at least one of audio and video data obtained at a location of a user. In this application, not only the similarity but also the dissimilarity of the activities (distance in some feature vector space) is learned. If, later, one of the users starts doing the same, the system can propose that he or she be connected to the other, and that the other person be invited to join so that they can do it together. An embodiment of the method includes, when a user starts carrying out a certain activity, informing the user when another user was last involved in the same activity. If this was, for example, some physical exercise (running on a treadmill), this feature could motivate to stay fit. Again, there is no need for training activities for the system or for connecting the treadmill to a phone or anything complicated. Thus, this embodiment provides an improved sense of connectedness between dispersed members of a family or other social group.

In an embodiment, detecting when matches indicative of particular contextual similarities occur is based on user preferences.

Thus, the matching process is modified in accordance with the user preferences. This embodiment can be used if it is undesirable always to perform the same action, i.e. always make the user aware of synchronous matching contexts in the same way. It can also be used to limit the number of times that users are made aware of matches, by providing output only in case of certain types of matches.

In an embodiment, the data derived by analyzing and extracting descriptive features from data including at least one of audio and video data obtained at remote locations are received at a first location, associated with one of the users, from a second location, associated with the other of the users, and the determination of similarity is carried out at the first location. This embodiment thus does not require a central server. For this reason, it scales well with additional users. Users can subscribe to information relating to only certain other users or locations associated with such users, whereupon such information is exchanged directly between computing devices at either side. In an embodiment, when a contextual similarity is matched, a communication link is made automatically.

In this embodiment, users need not first themselves determine who the other person with matching context is and what their contact details are. This is taken care of by the system carrying out the method. In a variant of this embodiment, the communication link is made automatically based on user preferences, in particular user preferences of one of the remote users relating to an identity of another of the remote users.

In an embodiment, at least one of audio and video data from a camera is used to register a location of a user.

Environment is a relatively important aspect of a context a person is in. The use of video data to register a location of a user means that the number of environments a user can be in is not limited to those environments associated with the places where particular devices the user might operate are positioned. The system is configured such that it identifies the environment automatically. Using audio and video analysis it is possible to identify if a room is a living room (couches and tables present) or if it is a kitchen (stove and fridge are present, plus the sounds of cutlery and plates) or if it is a bedroom (presence of a bed), etc. This could potentially remove the load from the user in terms of set-up and installation of a system implementing the present method.

In an embodiment of the method, the data from which descriptive features are extracted further include data from at least one sensor sensitive to some variable related to at least one of a user and an environment of the user.

This embodiment is useful for better matching of contexts, in particular also based on moods. Moreover, activities can be characterized more accurately. This is particularly true where a limited or no interpretation is carried out, and feature vectors based on the extracted descriptive features are compared directly. An embodiment of the method includes obtaining activity data and availability data from locations associated with two of the remote users and at least one of: signaling that opening a connection for communication between the two locations is possible to the users, and opening the connection for communication between the two locations. This particular embodiment provides for the decoupling of activity and availability information. Each can be made dependent on the communication partner. The embodiment provides users with more control over what is communicated to other users. A communication system implementing the proposed method is based on detection of a user's activities and availability in one environment from cameras, microphones and possibly other sensor data, encoding and transmitting the information to a remote location, and consuming it at the other location by displaying it to other users, or controlling some other communication devices based on the information. When there is a match in activity data, this is indicated. The proposed system may be used together with another communication system such as a telephone or a video conference system to make it possible for the users to determine the best moment to call.

In a further embodiment, a user can change his or her behavior to create a shared experience, and thus maintain a feeling of connectedness. This can be initiated by one of users, or after the method makes at least one remote user at one of the remote locations aware of an absence of matches indicative of any of the particular contextual similarities, in particular an absence over a certain period of time.

In a variant of this embodiment, the at least one remote use is made aware of an absence of matches by causing information to be rendered at at least one of the remote locations. In particular, the information can be rendered such as to provide information indicative of at least an aspect of a context of at least one other of the remote users. The context can be a current or expected context of at least one other of the remote users.

Thus, the user is made aware of the lack of shared experiences. In particular, the user can be prompted to engage in particular type of activity, simply by the choice of rendering modality. For example, a fork or other item of cutlery can be augmented with LEDs, and light up when the other remote person is eating, thus prompting the user to start cooking or eating. Alternatively, the fork or other item of cutlery can light up when the other user is expected to be having dinner according to data in a diary system.

According to another aspect, the system for communicating information relating to remote users according to the invention includes: - a system for obtaining data derived by analyzing and extracting descriptive features from at least one of audio and video data obtained at remote locations, wherein the system for prompting communication is configured to detect when matches indicative of particular contextual similarities occur, based on the data derived by analyzing and extracting descriptive features, the contexts including at least one of: activities of users at the remote locations, environments of users at the remote locations and moods of users at the remote locations; and an interface to a device at at least one of the remote locations for making at least one remote user at a remote location aware of synchronous matching contexts.

In an embodiment, the system is configured to carry out a method according to the invention.

According to another aspect of the invention, there is provided a computer program including a set of instructions capable, when incorporated in a machine-readable medium, of causing a system having information processing capabilities to perform a method according to the invention.

BRIEF DESCRIPTION OF THE DRAWINGS The invention will be explained in detail with reference to the accompanying drawings, in which:

Fig. 1 shows a system for context determination using audio -visual scene analysis;

Fig. 2 is a schematic diagram illustrating functional units of the system for context determination using audio -visual scene analysis;

Fig. 3 is a flow chart illustrating a method for context-matching using audiovisual scene analysis;

Fig. 4 is a schematic diagram of an alternative system for matching context using a comparison of received feature vectors corresponding to individual audio events at remote locations;

Fig. 5 illustrates feature vectors used in the system of Fig. 4;

Fig. 6 is a schematic diagram illustrating an example of a home communication system concept incorporating a system for matching contexts, the home communication system including an Audio-Visual real-time Communication system and a Status and Availability Communication system;

Fig. 7 is a schematic diagram of functions implemented by the system for matching contexts included in the system of Fig. 6;

Fig. 8 is a flow chart illustrating steps carried out at a transmitter side in the system of Fig. 6; and Fig. 9 is an exemplary block diagram of a receiver side of the Status and Availability Communication (SAC) system.

DETAILED DESCRIPTION OF THE EMBODIMENTS By means of context determination on both sides, it is possible to match similarities. A first system is based on audio and video scene analysis for context determination. By using a combination of cameras and microphones, certain events are detected in the environment. A particular sequence of them will lead to the detection of activities, and these are recorded in a central database. The system further reasons upon these activities to provide a meaningful purpose, (e.g. match both parties' availability, match activities, etc.). The system checks whenever a match is made in both databases and triggers a particular signal at each location (e.g. a glowing picture frame).

In addition to matching activities or availability, other contextual similarities can be matched, such as being in the 'same' area (both are in their kitchens or living rooms). Mood or emotions can be detected using other existing technologies and methods and conveyed to both parties. Even minor instances such as wearing similar clothes (by using video analysis) could be detected and rendered. Instances of any of these are placed in the database and an algorithm looks for matches happening at the same time on both sides.

To provide users with awareness of contextual similarity, the information is in one embodiment rendered in some unobtrusive form or another. Some potential rendering forms include: (1) using lamps which will display color patterns representing similarities on both sides (e.g. it is agreed that a yellow glow means cooking and yellow flickering means cooking a similar item, a green glow means watching TV and other patterns of green means watching the same program or genre, etc.); (2) a picture frame to provide similarity cues on the screen and/or with light patterns around it; (3) the object related to the activity/task/etc, being matched is augmented. For instance, if both users are eating then part of the table can indicate this (light or pattern formed). If both are in the kitchen, then the stove could indicate this, etc.

In this way, it is possible automatically to detect shared moments and experiences that will trigger dispersed family and friends to feel connected at 'some level' and think about each other. This trigger could encourage them to initiate communication, (e.g. a phone call, an e-mail, a simple SMS). Furthermore, in a further embodiment, users are able to set up their system such that whenever a contextual similarity is matched and rendered, an automatic communication is made, e.g. phone rings are triggered on both sides.

Fig. 1 illustrates devices of one embodiment of the system that are provided at a location associated with the user (e.g. in the user's home). It is part of a larger system configured to support, for example, dispersed families and friends in staying connected by prompting moments of similar contextual occurrences, such as similar activities, location, availability, mood or even other small similar instances. This larger system captures context information in the environment of two or more people (that is, it understands what is happening, what activities they are engaged in, etc.) and based on matching preferences and preferences for rendering mechanisms, automatically prompts or renders something when a particular contextual similarity occurs. To be able to match contextual similarities of two remote parties (e.g. parent and child), the following steps are carried out: (1) understand what each party is doing at home, (2) capture and understand user preferences (e.g. for making a match of activities), and (3) do something based on what each party is doing and each of their matching preferences.

The components of the system that are provided in an environment of a user in the embodiment of Fig. 1 include a computer device 1, comprising a central processing unit 2 and a memory unit 3, e.g. in the form of a random access memory (RAM) module. The audio and video analysis and other steps of the method can be performed by the CPU 2, suitably arranged to implement the present method and enable the operation of the device as explained herein. The processor 2 may be arranged to read from the memory unit 3 at least one instruction to enable the functioning of the device.

The computer device 1 is coupled to input units 4-6 including at least one camera and at least one microphone for obtaining audio and video information. The input units 4-6 may comprise a photo camera for taking pictures.

The computer device 1 also comprises display means 7, which may be any conventional means for presenting video information to the user, for example, a CRT (cathode ray tube), LCD (Liquid Crystal Display), LCOS (Liquid Crystal on Silicon) rear- projection technology, DLP (Digital Light Processing) television/Projector, Plasma Screen display device, etc.

The expression "audio data", or "audio content", is hereinafter used as data pertaining to audio comprising audible tones, silence, speech, music, tranquility, external noise or the like. The audio data may be in formats like the MPEG-I layer III (mp3) standard (Moving Picture Experts Group), AVI (Audio Video Interleave) format, WMA (Windows Media Audio) format, etc. The expression "video data", or "video content", is used as data which are visible such as a motion picture, "still pictures", video text etc. The video data may be in formats like GIF (Graphic Interchange Format), JPEG (named after the Joint Photographic Experts Group), MPEG-4, etc. The text information may be in the ASCII (American Standard Code for Information Interchange) format, PDF (Adobe Acrobat Format) format, HTML (HyperText Markup Language) format, for example. The meta-data may be in the XML (Extensible Markup Language) format, MPEG7 format, stored in a SQL database or any other format. The computer device 1 also includes a content storage unit 8, for example a computer hard disk drive, a versatile flash memory card, e.g., a "Memory Stick" device, etc.

The computer device 1 is provided with a network interface 9 for sharing context information with other computer devices (not shown).

In the illustrated embodiment, there is further provided a communication device in the form of a telephone 10. There is also an output device 11 in the form of a picture frame. It comprises a network interface 12 for receiving data, a screen 13 for displaying digital images and an ambient lighting device 14, for example configured to provide illumination around the screen 13 in one of several available colors and/or patterns. The methods illustrated herein could provide distinctive features to several such consumer lifestyle products, such as photo frames, ambient lighting devices (e.g. of the type available from Philips under the trade name "living colors"), DECT phones, etc. The computer device 1 is configured to include several functional units (Fig. 2), which may be implemented by software modules executed by the CPU 2 and stored in the content storage unit 8. In particular, it includes a module 15 for extracting features from video data received from a camera 16 and a module 17 for extracting descriptive features from audio data received from a microphone 18.

A module 19 for scene interpretation uses the video features to analyze the scene captured by the camera. In an embodiment, this involves recognizing the type of room or space in which the scene is set (living room, kitchen, etc.). It can also include recognizing users present in the scene. It can further include recognizing actions being undertaken by the users in the scene (e.g. cooking, eating, etc.). A module 20 for recognizing sound events uses the audio features extracted by the audio feature extraction module 17 to detect events. A further module 21 combines the outputs of the two analysis modules 19,20 to provide data representative of context information 22, which includes at least one of data representative of activities being undertaken by a user and data representative of characteristics of an environment of the user.

In one embodiment, the output 22 of the final module 21 is transmitted to a central server (not shown) for matching with similar output from a similar system in another user's home. In another embodiment, the output 22 is directly transmitted to the other user's home. In that embodiment, no central server is present.

Fig. 3 illustrates steps executed by two systems of the general type illustrated in Fig. 1. A first system is present in a home 23 of a first user 24. A second system is present in a home 25 of a second user 26. The method is illustrated here using two homes 23,25 of two remote users 24,26, but can take further homes 27 into account. That is to say that it can be scaled up to match contexts of more than two users.

Each home 23,25 is provided with at least one camera 28, at least one microphone 29 and, optionally, at least one sensor 30 sensitive to some variable related to at least one of a user and an environment of the user 24,26. A sensor 30 can in fact be carried by the user 24,26, and transmit data to a receiver (not shown) in the home. Thus, data representative of values of variables characterizing a physiological feature of the user can be captured. In other embodiments, the sensors 30 capture values representative of the temperature, humidity, etc. of at least part of the home 23,25.

In a first step 31,31' data is obtained from locations in the respective homes 23,25, and logged in databases 32,32'. In general, the system uses any or a combination of microphone(s), camera(s), and sensor firings (e.g. passive infra red, switches, etc., and even on-body sensors) to capture raw context data from a person's environment.

In a subsequent step 33,33', audio, video and sensor analysis and reasoning is applied to the raw data. Using audio, video, and sensor scene analysis certain events are detected in the environment (e.g. opening a door, closing a fridge, switching on a stove) based on the raw context data. Using a reasoning engine, based on the sequence and timing of these events, the system captures context information, such as the activities done by each party, e.g. cooking, eating, sleeping, etc. In each case, the raw data is analyzed and descriptive numerical features are extracted, which are then used to determine what events are taking place and in what environment in the home 23,25. In particular, at least one of audio data and video data from a camera is used to register locations of the respective users 24,26.

The step 33,33 is based on at least one of a set 34 of rules (default or predefined) and a set 35 of data trained by a user. The interpretation of a scene typically contains information about the presence of people in the scene, their activities, and possibly their interaction with other objects of the scene. The acoustic scene analysis focuses on identification of the sound events and possibly localization of sounds based on multi-microphone techniques. Both visual scene interpretation and recognition of sound events require training of the system using a database of pre-classifϊed examples. The final determination of the context based on integration of the visual scene interpretation and recognition of sound events also requires training with a database representing typically manually labeled contexts. Therefore, the training of the entire context determination system requires training of the system in the environment. This may require lot of work from the end-user (end-user programming) to make the system really functional.

After the events and environment have been identified, context information 22 is derived (step 36,36'). That is to say that information on the events and environment is aggregated to determine what is happening in what kind of environment of which user 24,26. The context information is shared between the systems associated with the two users 24,26 or transmitted to a central server (not shown). Assuming that it is shared between the two systems, then each system receives the context information from the other system and carries out a step 37 of assessing what to do, based on preference information 38, the context information generated by it and the context information received by it. In particular, the system detects when matches indicative of particular contextual similarities occur, including at least one of: a match between activities between the users and a match between environments of the users, based on the context information 22.

In the present method, the system looks for particular kinds of similarities, which may include looking into not only contextual similarities, but also contextual dissimilarities. Instead of matching similar situations, the system will match dissimilar situations. This can increase the parent's or child's curiosity towards each other to know what the other is doing, and thus feel connected.

The user is allowed to configure the system implementing the method of Fig. 3 with several preferences. For example, one of the preferences could be to have the system inform them whenever the remote person is doing a certain activity or make a match whenever they are doing the same activities, like both sides are cooking.

Once the system can identify the context at each party's location, users can configure the system in several ways to make a match: for example, "if I'm eating in the evening, I'm available. If I am available and mom is too, inform me." In addition, the user can program preferences as to what to share with remote parties, such that the user has full control.

In addition to matching activities or availability, other contextual similarities can be matched. For example, being in the 'same' area (both are in their kitchens or living rooms). This could be done with fairly simple sensor, audio, or video analysis of the environment.

Matching mood or emotions can be detected using existing technologies and methods, and conveyed to both parties. For example, the user might configure "If Fm happy and my parent is happy then inform me". Even minor instances such as wearing similar clothes (by using video analysis) could be detected and rendered.

In one embodiment, an absence of matches indicative of any of the particular contextual similarities, in particular an absence over a certain period of time, is also detected. This can be implemented by tracking the time elapsed since a last match. User preferences will not only specify what to match and when, but also how to render it or be informed about it.

This leads to a final step 39 of rendering awareness information expressing the fact that the contexts match (in the sense of being in a particular relation to each other). It is possible to automatically detect shared moments and experiences that will trigger dispersed family and friends to feel connected in 'some level' and think about each other. This trigger could encourage them to initiate communication, (e.g. a phone call, an email, a simple SMS).

To provide users with awareness of contextual similarity, the information would be rendered in some unobtrusive form or another. Some promising potential rendering forms include, for example, using lamps like ambient lighting devices that will display color patterns representing similarities from both sides. Alternatively, similarity cues can be provided on the screen 13 of the picture frame 11 (Fig. 1) and/or certain light patterns can be provided around it using the ambient lighting device 14.

In one embodiment, the user is allowed to pre-configure the output for matches corresponding to different levels of similarity . For example, a glow of yellow means both parties are cooking, and yellow flickering means cooking a similar item, glows of green means watching TV at the same time, and other patterns of green means watching the same program or genre, and so on.

An extension to the system is to have the possibility of equipping objects with a rendering mechanism (e.g. a strip of LED' s). Then, users could configure their system such that the object related to the activity/task/etc, to be matched is augmented, e.g. if both parties are eating at the same time and they augmented a spoon or part of the table with an LED strip as the rendering object, then these objects would light up, rendering the match.

Furthermore, an attractive extension to the method is the possibility to have users set up their system such that whenever a contextual similarity is matched and rendered, an automatic communication is made, e.g. phone 10 rings are triggered on both sides. Furthermore, different ring tones could indicate the context of the call (e.g. match of activity or TV program or location, etc.).

The rendered information in one embodiment distinguishes between different levels of similarity (eating, eating pasta, eating pasta at the table, etc, or watching TV and watching the same TV program, etc.).

All these rendering modalities could also be used in a similar manner in the embodiment in which an absence of matches indicative of any of the particular contextual similarities, in particular an absence over a certain period of time, is also detected. In particular, information can be rendered such as to provide information indicative of at least an aspect of a current context of at least one other of the remote users. For example, a rendering modality associated with the aspect of the current context of at least one other remote user can be selected, either in accordance with default settings or in accordance with user preferences. Information can also be rendered such as to provide information indicative of at least an aspect of a planned future context of at least one other of the remote users, in particular a future activity. For this purpose, an electronic diary can be consulted.

In an alternative system to one implementing the method of Fig. 3, the trained data 35 can be dispensed with so that users would not have to configure anything. This alternative is a location-based matchmaker (i.e. if both users 24,26 are in their kitchens, they are both informed about it).

Figs. 4 and 5 illustrate a further alternative that also dispenses with the training phase.

In this embodiment, to determine the similarity in the context between two locations or people, the visual scene interpretation, recognition of sound events and the final data fusion and reasoning steps are left out. Instead, this embodiment focuses on determining similarities in raw audio-visual feature data.

The system is based on the analysis and extraction of descriptive features from camera and a microphone signals. The same principle can also be used in systems consisting of different numbers of cameras and microphones, only cameras or microphones, or those combined with other type of sensors such as weather or temperature sensors, mechanical sensors, proximity or presence sensor technologies, biophysical sensors such as heart rate or EEG data, usage statistics from a home appliance such as a PC, telephone, or home theatre system, or any other sensor sensitive to some variable related to the user and the user's immediate environment.

Thus, the system includes at least a first camera 40 in an environment associated with a first user and a second camera 41 in an environment associated with a second user. The first and second users are at locations remote from each other. In the illustrated embodiment, there is also at each location at least one microphone 42,43. At each location, there is a system 44,45 for analyzing and extracting descriptive features from the video data and a system for analyzing and extracting features from the audio data 46,47. The analysis of the video and audio data is in this case no more than an algorithm for obtaining, e.g. by calculation, a more compact description of parts of the audio and/or video data. The features are provided to a context similarity analysis system 48.

In one embodiment, the system is bilateral and synchronous in such a way that the context similarity is determined between two or more locations in real time. However, the same idea can be also used in the asynchronous case where the data from one or more locations has been stored at one time and retrieved for the similarity analysis at another time. This also can be extended to a general case where the context similarity is determined between the local environment and a database of several remote environments where the database may contain additional information about the context, for example, based on manual segmentation and labeling of the data by other users, or a service provider.

The camera feature extraction may be used to determine the presence of human character in the camera picture, the number of humans and their locations, and their activity level determined from the motion vectors computed from the video image.

The acoustic features may be mel- frequency cepstrum coefficients (MFCC) computed from temporal segments of the microphone signal corresponding to audio events. Typically the segmentation of the audio events is based on signal energy and the properties of the background noise (cf. e.g. A. Harma, M. F. McKinney, J. Skowronek, "Automatic surveillance of the acoustic activity in our living environment", Proc. IEEE Int. Conf. Multimedia and Expo (ICME'2005), Amsterdam, The Netherlands, July 2005.)

In one embodiment the descriptive features computed from the camera and microphone signals at one end are sent to the other end and the determination of the similarity in the context between the two locations is determined there. In that case, there are two context similarity analysis systems 48. Another embodiment contains a service component; so that descriptive features computed at all locations are sent to a central device for the similarity analysis. The similarity between feature vector data is determined by comparing the similarities of individual feature vectors from two locations within a predetermined temporal window. For example, Fig. 5 illustrates the comparison of received feature vectors corresponding to individual audio events within a one minute observation frame at the two locations. There are many alternative embodiments for the fixed one minute window including various sliding -window formulations and adaptive observation windows based on segmentation of the observed event sequence.

In one embodiment, the similarity comparison would be based on computing the Euclidean distances between the feature vectors Sl_n and S2_m, where n=0,...,N-l and m=0,...,M-I, and N and M are the numbers of events in two event sequences, respectively. In the case of Fig. 5, this gives 4x5 = 20 distance measures. From the distance measures the best matching event pairs are determined and additional metrics are computed to finally derive one value which represents the probability that the two locations have a similar context.

Next, depending on a threshold value for the probability and other settings of the communication system, the system then indicates to the users at the two locations that they have a similar situation. This step is similar to the last step 39 in the method of Fig. 3.

The proposed context determination method is applicable to various home and business communication technologies including telephony, videotelephony, instant messaging in PCs and other appliances such as photo frames.

Figs. 6-9 illustrate a further embodiment of a method and system for prompting communication between remote users.

The proposed approach addresses all the disadvantages of the traditional technology. A first Status and Availability Communication (SAC) system 49 is proposed which includes a functional module 50 for compressing audio and video data from at least one microphone 51 and at least one camera 52 to abstract information about the activities and availability of a person. It is connected to another SAC system 53 at a remote location associated with another user. The connection is via a link through a network 54, e.g. the Internet. This two-way system can be open continuously because it is consuming little power in terminal devices, the transmitted bit rate is very low, and the abstract status/availability representation provides a good protection of privacy. An example of the entire communication system consisting of SAC systems 49,53 and first and second audiovisual communication (AVC) systems 55,56 is illustrated in Fig. 6. When the user is in active conversation with a remote person, the upper branch, which includes in the first AVC system 55 a module 57 for encoding video content from the camera 52, a module 58 for encoding audio data from the microphone 51 and a multiplexer 59, is active. The second AVC system 56 includes a demultiplexer 60, a video decoder 61 connected to a television set 62 or similar display device and an audio decoder 63 connected to at least one speaker 64. When the connection between the AVC systems 55,56 is not open, the SAC systems 49,53 run continuously. In Fig. 6, only one direction of communication is shown. In the preferred embodiment the system is a two-way system with identical return path from the far-end location.

The system comprising the SAC systems 49,53 can be seen as an audio-visual communication system which compresses the data from camera and microphone signals to a code representing the activity status and availability of the user. It is one of the features of the embodiment illustrated in Figs. 6-9 that activity and availability of a user are separate attributes related to the user and his/her environment. Availability naturally depends on the activity but it also depends on the contact. This makes the illustrated SAC system fundamentally different from availability systems used, for example, in popular IM (Instant Messenger) or VoIP (Voice over Internet Protocol) applications.

In one embodiment, the activity and availability information is continuously transmitted from the first SAC system 49 to the second SAC system 53, where it is rendered using a low-power rendering device. For example, the availability may be represented by light patterns in a connected photo frame device, such as the photo frame 11 of the embodiment of Fig. 1. It is also possible to show the availability data at the far-end only when the far-end user requests it, e.g., by trying to open the connection with the AVC systems 55,56. In one embodiment, when the far-end user tries to initiate a call to the user, the information available from the SAC systems 49,53 causes either a call to be requested or opened, or a call to be rejected because the user is not available for an active conversation. In one embodiment, when the SAC systems 49,53 in a two-way communication indicate that certain conditions related to the activity and availability at both ends are met, the system automatically opens the AVC connection between the two locations. In all embodiments, the system signals the users at both ends that opening a connection with the AVC system 55,56 is possible between the users due to a matching status and availability situation. Thus, in this case, the status and availability information is used to detect when matches indicative of particular contextual similarities occur. This is based on data derived by analyzing and extracting descriptive features from audio and/or video data, as will be explained.

The first SAC system 49 includes an audio-visual activity and availability coding module 50. The audio -visual activity and availability coding module 50 (see also

Fig. 7) implements a set of algorithms for compressing data received from the camera 52 and microphone 51 into a small number of code vectors representing predefined activity and availability features. For example, in the availability system for the kitchen environment the activity codes represent activities such as preparation of a meal, eating, washing dishes. Each activity is associated with a code representing the availability of the user for engaging in communication with the far-end user using the AVC system.

In one embodiment, the first SAC system 49 includes the functional components drawn in Fig. 7 and implements a method as illustrated in Fig. 8. An audio event detection system 65 detects audio events corresponding to activities of the user (step 66). One of the insights on which the methods detailed herein is based is that many interesting activities in user's environment produce sequences of recognizable audio events that can be detected from a microphone signal.

The microphone signal is the input to audio analysis algorithms which generate event tags of the audio events that were caused by the user. Examples of audio event tags include a tag associated with the sound of cutlery used in the kitchen, the sound of a pan, the sound of the tap, etc. The sequences of audio and video outputs are used by a reasoning algorithm to infer with what activity (or availability) the current sequence of audio and video outputs is associated.

The audio events are generated by recognizing sounds that often occur due to the activities of the user. To this end, first the microphone signal is partitioned in short segments from which features are selected. Although many types of features are conceivable for this purpose, Mel Frequency Cepstrum Coefficients have been shown to be very useful for recognizing common audio events.

In order to recognize common sounds, first a classifier model needs to be trained. For this, a k-means clustering method is applied to the feature vectors that are extracted while the user is performing activities. In this way clusters are generated in the feature space that resembles often occurring sounds. These clusters are characterized by a mean feature vector value, and a covariance of all the feature vectors that belong to the cluster. A Gaussian multivariate probability density function is defined according to these defining characteristics of the cluster.

When a sound that needs to be recognized is produced after the training, feature vectors per frame are compared against the probability density functions associated with each cluster, and the cluster that gives the largest average log-likelihood for a given observation interval is taken as the audio event that is generated. Note that the audio events are not given any name, but are presented at the output of the audio analysis just as numbers.

The main function of the camera 52 is the detection of the presence and the localization of the user. The camera 52 provides video input to the video analysis algorithm. The video analysis generates information about the location where the user is as an output. At least the location of the user is detected (step 67) by a detection module 68. In an embodiment, it also computes features characterizing the motion activity level, pose, clothing, and even facial gestures of the user.

Although the video analysis could derive the location of the user in many different ways, the method that has been chosen is based on detecting local changes in the video signal across time. When the change at a certain location exceeds a certain threshold it is marked. All marked locations are taken together and the smallest rectangular that encloses all marked locations is then taken as the area where the user is situated. In order to generate location information output, the current location of the user needs to be compared against the trained model.

The trained model for the video analysis can be build up in several ways. One approach that has been taken is to apply a k-means clustering algorithm to the training data. The training data consist of vectors with the coordinates and sizes of the location area that were previously obtained. The clusters that are generated in this way are representative of locations where the user is situated most typically. Thus the system has no knowledge of the nature of the location, but can distinguish between relevant locations.

To generate the location information that is input to the reasoning, the limited number of clusters that have been derived in the training phase are compared against the current location estimate. The cluster that generates the largest overlap is taken as the location that is used as output. If there is no overlap, the closest cluster is taken.

The sequences of audio events and synchronized video-based features are accumulated (step 69) in a buffer 70, which may correspond to the duration of a few seconds. Finally, the sequences of audio and video events are encoded (step 71) in a quantization and identification module 72 using a vector quantiser which gives one output symbol (activity label) for each sequence of events. In the preferred embodiment this vector quantiser is based on detection of that one of a number of pre-trained hidden Markov models that gives the highest likelihood with the presented event sequence, thus identifying (step 73) the activity. Based on the audio and video events that are generated, a sequence of

'change '-events is generated, i.e. a change-event is generated only when the audio or video classification changes. However, after a fixed amount (e.g. 1 second) of time in which there has been no change in the audio or the video, a change-event is generated automatically.

The sequence of change events that are generated are fed to a Hidden Markov Model (HMM). For training, labels need to be presented to be able to recognize the audio and video change event sequences that correspond to being available or not being available. For the training the change event sequences need to be presented for activities that correspond to being available, and for activities corresponding to not being available.

Then, the activity label is mapped (step 74) to an availability symbol by a module 75 for mapping activity labels to availability symbols. The mapping between activity label and availability symbol is typically set by the user. The mapping may also be different for different contacts such that a person may be available for one contact and at the same time unavailable for another contact. Finally, the activity label and availability symbol are multiplexed (step 76), encrypted (step 77), and packed in a transmission data packet, which is typically a TCP/IP datagram, by a multiplexing and encryption module 78. The datagrams are then transmitted (step 79) to subscribers. These may be a user's currently open contacts.

Fig. 9 illustrates a receiver side system of status and availability communication, in particular details of an SAC decoder 80 as included in the second SAC system 53 of Fig. 6. In the SAC decoder 80, the received status and availability information is decrypted in a decryption and demultiplexing module 81. Output is provided at at least one terminal device 82-84. These can include a television set, audio system or lighting system, for example. Depending on the receiver's terminal devices 82-84 the activity label and availability symbol communicated to the SAC decoder 80 may be rendered in different ways. In one embodiment, for example, the activity label is converted to one of a set of stylized images representing activities by a module 85 for mapping the activity label to an activity representation, and displayed in a digital photo frame. At the same time, the availability symbol may be used to control the ambilight color of the same digital photo frame. The availability information is mapped to an availability symbol by a module 86 for mapping an availability symbol to an availability representation. Several different embodiments are described in the following section. Some application scenarios using the SAC use information about the activities and availabilities of users at two or more locations. This can be performed in the Availability and Activity Cross- Connector (AACC) 87 illustrated in Fig. 8. In one embodiment, the overall SAC system of Figs. 6-9 is used continuously and the information about activities, availability, or both are represented in a display device, lighting device, audio reproduction device, robotic or electrochemical rendering device or a combination of them.

In another embodiment, the SAC systems 49,53 are used in a subscription mode where the information is only transmitted from the user's system when a far end user subscribes to the information. In the most familiar case this subscription can be an attempt to make a call which would lead to receiving information about the activity and availability of the user.

In one embodiment, the data representation for the rendering of the availability information is selected at the far end based on the activity labels and availability symbols. However, in an alternative embodiment some elements of the activity reproduction information can originate from the near end system. For example, the availability information can be presented in the form of pre-recorded voice message or an image of the user.

In another embodiment, if the user's availability symbol indicates that the user is unavailable, the subscription is continued until the time when the user becomes available. The integrated communication system based on the SAC systems 49,53 and AVC systems 55,56 then notifies the far end user about a change in the availability. In one embodiment, when the users at both ends become available (as determined by the AACC system 87) the connection is opened automatically. In one embodiment, a feature is available according to which the activity and availability information is used to choose in which way the AVC systems 55,56 are used when the communication session is opened. For example, the activity and availability could be used to choose whether the communication session is opened with a video connection or as an audio-only call. This selection could depend on the current activity label of any of the participants, the social relation between the users, time-of-the-day, or some additional status information related to any of the users.

In yet another embodiment only the activity information is transmitted.

The activity information that is transmitted may also depend on the availability symbol. For example, when the user is not available, the activity label may be replaced by a generic activity label which only discloses that the user is not available, or even claims that the user is not at home. These settings may depend on the contact and is typically set by the user.

A variant of this embodiment has been tested in the following setup: a video camera and a microphone were placed in a kitchen and connected to a computer. The computer was provided with a computer program developed to enable the computer to identify a large number of sound events related to kitchen activities. The camera signal was used simultaneously to register the location of the local user in the kitchen. The audio events and the locations were trained to the system off-line with pre-recorded training material. The information about detected audio events and locations was used in a reasoning system based on hidden Markov models (HMMs) to combine individual audio event sequences and location to larger activities which in the kitchen setup included, among others, preparation of a meal, eating, washing dishes, etc. Moreover, each of these activities was associated with an availability measure in relation to a telephone device such that, for example, if the user was preparing the meal he/she was labeled being unavailable for a call but when the person was, for example, eating he/she was labeled available for a telephone call. In the demonstration, the incoming calls were then diverted to an answering machine or to the user automatically depending on the user's current activity. In addition, the current availability was also displayed in real-time at the remote side with a red/green light and a graphical illustration of the activity displayed in a picture frame.

In many popular Internet communication and community services including Instant Messengers (IMs), the availability status is typically set by the user, or it is derived by the computer from the usage activity. For example, an availability application is known, where the selection of the terminal device, for example, for Instant Messenger or VOIP telephony application where the user is available (or away or busy) is based on presence information determined by different means which include detection of keyboard and mouse usage and the use of different applications in the PC. In this case the user's availability is determined based on automatic analysis of user's activity. However, this only applied to availability during active computer usage and does not, for example, cover even the case when the user is focused on reading something from the screen or working on paper next to the PC.

The demonstrated availability system is actually a special form of audio -visual communication system. This system can be used in many different ways but is mainly for use in supporting and complementing another communication system, which may be, for example, a voice-only telephone or a high-end videotelephone system. The use of captured audio and video content in communication is of course known art already for a very long time. Many different techniques to encode the data for transmission are also known and several standards exist. The most relevant standards related to the current invention is MPEG-4 which contains various tools for real-time transmission of audio and video content. The proposed method can be implemented, for example, using the standardized MPEG-4 coding tools such as the BIFS framework.

A continuously open, or persistent, communication channel between the AVC systems 55,56 has some obvious benefits: It is easy to have long relaxed communication sessions with fluctuating activity level which is similar to natural communication between people who are physically present. The threshold for starting, stopping, and resuming active conversations is low because little effort is needed to reactivate the call.

Continuously open connection makes it possible to be aware of the other person's context (activity, mental state, social situation, plans) which helps to determine the right moment to communicate. This works in both ways.

In principle, a telephone or video telephone connection could be kept open continuously but it is usually not preferred by the user for several reasons:

When the user is not having an active conversation with the other, people are not used to seeing or hearing a remote person or to being seen or heard by the remote person.

The capture, transmission, and rendering of audio-visual content without active ongoing conversation is considered a waste of electrical energy and unnecessary use of network bandwidth. Many users prefer switching off devices that are not in use.

The user does not want to be heard or seen or to engage in communication with a remote person in certain situations

There are privacy issues. The user may be involved in an activity which makes communication difficult. Communication may distract the user from some other activity which requires attention (e.g., cooking)

Having a call open with a telephone or a video -telephone typically means that the user is not fully available for other local or remote persons who want to communicate with the user.

The system and method illustrated in Figs. 6-9 provides prompts for activating connections at appropriate moments, so that the connection between the AVC systems 55,56 need not be open permanently. It should be noted that the above-mentioned embodiments illustrate, rather than limit, the invention, and that those skilled in the art will be able to design many alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps other than those listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage. For example, in one embodiment, the system according to the present invention is implemented in a single device. In another, it comprises a service provider and a client. Alternatively, the system may comprise devices that are distributed and remotely located from each other.

A processor may execute a software program to enable the execution of the steps of the method of the present invention. The software may enable the apparatus according to the present invention independently of where the apparatus is being run. To enable the apparatus, the processor may transmit the software program to other (external) devices, for example. The independent method claim and the computer program product claim may be used to protect the invention when the software is manufactured or exploited for running on the consumer electronics products. External devices may be connected to the processor using existing technologies, such as Blue -tooth, IEEE 802.1 l[a-g], etc. The processor may interact with the external device in accordance with the UPnP (Universal Plug and Play) standard.

The present invention may be implemented using any of various consumer electronics devices such as a television set (TV set) with a cable, satellite or other link, a videocassette- or HDD- recorder, a home cinema system, a portable CD player, a remote control device such as a universal remote control, a cell phone, etc.

Claims

CLAIMS:

1. Method of communicating information relating to remote users, including: obtaining data derived by analyzing and extracting descriptive features from data including at least one of audio and video data obtained at remote locations (23,25,27); detecting when matches indicative of particular contextual similarities occur, based on the data derived by analyzing and extracting descriptive features, the contexts including at least one of: activities of users at the remote locations, environments of users at the remote locations and moods of users at the remote locations; and - making at least one remote user at a remote location aware of synchronous matching contexts.

2. Method according to claim 1, wherein the at least one remote user is made aware of synchronous matching contexts by causing information to be rendered at at least one of the remote locations (23,25,27).

3. Method according to claim 1, including a step of providing a suggestion to change behavior so as to create a shared experience, and thus maintain a feeling of connectedness.

4. Method according to claim 1, including detecting when particular contextual similarities occur by determining similarities in the audio-visual feature data directly.

5. Method according to claim 1, including receiving from the users an indication of agreement that specific activities should be determined to match in future.

6. Method according to claim 1, including, when a user starts carrying out a certain activity, informing the user when another user was last involved in the same activity.

7. Method according to claim 1, wherein detecting when matches indicative of particular contextual similarities occur is based on user preferences.

8. Method according to claim 1, wherein the data derived by analyzing and extracting descriptive features from data including at least one of audio and video data obtained at remote locations (23,25,27) are received at a first location, associated with one of the users, from a second location, associated with the other of the users, and the determination of similarity is carried out at the first location.

9. Method according to claim 1, wherein, when a contextual similarity is matched, a communication link is made automatically.

10. Method according to claim 1, wherein at least one of audio and video data from a camera (16;28;40,41;52) are used to register a location of a user.

11. Method according to claim 1 , wherein the data from which descriptive features are extracted further include data from at least one sensor (30) sensitive to some variable related to at least one of a user and an environment of the user.

12. Method according to claim 1, including obtaining activity data and availability data from locations associated with two of the remote users and at least one of: signaling that opening a connection for communication between the two locations is possible to the users, and opening the connection for communication between the two locations.

13. System for communicating information relating to remote users, including: a system (l;15,17;44,45,46,47;50) for obtaining data derived by analyzing and extracting descriptive features from at least one of audio and video data obtained at remote locations (23,25,27), wherein the system for prompting communication is configured to detect when matches indicative of particular contextual similarities occur, based on the data derived by analyzing and extracting descriptive features, the contexts including at least one of: activities of users at the remote locations, environments of users at the remote locations and moods of the users at the remote locations; and an interface (9) to a device (10,11;62,64; 82;83;84) at at least one of the remote locations for making at least one remote user at a remote location aware of synchronous matching contexts.

14. System according to claim 13, configured to carry out a method according to any one of claims 1-12.

15. Computer program including a set of instructions capable, when incorporated in a machine -readable medium, of causing a system having information processing capabilities to perform a method according to any one of claims 1-12.