WO2020207041A1 - System and method for dynamically recommending inputs based on identification of user emotions - Google Patents

System and method for dynamically recommending inputs based on identification of user emotions

Info

Publication number
WO2020207041A1
Authority
WO
WIPO (PCT)
Prior art keywords
user
input
module
gesture
content
Prior art date
Application number
PCT/CN2019/122695
Other languages
French (fr)
Inventor
Sumit Kumar Tiwary
Manoj Kumar
Yogiraj BANERJI
Govind JANARDHANAN
Tasleem Arif
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp., Ltd.
Priority to CN201980095244.3A (CN113785539A)
Publication of WO2020207041A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/07 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, characterised by the inclusion of specific contents
    • H04L51/10 Multimedia information
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L51/00 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail
    • H04L51/52 User-to-user messaging in packet-switching networks, transmitted according to store-and-forward or real-time protocols, e.g. e-mail, for supporting social networking services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M1/00 Substation equipment, e.g. for use by subscribers
    • H04M1/72 Mobile telephones; Cordless telephones, i.e. devices for establishing wireless links to base stations without route selection
    • H04M1/724 User interfaces specially adapted for cordless or mobile telephones
    • H04M1/72403 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality
    • H04M1/7243 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages
    • H04M1/72436 User interfaces specially adapted for cordless or mobile telephones with means for local support of applications that increase the functionality with interactive means for internal management of messages for text messaging, e.g. short messaging services [SMS] or e-mails
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N1/00 Scanning, transmission or reproduction of documents or the like, e.g. facsimile transmission; Details thereof
    • H04N1/0035 User-machine interface; Control console
    • H04N1/00352 Input means
    • H04N1/00381 Input by recognition or interpretation of visible user gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01 Indexing scheme relating to G06F3/01
    • G06F2203/011 Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/52 Details of telephonic subscriber devices including functional features of a camera

Definitions

  • the present invention encompasses systems and methods for dynamically recommending input contents based on identification of user emotions, during a communication session between users over a communication network.
  • At least one user input is received in real-time from a user via an input module.
  • At least one portion of an image of the user is also received using a camera module.
  • The at least one portion of the user image indicates at least one gesture of the user in real-time.
  • at least one expression associated with the at least one gesture of the user is identified.
  • At least one emotional data is determined based on the at least one expression and the at least one user input.
  • the at least one input content is determined.
  • the at least one input content is thereafter recommended to the user, for selecting and using the recommended at least one input content along with the at least one user input, during the communication session.
  • hardware includes a combination of discrete components, an integrated circuit, an application specific integrated circuit, a field programmable gate array, other programmable logic devices and/or other suitable hardware as may be obvious to a person skilled in the art.
  • software includes one or more objects, agents, threads, lines of code, subroutines, separate software applications, or other suitable software structures as may be obvious to a skilled person.
  • software can include one or more lines of code or other suitable software structures operating in a general-purpose software application, such as an operating system, and one or more lines of code or other suitable software structures operating in a specific purpose software application.
  • “application” or “applications” or “apps” are the software applications residing in respective electronic communication devices and can be either pre-installed or can be downloaded and installed in said devices.
  • the applications include, but are not limited to, contact management application, calendar application, messaging applications, image and/or video modification and viewing applications, gaming applications, navigational applications, office applications, business applications, educational applications, health and fitness applications, medical applications, finance applications, social networking applications, and any other application.
  • the application uses “data” that can be created, modified or installed in an electronic device over time. The data includes, but is not limited to, contacts, calendar entries, call logs, SMS, images, videos, factory data, emails and data associated with one or more applications.
  • “Couple” and its cognate terms, such as “couples” and “coupled”, include a physical connection (such as a conductor), a virtual connection (such as through randomly assigned memory locations of a data memory device), a logical connection (such as through logical gates of a semiconducting device), other suitable connections, or a combination of such connections, as may be obvious to a skilled person.
  • “electronic communication device” includes, but is not limited to, a mobile phone, a wearable device, a smart phone, a set-top box, a smart television, a laptop, a general-purpose computer, a desktop, a personal digital assistant, a tablet computer, a mainframe computer, or any other computer-implemented electronic device that is capable of making transactions of communication messages or data, as may be known to a person skilled in the art.
  • ‘expression’ of the user is detected through facial expressions, hand movements, fingers movements, thumbs movements, head movement, leg movement etc.
  • the various facial features of the user include movements of eyes, nose, lips, eyebrows, jaw movements etc.
  • the expression can be classified into any particular category by using state-of-the-art classifiers suitable for specific expression recognition.
  • emotional data is any data pertaining to the expression of the users and emotions of the user being inputted through a text, an audio, an icon or any image.
  • the emotional data is determined by analysing the received user input for example, words, sentences, phrases, GIF, images, icons, etc along with the user expression.
  • the emotional data can be classified into any particular type and degree by using predefined categories of human emotions, wherein the predefined categories of human emotions may be stored locally on the device or on one or more remote servers.
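As a purely illustrative sketch of such predefined categories of human emotions (the table, names and degrees below are hypothetical assumptions, not values defined in the publication), a device-local category table with a simple lookup could look like this; a real system might refresh this table from a remote server:

```python
from enum import Enum

class EmotionType(Enum):
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"
    SLEEPY = "sleepy"

# Hypothetical on-device table of predefined emotion categories and their degrees.
LOCAL_EMOTION_CATEGORIES = {
    EmotionType.HAPPY: ["low", "medium", "high", "very high"],
    EmotionType.SAD: ["low", "medium", "high", "very high"],
    EmotionType.ANGRY: ["low", "medium", "high", "very high"],
    EmotionType.SLEEPY: ["low", "medium", "high"],
}

def degrees_for(emotion: EmotionType) -> list[str]:
    """Return the allowed degrees for an emotion type, or an empty list if unknown."""
    return LOCAL_EMOTION_CATEGORIES.get(emotion, [])

print(degrees_for(EmotionType.ANGRY))  # ['low', 'medium', 'high', 'very high']
```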
  • depth camera is a type of camera that is capable of capturing the depth information of any scene or a video frame that is being received as an input.
  • Fig. 1 illustrates a system architecture for dynamically recommending at least one input content to a user in a communication session over a communication network, in accordance with exemplary embodiments of the present disclosure.
  • the system [100] comprises a data managing module [102] , a profile managing module [104] , a dynamic content module [106] , a messaging application module [112] , a processing module [110] , a camera module [108] , an input module [118] , an emotion detection module [114] and an expression identification module [116] .
  • the messaging application module [112] initiates any communication session between the users.
  • the messaging application module [112] is configured to initiate one or more third party messaging applications, social networking applications, instant messenger applications, online chat applications running on any portals etc., that require the users to input text, voice, or image and accordingly convey their messages during the respective communication sessions.
  • the messaging application module [112] may also be triggered by installation of any input devices including keyboard, mouse, joysticks etc.
  • the messaging application module [112] may also be triggered by a touch input received from the user.
  • the messaging application module [112] is communicably coupled to the processing module [110] and makes a request to the processing module [110] for identifying the expression and emotions of the user.
  • the processing module [110] is coupled to the expression identification module [116] and the emotion detection module [114] , which respectively identify the expression and the emotional data from the inputs being received.
  • the expression identification module [116] receives at least one portion of the image of the user that indicates at least one gesture of the user.
  • the expression identification module [116] identifies the at least one expression associated with the at least one gesture of the user.
  • the emotion detection module [114] identifies at least one emotional data based on the combination of at least one expression and the at least one user input.
  • the emotional data is further used by the processing module [110] to determine at least one input content that is recommended to the user via a display module.
  • the at least one input content may include but is not limited to an icon, an emoticon, an image, a sticker, a text, a set of texts, a word, a set of words etc. Therefore, the user may implement the recommended at least one input content in the transactions of the messages, using the messaging applications being executed through the respective device.
  • the messaging application module [112] invokes the processing module [110] by sending a request to analyze the at least one user input and the at least one portion of an image of the user.
  • the processing module [110] thereafter seeks the at least one portion of the image of the user from the camera module [108] and the at least one user input from the input module [118] , and accordingly determines the at least one input content to be recommended to the user based on the emotional data.
  • the processing module [110] therefore acts as a context analyser that analyses and processes the received inputs to identify the emotional context associated with the same.
  • the processing module [110] is coupled to the camera module [108] and the input module [118] for receiving the user image and the at least one user input respectively.
  • the camera module [108] may include a depth camera that is capable of capturing depth information of the real-time image of the user.
  • the user image received using the camera module [108] comprises the at least one portion indicating at least one gesture of the user in real-time.
  • the processing module [110] is communicably coupled to the expression identification module [116] and the emotion detection module [114] that are supported by Artificial Intelligence (AI) tools and frameworks to respectively identify expressions and emotions of the users while the user is typing any input, or making any gestures during any ongoing communication session.
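To make the data flow between these modules concrete, here is a minimal, hypothetical sketch (the class names, the keyword lexicon and the fusion rule are illustrative assumptions, not the publication's API) of a context analyser that combines the typed input with the identified expression to form an emotional data record; a real implementation would rely on trained AI models rather than keyword matching:

```python
from dataclasses import dataclass

# Tiny keyword lexicon standing in for AI-based text emotion analysis.
TEXT_EMOTION_HINTS = {
    "angry": ["how dare", "shut up", "go away"],
    "happy": ["great", "awesome", "good"],
    "sad": ["not good", "miss you", "sorry"],
}

@dataclass
class Expression:
    label: str        # e.g. "angry face", "happy face"
    intensity: float  # 0.0 .. 1.0 reported by the expression classifier

@dataclass
class EmotionalData:
    emotion_type: str  # e.g. "angry"
    degree: str        # e.g. "medium", "high", "very high"

def analyse_context(user_input: str, expression: Expression) -> EmotionalData:
    """Fuse the typed text and the facial expression into one emotional data record."""
    text = user_input.lower()
    text_emotion = "neutral"
    for emotion, hints in TEXT_EMOTION_HINTS.items():
        if any(hint in text for hint in hints):
            text_emotion = emotion
            break
    # If the face and the text agree, boost the degree; otherwise trust the face.
    face_emotion = expression.label.split()[0]  # "angry face" -> "angry"
    if face_emotion == text_emotion and expression.intensity > 0.7:
        degree = "very high"
    elif expression.intensity > 0.5:
        degree = "high"
    else:
        degree = "medium"
    return EmotionalData(emotion_type=face_emotion, degree=degree)

print(analyse_context("How dare you?", Expression("angry face", 0.9)))
# EmotionalData(emotion_type='angry', degree='very high')
```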
  • the input module [118] provides the at least one user input when the third-party messaging applications are invoked.
  • the at least one user input may include a word, a set of texts, sentences, phrases, voice data, stored images, live images, media etc.
  • the camera module [108] may receive the image of the user in real-time, wherein at least one portion of the user image indicates the at least one gesture of the user in real-time.
  • the at least one gesture is analysed by the expression identification module [116] to identify the expression of the user being conveyed through his/her gesture.
  • the emotion detection module [114] and the expression identification module [116] respectively receive the at least one user input and the at least one portion of the image and continuously analyse the received at least one user input and the at least one portion of the image to identify any change in the expression being conveyed by the user.
  • any change in the at least one user input or in the at least one gesture is simultaneously detected by the processing module [110] while receiving the at least one user input and the at least one portion of the image of the user.
  • the change in the at least one user input and the at least one gesture of the image of the user corresponds to a change in the mood of the user.
  • for example, when the user inputs the sentence “My trip was not good” , the emotion detection module [114] analyses that the user’s mood is ‘sad’ . However, when the user changes the input by deleting the word ‘not’ so that the sentence reads “My trip was good” , the emotion detection module [114] identifies that the user’s mood is ‘happy’ .
  • the changes can also be recorded in the event of any changes in the user’s gestures being made in real-time.
  • for example, the user may change the facial expression from an angry face to a happy expression. Therefore, any change in the expression is detected and accordingly analysed by the expression identification module [116] , which may be supported by artificial intelligence tools and mechanisms.
  • the expression identification module [116] is capable of identifying any change in the expression and accordingly determining the current mood of the user.
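A minimal sketch of this change tracking, assuming a simple callback-driven re-evaluation on every edit (the helper names and the toy classifier are hypothetical, not the publication's algorithm):

```python
from typing import Callable

def make_mood_tracker(classify: Callable[[str], str],
                      on_change: Callable[[str], None]) -> Callable[[str], None]:
    """Return a function that re-classifies the mood whenever the typed text changes
    and fires on_change only when the detected mood actually differs."""
    last_mood = {"value": None}

    def observe(current_text: str) -> None:
        mood = classify(current_text)
        if mood != last_mood["value"]:
            last_mood["value"] = mood
            on_change(mood)

    return observe

# Toy classifier standing in for the emotion detection module.
def toy_classify(text: str) -> str:
    return "sad" if "not good" in text.lower() else "happy"

observe = make_mood_tracker(toy_classify, lambda mood: print("mood changed to:", mood))
observe("My trip was not good")  # mood changed to: sad
observe("My trip was good")      # mood changed to: happy
```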
  • the processing module [110] also invokes the camera module [108] for receiving the image feed to be analyzed and processed and subsequently interacts with the expression identification module [116] for extracting the data and generating an expression-based system event.
  • the at least one user input is received and the at least one expression associated with the at least one gesture of the user is identified.
  • the at least one emotional data is identified by the emotion detection module [114] .
  • the processing module [110] determines the at least one input content by analyzing and processing the emotional data determined from the at least one user input along with the at least one expression.
  • the processing module [110] also requests the emotion detection module [114] to obtain the type and degree of the at least one emotional data.
  • the type of the at least one emotional data may include Happy, Sad, Angry, Sleepy etc.
  • the degree of the emotional data type may include the intensity of the mood type, such as Low, Medium, High, Very High, etc.
  • the type and degree of the at least one emotional data may be processed further to accurately determine the at least one input content that can subsequently be recommended to the user.
  • the user is typing “How dare you? ” and simultaneously, the camera module [108] captures the image of the user with an angry face.
  • the expression of the user is “angry face” .
  • the “angry face” expression along with the “How dare you? ” input are analysed together to determine an emotion data.
  • the emotional data may therefore be determined as “very angry” in this case.
  • the at least one input content suggested to the user may be a red-faced emoticon indicating an angry expression.
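As a purely illustrative sketch of this last step, mapping the detected (type, degree) pair onto a recommended input content such as an emoticon (the table below is a hypothetical example, not a mapping defined in the publication):

```python
# Hypothetical (emotion type, degree) -> recommended content table.
RECOMMENDATION_TABLE = {
    ("angry", "very high"): "\U0001F621",  # red, pouting face
    ("angry", "medium"): "\U0001F620",     # angry face
    ("happy", "high"): "\U0001F604",       # grinning face with smiling eyes
    ("sad", "high"): "\U0001F622",         # crying face
}

def recommend_content(emotion_type: str, degree: str) -> str:
    """Pick an input content for the detected emotional data, with a neutral fallback."""
    return RECOMMENDATION_TABLE.get((emotion_type, degree), "\U0001F610")  # neutral face

print(recommend_content("angry", "very high"))  # red-faced (pouting) emoticon
```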
  • the data managing module [102] is configured to manage the on-device data pertaining to the communication sessions conducted by the user through any electronic communication device(s).
  • the data managing module [102] may be located on the device.
  • the data managing module [102] performs the function of managing the data, including storing the data, formatting any text inputs, etc. It also stores any pre-processed data and performs the specific transactions of any data between corresponding modules within the system [100] .
  • the data managing module [102] also receives and stores the profile of the user as well as the contact information of a plurality of the user’s contacts. Each of the user’s contacts may have a certain degree of affinity with the user, and based on the degree of affinity, the user may use different types of contents while inputting any message during a communication session.
  • the data managing module [102] uses the users’ profile and contact information to accordingly manage the data pertaining to the communication sessions.
  • the data managing module [102] also performs formatting of the at least one user input based on at least one of: the emotional data, the at least one expression and the at least one user input.
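The formatting step could, for instance, wrap the typed text in simple style decorations derived from the emotional data; the sketch below is one hypothetical way to do it and is not taken from the publication:

```python
def format_input(text: str, emotion_type: str, degree: str) -> str:
    """Apply a simple, emotion-dependent text decoration (illustrative only)."""
    if emotion_type == "angry" and degree in ("high", "very high"):
        return text.upper() + "!!"    # shout the message
    if emotion_type == "happy":
        return text + " \U0001F60A"   # append a smiling face
    if emotion_type == "sad":
        return text + " ..."          # trail off
    return text

print(format_input("how dare you", "angry", "very high"))  # HOW DARE YOU!!
```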
  • the data managing module [102] is communicatively coupled to the profile managing module [104] for receiving the profile information of the user. Further, the profile managing module [104] also provides the user profile information to various third-party applications.
  • the profile managing module [104] uses various data, for example, call log data, message content data, and other types of data to determine and create the user profile with respect to other senders or receivers.
  • the profile information can also be used to customize the formatting options and accordingly generate personalized content for the user’s specific contacts and friends list.
  • the profile information can also be used to filter any content with respect to the specific contacts and friends list.
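A small sketch of such affinity-based filtering, assuming a per-contact affinity score kept by the profile managing module (the threshold, the fields and the "formality" attribute are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class ContentOption:
    content: str
    formality: str  # "casual" or "formal" (hypothetical attribute)

def filter_for_contact(options: list[ContentOption], affinity: float) -> list[ContentOption]:
    """Offer casual content (playful stickers, memes) only to close contacts;
    restrict low-affinity contacts such as business colleagues to formal content."""
    if affinity >= 0.7:
        return options  # close friends and family: show everything
    return [o for o in options if o.formality == "formal"]

options = [ContentOption("\U0001F923", "casual"), ContentOption("\U0001F44D", "formal")]
print([o.content for o in filter_for_contact(options, affinity=0.3)])  # thumbs-up only
```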
  • the dynamic content module [106] is coupled to the data managing module [102] and the messaging application module [112] through the profile managing module [104] .
  • the dynamic content module [106] dynamically searches for any data or content from any in-built storage module or database of the local device, and provides the same to the profile managing module [104] as well as to the data managing module [102] .
  • the searched data is further used by the messaging application module [112] in the communication sessions being conducted between the users.
  • the dynamic content module [106] also searches for any online data from various network servers located across any local or remote networks, for example; LAN, WAN, the Internet etc.
  • the system [100] is configured to receive at least one user input in real-time and also a real-time image of the user, wherein at least one portion of the image indicates gestures of the user.
  • the expression identification module [116] identifies at least one expression associated with the at least one gesture of the user.
  • the emotion detection module [114] identifies at least one emotional data based on the at least one expression and the at least one user input. The emotional data is used to determine the at least one input content that is thereby recommended to the user via a display module. The user is also prompted to select and use the at least one input content in combination with the at least one user input, during the communication session.
  • the display module comprises various elements including at least one of: a touch screen, any display screen, a graphical user interface module, etc.
  • Fig. 2 is a block diagram illustrating the system [100] elements for providing expression identification and tracking, in accordance with exemplary embodiments of the present disclosure.
  • the messaging application module [112] initiates the communication session for the user, and invokes the processing module [110] by sending a request to analyze the user inputs. Subsequently, the processing module [110] seeks inputs from the camera module [108] and once the processing module [110] receives the image of the user, the processing module [110] transmits the image of the user to the expression identification module [116] .
  • the image captured by the camera module [108] includes at least one portion of the user image that indicates at least one gesture of the user in real-time.
  • the expression identification module [116] continuously tracks the at least one gesture of the user as indicated by the user image, and analyses the same to identify the expression of the user being conveyed through his/her gesture. The expression identification module [116] further analyses any change in the expression of the user. The expression identification module [116] also identifies the types and degrees of the expression, for example happy, sad, angry, etc., and very happy, very sad, notably angry, etc.
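One way to turn a raw classifier confidence into the type-and-degree labels mentioned above is simple bucketing; the following sketch assumes a hypothetical classifier output of (label, score) and is not the publication's algorithm:

```python
def expression_with_degree(label: str, score: float) -> str:
    """Bucket a classifier confidence into a degree qualifier for the expression label."""
    if score >= 0.9:
        degree = "very"
    elif score >= 0.7:
        degree = "notably"
    elif score >= 0.4:
        degree = "somewhat"
    else:
        degree = "slightly"
    return f"{degree} {label}"

print(expression_with_degree("angry", 0.75))  # notably angry
print(expression_with_degree("happy", 0.95))  # very happy
```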
  • the messaging application module [112] is an application framework that along with the camera module [108] and the processing module [110] , supports execution of various software applications including at least one messaging application or any other software applications.
  • the messaging application module [112] may support a gaming application installed in the user device in which the user as a game player, may express different expressions while playing a game and accordingly sends to the other players, the messages related to the ongoing game.
  • the user may also express their emotions by using the input module [118] comprising at least one of: a keyboard, a mouse, a joystick, etc.
  • Fig. 3 is a block diagram illustrating the system [100] elements for identifying emotional data, in accordance with exemplary embodiments of the present disclosure.
  • the processing module [110] provides the received inputs, viz the at least one user input and the at least one portion of the user image to the emotion detection module [114] .
  • the emotion detection module [114] continuously tracks the user inputs received via the input module [118] and the camera module [108] , to identify any change in the expression being conveyed through said inputs.
  • the emotion detection module [114] analyses the change and accordingly identifies the current emotion of the user in the form of at least one emotional data.
  • the processing module [110] requests the emotion detection module [114] to provide the emotional data including the type and degree of the user’s emotion.
  • the emotion detection module [114] accordingly sends the at least one emotional data to the processing module [110] for determining the at least one input content that may be thereafter recommended to the user.
  • the at least one input content for example: an emoticon or a smiley, is used by the user during the communication sessions being conducted by the messaging application module [112] .
  • Fig. 4 is a block diagram illustrating the system [100] elements performing actions based on available history and profile information, in accordance with exemplary embodiments of the present disclosure.
  • the profile managing module [104] , the data managing module [102] and the dynamic content module [106] interact with each other to manage the data pertaining to the communication sessions between the users. The data is managed based on profile information and the history of user actions and call logs during several communication sessions.
  • the messaging application module [112] interacts with the profile managing module [104] , the data managing module [102] and the dynamic content module [106] for obtaining the profile information and call log history of the user, and any information regarding any updated profile information and degree of affinity with the other users.
  • the processing module [110] updates the at least one input content and recommends the same to the user. Therefore, the user may be recommended the at least one input content based on the available history and profile information.
  • Fig. 5a, 5b and 5c illustrate a scenario wherein the user is given an indication of the real-time image being fully or partially captured by the camera module [108] , in accordance with exemplary embodiments of the present disclosure.
  • An electronic communication device [506] having a camera module [108] , a display screen [508] and a user interface [510] is shown in Figures 5a, 5b and 5c.
  • the at least one portion of an image [502] of the user is received by the camera module [108] .
  • the at least one portion indicates at least one gesture of the user in real-time.
  • Figure 5a illustrates the at least one gesture of the user being indicated by the facial expression of the user.
  • the expression identification module [116] identifies the at least one expression associated with the at least one gesture of the user.
  • the expression of the user may not be detected.
  • the face of the user is not fully covered by the shaded region [504] .
  • the shaded region [504] indicates the coverage of the user body part by the camera module [108] .
  • the face of the user is not adequately captured by the camera module [108] . Therefore, in such an event the camera module [108] may not detect the user’s expression.
  • an indication is provided on the display screen [508] as to whether or not the at least one portion of the real-time image [502] is being adequately captured by the camera module [108] .
  • the user can accordingly adjust the communication device [506] to a suitable angle or control the camera angle such that an adequate portion of the image is covered by the camera module [108] and thereby trigger the process of recommendation of input content by the processing module [110] .
  • Figure 5b shows an exemplary scenario, wherein depending upon the coverage region [504] of face area, the user is facilitated to control the initiation and execution of the features as disclosed in the proposed invention.
  • the camera module [108] tries to capture the face of the user indicating at least one gesture of the user in real-time.
  • the user may also make the at least one gesture through his/her at least one body part including facial expressions, hand movements, finger movements, etc.
  • the at least one gesture captured by the camera module [108] is used to detect any expression of the user.
  • the camera does not fully capture the at least one body part of the user, i.e. his face.
  • the camera module [108] is unable to detect the expression of the user as the complete face of the user is not captured.
  • the user is given a first indication [512] via the display screen [508] or the display module that the complete face of the user is not being detected by the camera, and the user is also prompted through the GUI (graphical user interface) to adjust the camera to an accurate position or angle, to enable the identification of accurate emotions of the user. If the user moves the face while typing the user input (to cover the whole face for expression detection) , then the emotion-based formatting can be performed accurately without breaking the typing flow. This provides an advantage over the existing systems and methods which require user intervention for formatting the text inputs manually.
  • Figure 5c shows complete coverage of the face of the user that enables the identification of accurate emotions of the user.
  • a second indication [514] is given to the user for complete coverage of the at least one portion of the user image that is showing any gesture to indicate the current emotion of the user. Accordingly, the user types the inputs that are further analysed along with the expression of the face, and the real time input based on the user’s emotion is thereby recommended to the user.
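The adequacy check behind the first and second indications can be expressed as a coverage test on the detected face region; the sketch below assumes a hypothetical face detector that returns a bounding box and simply compares the visible area of that box against its full area (the threshold and message strings are illustrative assumptions):

```python
from dataclasses import dataclass

@dataclass
class Box:
    left: int
    top: int
    right: int
    bottom: int

def visible_fraction(face: Box, frame_width: int, frame_height: int) -> float:
    """Fraction of the detected face bounding box that lies inside the camera frame."""
    face_area = max(0, face.right - face.left) * max(0, face.bottom - face.top)
    if face_area == 0:
        return 0.0
    vis_w = max(0, min(face.right, frame_width) - max(face.left, 0))
    vis_h = max(0, min(face.bottom, frame_height) - max(face.top, 0))
    return (vis_w * vis_h) / face_area

def coverage_indication(face: Box, frame_width: int, frame_height: int) -> str:
    """Return the on-screen hint: prompt to adjust the camera, or confirm full coverage."""
    if visible_fraction(face, frame_width, frame_height) < 0.95:
        return "Adjust the camera: your face is not fully visible"   # first indication
    return "Face detected: expressions are being tracked"            # second indication

# Face partially outside a 480x640 preview frame.
print(coverage_indication(Box(left=300, top=-80, right=520, bottom=200), 480, 640))
```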
  • Fig. 6 illustrates a scenario wherein the user is prompted to use an updated input based on any change in the emotional data, in accordance with exemplary embodiments of the present disclosure.
  • the user is typing a text with multiple types of emotions in a single text while also expressing different emotions.
  • the change in the expression is accordingly analysed by the expression identification module [116] in real-time and the updated at least one input content is thereby determined.
  • the user is automatically and dynamically prompted to use the updated at least one input content.
  • the expression identification module [116] and the emotion detection module [114] continuously track the gesture of the user and the at least one user input in real-time to identify any changes and subsequently determine the updated at least one input content for the user.
  • the invention encompasses dynamic segmentation of the received user inputs for providing a smooth analysis of multiple emotions expressed by the user through the text input along with the gestures, without breaking the experience of the user during the communication session.
  • Fig. 7 illustrates a scenario wherein the user is prompted to select from multiple options of the at least one input content based on the changes detected in the emotional data, in accordance with exemplary embodiments of the present disclosure.
  • the multiple emotion content may include various options such as image, GIF or video etc. that can have two or more segments depicting different layers of expression.
  • one GIF content may begin with a funny part and end with an angry part.
  • an image may depict a conversation where the top conversation in the image may be sad, followed by the bottom part of the image ending with an angry conversation.
  • the user can use a one-finger swipe gesture over the side icon to view single-emotion content, or a two-finger swipe gesture to view multiple-emotion content from the side icon.
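A toy dispatcher for the side-icon gesture described above (the event shape, i.e. a bare finger count, is an assumption made purely for illustration):

```python
def handle_side_icon_swipe(finger_count: int) -> str:
    """Map the swipe gesture on the side icon to the content view to open."""
    if finger_count == 1:
        return "open single-emotion content"
    if finger_count == 2:
        return "open multiple-emotion content"
    return "ignore gesture"

print(handle_side_icon_swipe(2))  # open multiple-emotion content
```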
  • Fig. 8 is a flowchart illustrating the method for dynamically recommending at least one input content to a user, in a communication session, over a communication network, in accordance with exemplary embodiments of the present disclosure.
  • at least one user input is received in real-time.
  • the at least one user input includes but is not limited to a text input, a speech input, a video input and an image input.
  • at least one portion of an image of the user is also received, wherein the at least one portion indicates at least one gesture of the user in real-time.
  • the at least one gesture includes but is not limited to facial expression and a behaviour pattern of the user.
  • a processing module [110] is configured to receive the at least one user input and the at least one portion of the user image via the input module [118] and the camera module [108] respectively.
  • At step 804 at least one expression associated with the at least one gesture of the user is identified.
  • the at least one expression is identified by the expression identification module [116] that is communicably coupled to the processing module [110] .
  • the processing module [110] is also communicably coupled to the emotion detection module [114] that is configured to identify at least one emotional data based on the at least one expression and the at least one user input.
  • the emotional data pertains to at least one type of human emotion, wherein the at least one type of human emotion has at least one degree.
  • the at least one user input may also be formatted based on at least one of: the emotional data, the at least one expression and the at least one user input.
  • the formatting of the at least one input may also be based on the degree of affinity of the user with the other users. For example, the user may use different contents while communicating with his/her family and friends or with business colleagues.
  • the information pertaining to the degree of affinity may be stored by the data managing module [102] and the profile managing module [104] .
  • an indication is also provided to the user, via the display module, whether or not the at least one portion of the real-time image is being adequately captured by the camera module [108] . In the event the camera module [108] is unable to capture the gesture of the user, the user may adjust the angle of his/her device such that the at least one portion of the user image, that indicates the gesture of the user, may be adequately captured and inputted to the processing module [110] .
  • the at least one input content is determined for the at least one user input, based on the at least one emotional data.
  • the at least one input content is determined by the processing module [110] .
  • the at least one input content includes, but is not limited to, an icon, an emoticon, a video, an audio, a graphic interchange format (GIF) content, and an image.
  • the at least one input content is recommended to the user via the display module, to select and use it in combination with the at least one user input, during the communication session.
  • the user is further recommended to replace the at least one user input with the at least one input content, in the event the user selects the at least one input content that is being displayed.
  • the at least one input content is recommended to the user while the user is typing the at least one user input in at least one text message during the communication session.
  • the at least one gesture of the user is continuously tracked in real-time to identify any changes. Accordingly, the emotional data is updated based on the changes identified in the at least one gesture and the user is recommended an updated at least one content based on the updated emotional data.
  • the present invention encompasses the recommendation of the at least one input content further based on segmentation of the user inputs received from the user, as illustrated in the sketch following these steps.
  • a beginning and an end of the at least one gesture received from the user is marked.
  • the at least one user input is automatically segmented to create one or more segments.
  • a corresponding emotional data is identified for each segment of the one or more segments based on the at least one gesture and the at least one user input.
  • the user is recommended the at least one input content based on the corresponding emotional data for each segment of the at least one user input.
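A compact sketch of the segmentation flow described in the preceding steps, assuming gesture events are time-aligned to character offsets in the typed text (the event format, the emoji table and the per-segment mapping are illustrative assumptions, not the publication's method):

```python
from dataclasses import dataclass

@dataclass
class GestureEvent:
    start: int       # character offset where the gesture began
    end: int         # character offset where the gesture ended
    expression: str  # e.g. "happy", "angry"

EMOJI = {"happy": "\U0001F604", "angry": "\U0001F620", "sad": "\U0001F622"}

def recommend_per_segment(text: str, gestures: list[GestureEvent]) -> list[tuple[str, str]]:
    """Split the user input at gesture boundaries and pair each segment with a
    recommended content derived from the gesture observed while it was typed."""
    recommendations = []
    for g in gestures:
        segment = text[g.start:g.end].strip()
        if segment:
            recommendations.append((segment, EMOJI.get(g.expression, "\U0001F610")))
    return recommendations

text = "The movie started out so funny but the ending made me furious"
gestures = [GestureEvent(0, 30, "happy"), GestureEvent(30, len(text), "angry")]
for segment, content in recommend_per_segment(text, gestures):
    print(segment, "->", content)
```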
  • Fig. 9 illustrates the framework [900] supporting the execution of the exemplary embodiments of the present invention.
  • the system and the method of the present invention may be implemented on various compatible frameworks such as the Android framework, the iOS framework, etc., that may include in-built AI chips (NPU, Neural Processing Unit) and faster machine learning technologies.
  • the expression detection and emotion identification may be coupled to the AI service framework, the depth camera and one or more Dot projectors provided in the communication devices.
  • the implementation of the AI framework facilitates easy identification of minor and subtle changes in the expression of the user, for example, minor changes in the face movement or facial expression of the user.
  • the implementation of the AI framework also facilitates better accuracy in tracking the user’s expression.
  • the various modules as disclosed herein, including the processing module [110] may be associated with at least one processor configured to perform data processing, input/output processing, and/or any other functionality that enables the working of the system [100] in accordance with the present disclosure.
  • a “processor” refers to any logic circuitry for processing instructions.
  • a processor may be a special purpose processor or a plurality of microprocessors, wherein one or more microprocessors may be associated with at least one controller, a microcontroller, Application Specific Integrated Circuits (ASICs) , Field Programmable Gate Array (FPGA) circuits, and any other type of integrated circuit (IC) , etc.
  • the at least one processor may be a local processor present in the vicinity of the system [100] .
  • the at least one processor may also be a processor at a remote location that processes the method steps as explained in the present disclosure.
  • the processor is also configured to fetch and execute computer-readable instructions and data stored in a memory or a data storage device.
  • the database may be implemented using a memory, any external storage device, an internal storage device for storing instructions to be executed, any information, and data, used by the system [100] to recommend the input options to a user during a communication session.
  • a “memory” or “repository” refers to any non-transitory media that stores data and/or instructions that cause a machine to operate in a specific manner.
  • the memory may include a volatile memory or a non-volatile memory.
  • Non-volatile memory includes, for example, magnetic disk, optical disk, solid state drives, or any other storage device for storing information and instructions.
  • Volatile memory includes, for example, a dynamic memory.
  • the memory may be a single or multiple, coupled or independent, and encompasses other variations and options of implementation as may be obvious to a person skilled in the art.
  • the processor, memory, and the system [100] are interconnected to each other, for example, using a communication bus.
  • the “communication bus” or a “bus” includes hardware, software and communication protocol used by the bus to facilitate transfer of data and/or instructions.
  • the communication bus facilitates transfer of data, information and content between these components.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The present invention relates to a system [100] and a method for dynamically recommending at least one input content to a user in a communication session, over a communication network. At least one user input and at least one gesture of the user is received in real‐time. Thereafter, at least one emotional data is identified based on: at least one expression associated with the at least one gesture of the user, and the at least one user input. Based on the at least one emotional data, the at least one input content is determined and thereby recommended to the user to use the same during the communication session.

Description

SYSTEM AND METHOD FOR DYNAMICALLY RECOMMENDING INPUTS BASED ON IDENTIFICATION OF USER EMOTIONS

TECHNICAL FIELD
The present invention relates generally to management of user inputs in electronic communication sessions, and more particularly, to a system and a method for dynamically recommending input contents based on identification of user emotions.
BACKGROUND
This section is intended to provide information relating to the general state of the art and thus any approach/functionality described hereinbelow should not be assumed to be qualified as prior art merely by its inclusion in this section.
With advanced development of electronic communication devices, it has become possible for humans to establish communication sessions, via various calling and messaging applications, and convey their emotions while communicating with each other. The various calling and/or messaging applications or ‘apps’ are either pre‐installed in said electronic devices or can be downloaded and installed according to user requirements and preferences.
A typical electronic communication device, hereinafter ‘device’ , such as a mobile phone, a tablet device, a smart phone, a laptop etc., through the various ‘apps’ , facilitates the user (s) to express his/her emotions by providing various options including images, videos, graphical images such as GIF, texts, icons etc. These options can be inputted by the users during any communication sessions according to their requirements and preferences. For example, a smart phone may be equipped with one or more electronic messaging applications that may be configured to typically receive the user inputs in the form of texts, images and/or icons for facilitating the user to thereby convey their emotions to other users or recipients.
While the conventional communication methods and systems provide the aforementioned options to the users for conveying their emotions during any communication session, the users are typically required to manually insert the inputs to provide their emotional information. A major drawback of such systems is that the options available within the electronic messaging applications are often unable to communicate emotional information that accurately describes the intent and mood of the users, and hence, such conventional communication methods and systems require user intervention for effectively communicating the intended emotions of the user to the other users.
The above‐mentioned limitation of the conventional systems and methods is overcome by automatically providing the options based on the emotional information being detected from the users’ text‐based or voice‐based inputs. However, in such systems, the automatic options are not displayed to the user until a complete text‐based or voice‐based input, e.g. a complete sentence via text or speech, is received from the user. Thus, these systems are able to extract the emotional data only after the inputted sentence is completed by the user. This creates a break in the communication experience of the users, and in order to achieve better accuracy of emotion detection, the user must pause before completing the whole conversation for the right amount of text to be captured by the system.
United States patent publications US20170147202 and US20030110450 disclose solutions for expressing emotion information in a text message, based on the keyboard inputs and voice/audio inputs received from a user. However, text input parameters and voice input parameters such as typing speed and voice pitch may allow detection of only high-intensity emotions, and hence all types and all degrees of emotion are difficult to predict from the text and voice inputs. These types of inputs therefore require an additional input parameter of users’ gestures in order to accurately detect the emotion of the user in real time.
In various existing methods and systems for expressing emotions in users’ text messages, deep learning techniques are also used that detect emotional contexts within phrases or sets of text being inputted by the user. For example, the phrases “Go Away! ” and “Shut Up” etc. can be identified, by the state‐of‐the‐art deep learning techniques, as having an anger element, even if the word “anger” or any of its synonyms has not been used in said phrases. However, only a limited number of phrases pertaining to limited types of emotions can be detected by the deep‐learning techniques, which may not detect all kinds of emotional contexts with accuracy.
Therefore, there is a need to provide a solution to the above problem for automatically detecting the users’ emotions by analysing the inputs received from the users in real‐time, and managing the received inputs therein for providing accurate input options to the users to express their emotions during a communication session.
SUMMARY
This section is provided to introduce certain aspects of the present invention in a simplified form that are further described below in the detailed description. This summary is not intended to identify the key features or the scope of the claimed subject matter.
In view of the afore‐mentioned drawbacks and limitations of the prior art, it is an object of the present invention to provide methods and systems for automatically analysing the inputs received from the users in real‐time. Another object of the invention is to manage the received user inputs in a manner such that accurate input options are provided to the users to express their emotions in any text messages during a communication session. Another object of the invention is to receive real‐time images having at least one gesture of the user. Yet another object of the present invention is to detect emotional data from the received inputs and the real‐time images. A further object of the present invention is to recommend to the users accurate input contents based on the emotional data for expressing emotions. Yet another object of the present invention is to facilitate the users to use additional input options for expressing their emotions in the text messages during a communication session.
In view of these and other objects, one aspect of the present invention may relate to a method for dynamically recommending at least one input content to a user, in a communication session, over a communication network, the method comprising the steps of: receiving, via an input module, at least one user input in real‐time; receiving, via a camera module, at least one portion of an image of the user, the at least one portion indicating at least one gesture of the user in real‐time; identifying, using an expression identification module, at least one expression associated with the at least one gesture of the user; identifying, using an emotion detection module, at least one emotional data based on the at least one expression and the at least one user input; determining, using a processing module, the at least one input content, for the at least one user input, based on the at least one emotional data; and recommending to the user, using a display module, to select and use the at least one input content along with the at least one user input, during the communication session.
Another aspect of the invention may relate to a system for dynamically recommending at least one input content to a user in a communication session over a communication network. The system comprises an input module configured to receive at least one user input in real‐time; a camera module, configured to receive at least one portion of an image of the user, the at least one portion indicating at least one gesture of the user in real‐time; an expression identification module configured to identify at least one expression associated with the at least one gesture of the user; an emotion detection module configured to identify at least one emotional data based on the at least one expression and the at least one user input; a processing module configured to determine the at least one input content, for the at least one user input, based on the at least one emotional data; and a display module configured to  recommend to the user, to select and use the at least one input content along with the at least one user input, during the communication session.
Another aspect of the invention may relate to a method and a system for continuously tracking the at least one gesture of the user in real-time to identify any changes; updating the emotional data based on the changes identified in the at least one gesture; and recommending an updated at least one content to the user, based on the updated emotional data.
Another aspect of the invention may relate to a method and a system for marking a beginning and an end of the at least one gesture received from the user; automatically segmenting the at least one user input to create one or more segments based on the beginning and the end of the at least one gesture; identifying a corresponding emotional data for each segment of the one or more segments based on the at least one gesture and the at least one user input; and recommending to the user the at least one input content based on the corresponding emotional data for each segment of the at least one user input.
Another aspect of the invention may relate to a method and a system for providing an indication to the user whether or not the at least one portion of the real‐time image is being adequately captured by the camera module.
BRIEF DESCRIPTION OF DRAWINGS
The accompanying drawings, which are incorporated herein and constitute a part of the present invention, illustrate exemplary embodiments of the disclosed methods and systems, in which like reference numerals refer to the same parts throughout the different drawings. Components in the drawings are not necessarily to scale, emphasis instead being placed upon clearly illustrating the principles of the present invention. Some drawings may indicate the components using block diagrams and may not represent the internal circuitry of each component. It will be appreciated by those skilled in the art that disclosure of such drawings includes disclosure of electrical components or circuitry commonly used to implement such components. The connections between the sub-components of a component have not been shown in the drawings for the sake of clarity; therefore, all sub-components shall be assumed to be connected to each other unless explicitly stated otherwise herein.
Fig. 1 illustrates a system architecture [100] for dynamically recommending at least one input content to a user in a communication session over a communication network, in accordance with exemplary embodiments of the present invention.
Fig. 2 is a block diagram illustrating the system [100] elements for providing expression identification and tracking, in accordance with exemplary embodiments of the present invention.
Fig. 3 is a block diagram illustrating the system [100] elements for identifying emotional data, in accordance with exemplary embodiments of the present invention.
Fig. 4 is a block diagram illustrating the system [100] elements performing actions based on available history and profile information, in accordance with exemplary embodiments of the present invention.
Fig. 5 illustrates a scenario wherein the user is given an indication of the real‐time image being captured, in accordance with exemplary embodiments of the present invention.
Fig. 6 illustrates a scenario wherein the user is prompted to use an updated input content based on any change in the emotional data, in accordance with exemplary embodiments of the present invention.
Fig. 7 illustrates a scenario wherein the user is prompted to use multiple input content options based on multiple emotions in a single text content, in accordance with exemplary embodiments of the present invention.
Fig. 8 is a flowchart illustrating the method for dynamically recommending at least one input content to a user in a communication session over a communication network, in accordance with exemplary embodiments of the present invention.
Fig. 9 illustrates the framework supporting the execution of the exemplary embodiments of the present invention.
DETAILED DESCRIPTION
In the following description, for the purposes of explanation, various specific details are set forth in order to provide a thorough understanding of embodiments of the present invention. It will be apparent, however, that embodiments of the present invention may be practiced without these specific details. Several features described hereafter can each be used independently of one another or with any combination of other features. An individual feature may not address any of the problems discussed above or might address only one of the problems discussed above.
The present invention encompasses systems and methods for dynamically recommending input contents based on identification of user emotions, during a communication session between users over a communication network. At least one user input is received in real‐time from a user via an input module. Along with the at least one user input, at least one portion of an image of the user is also received using a camera module. The at least one portion of the user image indicates at least one gesture of the user in real‐time. Thereafter, at least one expression associated with the at least one gesture of the user is identified. At least one emotional data is determined based on the at least one expression and the at least one user input. Subsequently, using the at least one emotional data, the at least one input content is determined. The at least one input content is thereafter recommended to the user, for selecting and using the recommended at least one input content along with the at least one user input, during the communication session.
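By way of illustration only, the flow described above may be sketched in Python; the class names mirror the disclosed modules, while the gesture labels, keyword set and content table are hypothetical placeholders rather than the actual identification logic.

```python
# Minimal orchestration sketch of the described flow (illustrative only).

class ExpressionIdentificationModule:
    def identify(self, gesture: str) -> str:
        # map a detected gesture in the image portion to an expression label
        return {"frown": "angry", "smile": "happy"}.get(gesture, "neutral")

class EmotionDetectionModule:
    POSITIVE_WORDS = {"good", "great", "happy"}

    def identify(self, expression: str, user_input: str) -> str:
        # fall back to the typed text when the face gives no clear signal
        if expression != "neutral":
            return expression
        words = set(user_input.lower().split())
        return "happy" if words & self.POSITIVE_WORDS else "neutral"

class ProcessingModule:
    CONTENT = {"angry": "angry emoticon", "happy": "smiling emoticon",
               "neutral": "thumbs-up sticker"}

    def determine(self, emotional_data: str) -> str:
        return self.CONTENT[emotional_data]

# one pass of the pipeline for a single keystroke and camera frame
expression = ExpressionIdentificationModule().identify("smile")
emotion = EmotionDetectionModule().identify(expression, "My trip was good")
print(ProcessingModule().determine(emotion))  # smiling emoticon
```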
As used herein, “hardware” includes a combination of discrete components, an integrated circuit, an application specific integrated circuit, a field programmable gate array, other programmable logic devices and/or other suitable hardware as may be obvious to a person skilled in the art.
As used herein, “software” includes one or more objects, agents, threads, lines of code, subroutines, separate software applications, or other suitable software structures as may be obvious to a skilled person. In one embodiment, software can include one or more lines of code or other suitable software structures operating in a general‐purpose software application, such as an operating system, and one or more lines of code or other suitable software structures operating in a specific purpose software application.
As used herein, “application” or “applications” or “apps” are the software applications residing in respective electronic communication devices and can be either pre‐installed or can be downloaded and installed in said devices. The applications include, but are not limited to, contact management application, calendar application, messaging applications, image and/or video modification and viewing applications, gaming applications, navigational applications, office applications, business applications, educational applications, health and fitness applications, medical applications, finance applications, social networking applications, and any other application. The application uses “data” that can be created, modified or installed in an electronic device over time. The data includes, but is not limited to, contacts, calendar entries, call logs, SMS, images, videos, factory data, emails and data associated with one or more applications.
As used herein, “couple” and its cognate terms, such as “couples” and “coupled” includes a physical connection (such as a conductor) , a virtual connection (such as through randomly assigned memory locations of data memory device) , a logical connection (such as through logical gates of semiconducting device) , other suitable connections, or a combination of such connections, as may be obvious to a skilled person.
As used herein, “electronic communication device” includes, but is not limited to, a mobile phone, a wearable device, smart phone, set‐top boxes, smart television, laptop, a general‐purpose computer, desktop, personal digital assistant, tablet computer, mainframe computer, or any other computer  implemented electronic device that is capable of making transactions of communication messages or data, as may be known to a person skilled in the art.
As used herein, an 'expression' of the user is detected through facial expressions, hand movements, finger movements, thumb movements, head movements, leg movements, etc. The facial features of the user include movements of the eyes, nose, lips, eyebrows, jaw, etc. The expression can be classified into a particular category by using state-of-the-art classifiers suitable for specific expression recognition.
As used herein, "emotional data" is any data pertaining to the expression of the user and the emotions of the user being inputted through a text, an audio, an icon or any image. The emotional data is determined by analysing the received user input, for example, words, sentences, phrases, GIFs, images, icons, etc., along with the user expression. The emotional data can be classified into a particular type and degree by using predefined categories of human emotions, wherein the predefined categories of human emotions may be stored locally on the device or on one or more remote servers.
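By way of a concrete (and purely hypothetical) illustration, emotional data carrying a type and a degree could be represented as follows; the category and intensity lists merely echo the examples given later in this description (Happy, Sad, Angry, Sleepy; Low, Medium, High, Very High) and are not the predefined categories themselves.

```python
from dataclasses import dataclass
from enum import Enum

class EmotionType(Enum):
    # example categories; the actual predefined set may differ
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"
    SLEEPY = "sleepy"

class Degree(Enum):
    # example intensities of the emotion type
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    VERY_HIGH = 4

@dataclass
class EmotionalData:
    emotion_type: EmotionType
    degree: Degree

sample = EmotionalData(EmotionType.ANGRY, Degree.VERY_HIGH)
print(sample)  # EmotionalData(emotion_type=<EmotionType.ANGRY: 'angry'>, degree=<Degree.VERY_HIGH: 4>)
```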
As used herein, a 'depth camera' is a type of camera that is capable of capturing the depth information of any scene or video frame that is being received as an input.
Fig. 1 illustrates a system architecture for dynamically recommending at least one input content to a user in a communication session over a communication network, in accordance with exemplary embodiments of the present disclosure. As shown in Fig. 1, the system [100] comprises a data managing module [102] , a profile managing module [104] , a dynamic content module [106] , a messaging application module [112] , a processing module [110] , a camera module [108] , an input module [118] , an emotion detection module [114] and an expression identification module [116] .
According to the embodiments of the present invention, the messaging application module [112] initiates any communication session between the users. The messaging application module [112] is configured to initiate one or more third-party messaging applications, social networking applications, instant messenger applications, online chat applications running on any portals, etc., that require the users to input text, voice or images and accordingly convey their messages during the respective communication sessions. The messaging application module [112] may also be triggered by the installation of any input devices including a keyboard, a mouse, joysticks, etc. The messaging application module [112] may also be triggered by a touch input received from the user.
In one embodiment of the present invention, the messaging application module [112] is communicably coupled to the processing module [110] and makes a request to the processing module [110] for identifying the expression and emotions of the user. The processing module [110] is coupled to the expression identification module [116] and the emotion detection module [114] , which respectively identify the expression and the emotional data from the inputs being received. According to the embodiments of the present invention, the expression identification module [116] receives at least one portion of the image of the user that indicates at least one gesture of the user. The expression identification module [116] identifies the at least one expression associated with the at least one gesture of the user. Further, the emotion detection module [114] identifies at least one emotional data based on the combination of the at least one expression and the at least one user input. The emotional data is further used by the processing module [110] to determine at least one input content that is recommended to the user via a display module. The at least one input content may include, but is not limited to, an icon, an emoticon, an image, a sticker, a text, a set of texts, a word, a set of words, etc. Therefore, the user may use the recommended at least one input content in the transactions of messages, using the messaging applications being executed on the respective device.
As disclosed above, the messaging application module [112] invokes the processing module [110] by sending a request to analyze the at least one user input and the at least one portion of an image of the user. The processing module [110] thereafter obtains the at least one portion of the image of the user from the camera module [108] and the at least one user input from the input module [118] , and accordingly determines the at least one input content to be recommended to the user based on the emotional data. The processing module [110] therefore acts as a context analyser that analyses and processes the received inputs to identify the emotional context associated with them. As discussed above, the processing module [110] is coupled to the camera module [108] and the input module [118] for receiving the user image and the at least one user input respectively. The camera module [108] may include a depth camera that is capable of capturing depth information of the real-time image of the user.
The user image received using the camera module [108] comprises the at least one portion indicating at least one gesture of the user in real-time. The processing module [110] is communicably coupled to the expression identification module [116] and the emotion detection module [114] , which are supported by Artificial Intelligence (AI) tools and frameworks to respectively identify the expressions and emotions of the user while the user is typing any input or making any gesture during an ongoing communication session.
In various embodiments of the present invention, the input module [118] provides the at least one user input when the third-party messaging applications are invoked. The at least one user input may include a word, a set of texts, sentences, phrases, voice data, stored images, live images, media, etc. Further, the camera module [108] may receive the image of the user in real-time, wherein at least one portion of the user image indicates the at least one gesture of the user in real-time. The at least one gesture is analysed by the expression identification module [116] to identify the expression of the user being conveyed through his/her gesture.
Further, the emotion detection module [114] and the expression identification module [116] respectively receive the at least one user input and the at least one portion of the image, and continuously analyse them to identify any change in the expression being conveyed by the user. In the event of any change in the expression in the received inputs, either via the input module [118] or via the camera module [108] , the processing module [110] detects the change while receiving the at least one user input and the at least one portion of the image of the user. A change in the at least one user input or in the at least one gesture of the image of the user corresponds to a change in the mood of the user. For example, when the user types the message sentence "My trip was not good" , the emotion detection module [114] determines that the user's mood is 'sad' . However, when the user changes the input by deleting the word 'not' so that the sentence reads "My trip was good" , the emotion detection module [114] identifies that the user's mood is 'happy' .
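A toy sketch of this per-edit re-evaluation is given below; the negation-aware keyword rule is an assumption that stands in for whatever trained text-emotion classifier the emotion detection module [114] actually uses, and it is written only to reproduce the "My trip was not good" example.

```python
# Toy re-evaluation of the typed text after each edit (illustrative only;
# a real emotion detection module would use a trained classifier).

def mood_from_text(sentence: str) -> str:
    words = sentence.lower().split()
    positive = any(w in {"good", "great", "nice"} for w in words)
    negated = "not" in words or "never" in words
    if positive and not negated:
        return "happy"
    if positive and negated:
        return "sad"
    return "neutral"

print(mood_from_text("My trip was not good"))  # sad
print(mood_from_text("My trip was good"))      # happy (user deleted "not")
```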
The changes can also be recorded in the event of any change in the user's gestures being made in real-time. For example, the user may change his/her facial expression from an angry face to a happy expression. Any such change in the expression is detected and accordingly analysed by the expression identification module [116] , which may be supported by artificial intelligence tools and mechanisms. According to the embodiments of the present invention, the expression identification module [116] is capable of identifying any change in the expression and accordingly determining the current mood of the user.
In various embodiments of the present invention, the processing module [110] also invokes the camera module [108] for receiving the image feed to be analyzed and processed and, subsequently, interacts with the expression identification module [116] for extracting the data and generating an expression-based system event. Once the at least one user input is received and the at least one expression associated with the at least one gesture of the user is identified, the at least one emotional data is identified by the emotion detection module [114] . Thereafter, the processing module [110] determines the at least one input content by analyzing and processing the emotional data determined from the at least one user input along with the at least one expression. The processing module [110] also requests the emotion detection module [114] to obtain the type and degree of the at least one emotional data. For example, the type of the at least one emotional data may include Happy, Sad, Angry, Sleepy, etc., and the degree of the emotional data type may indicate the intensity of the mood type, such as Low, Medium, High, Very High, etc. The type and degree of the at least one emotional data may be processed further to accurately determine the at least one input content that can subsequently be recommended to the user. For instance, the user is typing "How dare you? " and, simultaneously, the camera module [108] captures the image of the user with an angry face. In such a case, the expression of the user is "angry face" . Further, the "angry face" expression along with the "How dare you? " input are analysed together to determine the emotional data. The emotional data may therefore be determined as "very angry" in this case, and the at least one input content suggested to the user may be a red-faced emoticon indicating an angry expression.
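The following hypothetical sketch illustrates how the typed input and the identified expression could be fused into a (type, degree) pair and mapped to a recommended content; the rules and the content table are example assumptions, not the disclosed classifier.

```python
# Illustrative fusion of the identified expression with the typed input into
# a (type, degree) pair, and mapping that pair to a recommended content.

def emotional_data(expression: str, user_input: str) -> tuple[str, str]:
    text = user_input.lower()
    if expression == "angry" and ("how dare" in text or "!" in text):
        return ("angry", "very high")
    if expression == "angry":
        return ("angry", "medium")
    if expression == "happy":
        return ("happy", "high")
    return ("neutral", "low")

RECOMMENDATION = {
    ("angry", "very high"): "red-faced angry emoticon",
    ("angry", "medium"): "annoyed emoticon",
    ("happy", "high"): "grinning emoticon",
}

etype, degree = emotional_data("angry", "How dare you?")
print(RECOMMENDATION.get((etype, degree), "plain smiley"))  # red-faced angry emoticon
```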
The data managing module [102] is configured to manage the on-device data pertaining to the communication sessions conducted by the user through any electronic communication device (s) . The data managing module [102] may be located on the device. The data managing module [102] performs the function of managing the data, including storing the data, formatting any text inputs, etc. It also stores any pre-processed data and performs the specific transactions of any data between corresponding modules within the system [100] . For example, the data managing module [102] also receives and stores the profile of the user as well as the contact information of a plurality of the user's contacts. Each of the user's contacts may have a certain degree of affinity with the user, and based on the degree of affinity, the user may use different types of contents while inputting any message during a communication session. The data managing module [102] uses the user's profile and contact information to accordingly manage the data pertaining to the communication sessions. The data managing module [102] also performs formatting of the at least one user input based on at least one of: the emotional data, the at least one expression and the at least one user input.
The data managing module [102] is communicatively coupled to the profile managing module [104] for receiving the profile information of the user. Further, the profile managing module [104] also provides the user profile information to various third‐party applications. The profile managing module [104] uses various data, for example, call log data, message content data, and other types of data to determine and create the user profile with respect to other senders or receivers. The profile information can also be used to customize the formatting options and accordingly generate personalized content for the user’s specific contacts and friends list. The profile information can also be used to filter any content with respect to the specific contacts and friends list.
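One possible (assumed) way in which the degree of affinity stored in the profile could filter candidate contents is sketched below; the 0.0-1.0 affinity scale, the 0.7 threshold and the "casual" flag are illustrative assumptions.

```python
# Hypothetical filtering of candidate contents by the contact's degree of
# affinity taken from the user profile (assumed 0.0 - 1.0 scale).

PROFILE = {"alice": {"affinity": 0.9}, "manager": {"affinity": 0.3}}

CANDIDATES = [
    {"content": "winking GIF", "casual": True},
    {"content": "plain thumbs-up icon", "casual": False},
]

def filter_by_affinity(contact: str, candidates: list) -> list:
    affinity = PROFILE.get(contact, {}).get("affinity", 0.5)
    # close contacts get casual content as well; formal contacts only neutral content
    return [c["content"] for c in candidates if affinity >= 0.7 or not c["casual"]]

print(filter_by_affinity("alice", CANDIDATES))    # ['winking GIF', 'plain thumbs-up icon']
print(filter_by_affinity("manager", CANDIDATES))  # ['plain thumbs-up icon']
```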
The dynamic content module [106] is coupled to the data managing module [102] and the messaging application module [112] through the profile managing module [104] . The dynamic content module [106] dynamically searches for any data or content in any in-built storage module or database of the local device, and provides the same to the profile managing module [104] as well as to the data managing module [102] . The searched data is further used by the messaging application module [112] in the communication sessions being conducted between the users. The dynamic content module [106] also searches for any online data from various network servers located across any local or remote networks, for example, LAN, WAN, the Internet, etc.
Thus, the system [100] according to the embodiments of the invention is configured to receive at least one user input in real-time and also a real-time image of the user, wherein at least one portion of the image indicates gestures of the user. The expression identification module [116] identifies at least one expression associated with the at least one gesture of the user. The emotion detection module [114] identifies at least one emotional data based on the at least one expression and the at least one user input. The emotional data is used to determine the at least one input content that is thereby recommended to the user via a display module. The user is also prompted to select and use the at least one input content in combination with the at least one user input during the communication session. In various embodiments of the present invention, the display module comprises various elements including at least one of: a touch screen, any display screen, a graphical user interface module, etc.
Fig. 2 is a block diagram illustrating the system [100] elements for providing expression identification and tracking, in accordance with exemplary embodiments of the present disclosure. The messaging application module [112] initiates the communication session for the user and invokes the processing module [110] by sending a request to analyze the user inputs. Subsequently, the processing module [110] seeks inputs from the camera module [108] and, once the processing module [110] receives the image of the user, it transmits the image of the user to the expression identification module [116] . The image captured by the camera module [108] includes at least one portion of the user image that indicates at least one gesture of the user in real-time. The expression identification module [116] continuously tracks the at least one gesture of the user as indicated by the user image and analyses the same to identify the expression of the user being conveyed through his/her gesture. The expression identification module [116] further analyses any change in the expression of the user. The expression identification module [116] also identifies the types and degrees of the expression, for example, happy, sad, angry, etc., and very happy, very sad, terribly angry, etc.
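An illustrative sketch of such continuous tracking is given below: an expression-based event is raised only when the label identified for consecutive frames changes. The event structure and labels are assumptions for the example.

```python
# Sketch of continuous expression tracking: an expression-based event is
# raised only when the label changes between consecutive frames.

from typing import Iterable

def expression_events(frame_labels: Iterable[str]):
    previous = None
    for label in frame_labels:
        if label != previous:
            yield {"event": "expression_changed", "from": previous, "to": label}
            previous = label

for event in expression_events(["angry", "angry", "happy", "happy", "very happy"]):
    print(event)
# {'event': 'expression_changed', 'from': None, 'to': 'angry'}
# {'event': 'expression_changed', 'from': 'angry', 'to': 'happy'}
# {'event': 'expression_changed', 'from': 'happy', 'to': 'very happy'}
```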
In one embodiment of the present invention, the messaging application module [112] is an application framework that, along with the camera module [108] and the processing module [110] , supports execution of various software applications including at least one messaging application or any other software application. For example, the messaging application module [112] may support a gaming application installed in the user device, in which the user, as a game player, may express different expressions while playing a game and accordingly send to the other players messages related to the ongoing game. The user may also express his/her emotions by using the input module [118] comprising at least one of: a keyboard, a mouse, a joystick, etc.
Fig. 3 is a block diagram illustrating the system [100] elements for identifying emotional data, in accordance with exemplary embodiments of the present disclosure. The processing module [110] provides the received inputs, viz. the at least one user input and the at least one portion of the user image, to the emotion detection module [114] . As explained earlier, the emotion detection module [114] continuously tracks the user inputs received via the input module [118] and the camera module [108] to identify any change in the expression being conveyed through said inputs. In the event of any change being detected, the emotion detection module [114] analyses the change and accordingly identifies the current emotion of the user in the form of at least one emotional data. The processing module [110] requests the emotion detection module [114] to provide the emotional data including the type and degree of the user's emotion. The emotion detection module [114] accordingly sends the at least one emotional data to the processing module [110] for determining the at least one input content that may thereafter be recommended to the user. The at least one input content, for example an emoticon or a smiley, is used by the user during the communication sessions being conducted by the messaging application module [112] .
Fig. 4 is a block diagram illustrating the system [100] elements performing actions based on available history and profile information, in accordance with exemplary embodiments of the present disclosure. As illustrated in the figure, the profile managing module [104] , the data managing module [102] and the dynamic content module [106] interact with each other to manage the data pertaining to the communication sessions between the users. The data is managed based on profile information and the history of user actions and call logs across several communication sessions. Further, the messaging application module [112] interacts with the profile managing module [104] , the data managing module [102] and the dynamic content module [106] for obtaining the profile information and call log history of the user, any updated profile information, and the degree of affinity with the other users. Based on the data and the information provided to the messaging application module [112] , the processing module [110] updates the at least one input content and recommends the same to the user. Therefore, the user may be recommended the at least one input content based on the available history and profile information.
Figs. 5a, 5b and 5c illustrate a scenario wherein the user is given an indication of the real-time image being fully or partially captured by the camera module [108] , in accordance with exemplary embodiments of the present disclosure. An electronic communication device [506] having a camera module [108] , a display screen [508] and a user interface [510] is shown in Figs. 5a, 5b and 5c. As disclosed earlier, the at least one portion of an image [502] of the user is received by the camera module [108] . The at least one portion indicates at least one gesture of the user in real-time.
Figure 5a illustrates the at least one gesture of the user being indicated by the facial expression of the user. The expression identification module [116] identifies the at least one expression associated with the at least one gesture of the user. However, in the event the at least one portion is not adequately captured by the camera module [108] , the expression of the user may not be detected. For example, as shown in Figure 5a, the face of the user is not fully covered by the shaded region [504] . The shaded region [504] indicates the coverage of the user's body part by the camera module [108] . In the example shown in Figure 5a, the face of the user is not adequately captured by the camera module [108] , and therefore the camera module [108] may not detect the user's expression. According to the embodiments of the present invention, an indication is provided on the display screen [508] as to whether or not the at least one portion of the real-time image [502] is being adequately captured by the camera module [108] . The user can accordingly adjust the communication device [506] to a suitable angle or control the camera angle such that an adequate portion of the image is covered by the camera module [108] , thereby triggering the process of recommendation of input content by the processing module [110] .
Figure 5b shows an exemplary scenario wherein, depending upon the coverage region [504] of the face area, the user is facilitated to control the initiation and execution of the features disclosed in the proposed invention. Initially, when the user starts typing the input text via the input module [510] , the camera module [108] tries to capture the face of the user indicating at least one gesture of the user in real-time. The user may also make the at least one gesture through his/her at least one body part including facial expressions, hand movements, finger movements, etc. The at least one gesture captured by the camera module [108] is used to detect any expression of the user. In the scenario shown in Figure 5b, the camera does not fully capture the at least one body part of the user, i.e. his face. Therefore, the camera module [108] is unable to detect the expression of the user as the complete face of the user is not captured. In this scenario, the user is given a first indication [512] via the display screen [508] or the display module that the complete face of the user is not being detected by the camera, and the user is also prompted through the GUI (graphical user interface) to adjust the camera to an accurate position or angle, to enable the identification of accurate emotions of the user. If the user moves the face while typing the user input (to cover the whole face for expression detection) , the emotion-based formatting can be performed accurately without breaking the typing flow. This provides an advantage over the existing systems and methods, which require user intervention for formatting the text inputs manually.
Figure 5c shows complete coverage of the face of the user, which enables the identification of accurate emotions of the user. A second indication [514] is given to the user upon complete coverage of the at least one portion of the user image that shows any gesture indicating the current emotion of the user. Accordingly, the user types the inputs, which are further analysed along with the expression of the face, and the real-time input content based on the user's emotion is thereby recommended to the user.
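A simple, assumed way to approximate the "adequately captured" check is to compare the detected face bounding box with the visible frame; the (x, y, width, height) box format and the 0.85 visibility threshold below are illustrative assumptions.

```python
# Hypothetical adequacy check for the captured face region. The face box is
# assumed to be (x, y, width, height) in pixels.

def coverage_indication(face_box, frame_w, frame_h, threshold=0.85):
    x, y, w, h = face_box
    visible_w = max(0, min(x + w, frame_w) - max(x, 0))
    visible_h = max(0, min(y + h, frame_h) - max(y, 0))
    visible_ratio = (visible_w * visible_h) / float(w * h)
    if visible_ratio >= threshold:
        return "face adequately captured"           # cf. second indication [514]
    return "adjust camera angle to cover the face"  # cf. first indication [512]

print(coverage_indication((500, 100, 300, 300), 640, 480))  # adjust camera angle to cover the face
print(coverage_indication((150, 100, 300, 300), 640, 480))  # face adequately captured
```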
Fig. 6 illustrates a scenario wherein the user is prompted to use an updated input content based on any change in the emotional data, in accordance with exemplary embodiments of the present disclosure. As shown in the figure, the user is typing a text with multiple types of emotions in a single text and is also expressing different emotions. As the user changes his/her expression from one mood to another, the change in the expression is accordingly analysed by the expression identification module [116] in real-time and the updated at least one input content is thereby determined. The user is automatically and dynamically prompted to use the updated at least one input content. The expression identification module [116] and the emotion detection module [114] continuously track the gesture of the user and the at least one user input in real-time to identify any changes and subsequently determine the updated at least one input content for the user. The invention encompasses dynamic segmentation of the received user inputs for providing a smooth analysis of multiple emotions expressed by the user through the text input along with the gestures, without breaking the experience of the user during the communication session.
Fig. 7 illustrates a scenario wherein the user is prompted to select from multiple options of the at least one input content based on the changes detected in the emotional data, in accordance with exemplary embodiments of the present disclosure. Based on the sequence of emotions being detected, different types of content can be filtered to replace the text. The multiple-emotion content may include various options such as an image, a GIF or a video, etc., that can have two or more segments depicting different layers of expression. For example, one GIF content may begin with a funny part and end with an angry part. In another example, an image may depict a conversation where the top conversation in the image may be sad, followed by the bottom part of the image ending with an angry conversation. As shown in the figure, based on the text typed, the user can use a one-finger swipe gesture over the side icon to view single-emotion content, or a two-finger swipe gesture to view multiple-emotion content from the side icon.
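A hypothetical sketch of filtering multi-segment contents by the detected emotion sequence, and of splitting them between the one-finger and two-finger swipe views, is given below; the content library and its emotion sequences are example data.

```python
# Illustrative filter that keeps only the contents whose emotion sequence
# matches the sequence detected in the typed text (example data only).

LIBRARY = [
    {"content": "funny-then-angry GIF", "sequence": ["happy", "angry"]},
    {"content": "sad-then-angry image", "sequence": ["sad", "angry"]},
    {"content": "plain angry emoticon", "sequence": ["angry"]},
]

def match_sequence(detected):
    return [item["content"] for item in LIBRARY if item["sequence"] == detected]

print(match_sequence(["sad", "angry"]))                            # ['sad-then-angry image']
# one-finger swipe: single-emotion items; two-finger swipe: multi-emotion items
print([i["content"] for i in LIBRARY if len(i["sequence"]) == 1])  # ['plain angry emoticon']
print([i["content"] for i in LIBRARY if len(i["sequence"]) > 1])   # ['funny-then-angry GIF', 'sad-then-angry image']
```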
Fig. 8 is a flowchart illustrating the method for dynamically recommending at least one input content to a user, in a communication session, over a communication network, in accordance with exemplary embodiments of the present disclosure. At step 802, at least one user input is received in real-time. The at least one user input includes, but is not limited to, a text input, a speech input, a video input and an image input. Further, at least one portion of an image of the user is also received, wherein the at least one portion indicates at least one gesture of the user in real-time. The at least one gesture includes, but is not limited to, a facial expression and a behaviour pattern of the user. A processing module [110] is configured to receive the at least one user input and the at least one portion of the user image via the input module [118] and the camera module [108] respectively.
At step 804, at least one expression associated with the at least one gesture of the user is identified. The at least one expression is identified by the expression identification module [116] that is communicably coupled to the processing module [110] . Further, the processing module [110] is also communicably coupled to the emotion detection module [114] that is configured to identify at least one emotional data based on the at least one expression and the at least one user input. The emotional data pertains to at least one type of human emotion, wherein the at least one type of human emotion has at least one degree. According to the embodiments of the present invention, the at least one user input may also be formatted based on at least one of: the emotional data, the at least one expression and the at least one user input. The formatting of the at least one user input may also be based on the degree of affinity of the user with the other users. For example, the user may use different contents while communicating with his/her family and friends than while communicating with business colleagues. The information pertaining to the degree of affinity may be stored by the data managing module [102] and the profile managing module [104] . Further, an indication is also provided to the user, via the display module, whether or not the at least one portion of the real-time image is being adequately captured by the camera module [108] . In the event the camera module [108] is unable to capture the gesture of the user, the user may adjust the angle of his/her device such that the at least one portion of the user image, which indicates the gesture of the user, may be adequately captured and inputted to the processing module [110] .
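By way of an assumed illustration, formatting of the typed input based on the emotional data and the affinity with the recipient might look as follows; the specific formatting rules (appending an emoticon, upper-casing) are examples only, not the disclosed formatting scheme.

```python
# Hypothetical formatting of the typed input based on emotional data and
# the affinity with the recipient (rules are illustrative assumptions).

def format_input(text: str, emotion_type: str, degree: str, affinity: float) -> str:
    formatted = text
    if emotion_type == "happy" and affinity >= 0.7:
        formatted += " :-)"                 # casual decoration only for close contacts
    if emotion_type == "angry" and degree in {"high", "very high"}:
        formatted = formatted.upper()       # e.g. emphasise a strongly felt message
    return formatted

print(format_input("how dare you", "angry", "very high", 0.9))   # HOW DARE YOU
print(format_input("my trip was good", "happy", "medium", 0.9))  # my trip was good :-)
print(format_input("my trip was good", "happy", "medium", 0.2))  # my trip was good
```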
At step 806, the at least one input content is determined for the at least one user input, based on the at least one emotional data. The at least one input content is determined by the processing module [110] . According to the embodiments of the present invention, the at least one input content includes, but is not limited to, an icon, an emoticon, a video, an audio, a graphic interchange format (GIF) content, and an image.
At step 810, the at least one input content is recommended to the user via the display module, to select and use in combination with the at least one user input during the communication session. In one embodiment of the present invention, the user is further recommended to replace the at least one user input with the at least one content, in an event the user selects the at least one input content that is being displayed. According to the embodiments of the present invention, the at least one input content is recommended to the user while the user is typing the at least one user input in at least one text message during the communication session. Further, the at least one gesture of the user is continuously tracked in real-time to identify any changes. Accordingly, the emotional data is updated based on the changes identified in the at least one gesture and the user is recommended an updated at least one content based on the updated emotional data.
The present invention encompasses the recommendation of the at least one input content further based on segmentation of the user inputs received from the user. A beginning and an end of the at least one gesture received from the user are marked. Thereafter, based on the beginning and the end of the at least one gesture, the at least one user input is automatically segmented to create one or more segments. A corresponding emotional data is identified for each segment of the one or more segments based on the at least one gesture and the at least one user input. Subsequently, the user is recommended the at least one input content based on the corresponding emotional data for each segment of the at least one user input.
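An illustrative sketch of this segmentation is given below; the assumption that each marked gesture carries the text positions at which it began and ended, and the per-segment suggestion table, are made only for the example.

```python
# Illustrative segmentation of the typed input: each gesture carries the text
# positions at which it began and ended, and those marks split the text into
# segments, each with its own emotional data and suggestion.

def segment_input(text: str, gestures: list) -> list:
    segments = []
    for g in gestures:
        segments.append({
            "text": text[g["start"]:g["end"]],
            "emotion": g["expression"],   # per-segment emotional data
            "suggestion": {"happy": "smiling emoticon",
                           "angry": "angry emoticon"}.get(g["expression"], "plain icon"),
        })
    return segments

typed = "The party was great but the traffic made me furious"
gestures = [
    {"start": 0, "end": 19, "expression": "happy"},           # marked during the first clause
    {"start": 19, "end": len(typed), "expression": "angry"},  # marked during the second clause
]
for s in segment_input(typed, gestures):
    print(s["text"].strip(), "->", s["suggestion"])
# The party was great -> smiling emoticon
# but the traffic made me furious -> angry emoticon
```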
Fig. 9 illustrates the framework [900] supporting the execution of the exemplary embodiments of the present invention. The system and the method of the present invention may be implemented on various compatible frameworks such as the Android framework, the iOS framework, etc., that may include built-in AI chips (NPU - Neural Processing Unit) and faster machine learning technologies. According to the embodiments of the present invention, the expression detection and emotion identification may be coupled to the AI service framework, the depth camera and one or more dot projectors provided in the communication devices. The implementation of the AI framework facilitates easy identification of each of the minor and subtle changes in the expression of the user, for example, minor changes in the face movement or facial expression of the user. The implementation of the AI framework also facilitates obtaining better accuracy in tracking the user's expression.
The features of the present invention as described herein thus offer the following technical advantages over the conventional systems and methods: (1) accurate identification of the emotions of the user while the user is typing the input, (2) usage of the expression of the user as well as the user input for determining the emotions (i.e. emotional data) of the user, (3) segmentation of the user input based on a change in the gesture/expression of the user, (4) recommendation of more than one input content for one user input while the user is typing, and (5) usage of single-swipe and double-swipe gestures for ease of selection of input content by the user.
The various modules as disclosed herein, including the processing module [110] , may be associated with at least one processor configured to perform data processing, input/output processing, and/or any other functionality that enables the working of the system [100] in accordance with the present disclosure. As used herein, a "processor" refers to any logic circuitry for processing instructions. A processor may be a special purpose processor or a plurality of microprocessors, wherein one or more microprocessors may be associated with at least one controller, a microcontroller, Application Specific Integrated Circuits (ASICs) , Field Programmable Gate Array (FPGA) circuits, and any other type of integrated circuit (IC) , etc. The at least one processor may be a local processor present in the vicinity of the system [100] . The at least one processor may also be a processor at a remote location that processes the method steps as explained in the present disclosure. Among other capabilities, the processor is also configured to fetch and execute computer-readable instructions and data stored in a memory or a data storage device.
According to the embodiments of the present invention, the database may be implemented using a memory, any external storage device, or an internal storage device for storing instructions to be executed and any information and data used by the system [100] to recommend the input options to a user during a communication session. As used herein, a "memory" or "repository" refers to any non-transitory media that stores data and/or instructions that cause a machine to operate in a specific manner. The memory may include a volatile memory or a non-volatile memory. Non-volatile memory includes, for example, magnetic disks, optical disks, solid state drives, or any other storage device for storing information and instructions. Volatile memory includes, for example, a dynamic memory. The memory may be single or multiple, coupled or independent, and encompasses other variations and options of implementation as may be obvious to a person skilled in the art.
The processor, memory, and the system [100] are interconnected to each other, for example, using a communication bus. The “communication bus” or a “bus” includes hardware, software and communication protocol used by the bus to facilitate transfer of data and/or instructions. The communication bus facilitates transfer of data, information and content between these components.
While considerable emphasis has been placed herein on the disclosed embodiments, it will be appreciated that changes can be made to the embodiments without departing from the principles of the present invention. These and other changes in the embodiments of the present invention shall be within the scope of the present invention and it is to be understood that the foregoing descriptive matter is illustrative and non‐limiting.

Claims (26)

  1. A method for dynamically recommending at least one input content to a user, in a communication session, over a communication network, the method comprising the steps of:
    ‐ receiving, via an input module [118] , at least one user input in real‐time;
    ‐ receiving, via a camera module [108] , at least one portion of an image of the user, the at least one portion indicating at least one gesture of the user in real‐time;
    ‐ identifying, using an expression identification module [116] , at least one expression associated with the at least one gesture of the user;
    ‐ identifying, using an emotion detection module [114] , at least one emotional data based on the at least one expression and the at least one user input;
    ‐ determining, using a processing module [110] , the at least one input content, for the at least one user input, based on the at least one emotional data; and
    ‐ recommending to the user, using a display module, to select and use the displayed at least one input content in combination with the at least one user input, during the communication session.
  2. The method further comprising: replacing the at least one user input with at least one content in an event the user selects the at least one content.
  3. The method further comprising: changing the formatting of the at least one user input based on at least one of: the emotional data, the at least one expression and the at least one user input.
  4. The method further comprising: suggesting the at least one input content to the user while the user is typing the at least one user input in at least one text message during the communication session.
  5. The method further comprising:
    ‐ tracking continuously the at least one gesture of the user in real‐time to identify any changes;
    ‐ updating the emotional data based on the changes identified in the at least one gesture;
    ‐ recommending an updated at least one content to the user, based on the updated emotional data.
  6. The method further comprising:
    ‐ marking a beginning and an end of the at least one gesture received from the user;
    ‐ automatically segmenting the at least one user input to create one or more segments based on the beginning and the end of the at least one gesture; and
    ‐ identifying a corresponding emotional data for each segment of the one or more segments based on the at least one gesture and the at least one user input; and
    ‐ recommending to the user, the at least one input content based on the corresponding emotional data for each segment of the at least one user input.
  7. The method wherein the at least one user input includes at least one of: a text input, a speech input, a video input and an image input.
  8. The method wherein the at least one input content includes at least one of: an icon, an emoticon, a video, an audio, a graphic interchange format (GIF) content, and an image.
  9. The method wherein the at least one gesture includes but is not limited to facial expression and a behaviour pattern of the user.
  10. The method further comprising: providing on the display screen, an indication to the user whether or not the at least one portion of the real‐time image is being adequately captured by the camera module [108] .
  11. The method wherein the emotional data pertains to at least one type of human emotion, and wherein the at least one type of human emotion is having at least one degree.
  12. The method further comprising the step of filtering, the at least one content based on the emotional data.
  13. A system [100] for dynamically recommending at least one input content to a user in a communication session over a communication network, the system [100] comprising:
    ‐ an input module [118] configured to receive at least one user input in real‐time;
    ‐ a camera module [108] , configured to receive at least one portion of an image of the user, the at least one portion indicating at least one gesture of the user in real‐time;
    ‐ an expression identification module [116] configured to identify at least one expression associated with the at least one gesture of the user;
    ‐ an emotion detection module [114] configured to identify at least one emotional data based on the at least one expression and the at least one user input;
    ‐ a processing module [110] configured to determine the at least one input content, for the at least one user input, based on the at least one emotional data; and
    ‐ a display module configured to recommend to the user, to select and use the at least one input content in combination with the at least one user input, during the communication session.
  14. The system [100] , wherein the processing module [110] is further configured to:
    track continuously the at least one gesture of the user in real‐time to identify any changes;
    update emotional data based on the changes identified in the at least one gesture; and
    recommending an updated at least one content to the user, based on the updated emotional data.
  15. The system [100] , wherein the processing module [110] is further configured to:
    ‐ mark a beginning and an end of the at least one gesture received from the user;
    ‐ automatically perform segmentation of the at least one user input to create one or more segments based on the beginning and the end of the at least one gesture; and
    ‐ identify a corresponding emotional data for each segment of the one or more segments based on the at least one gesture and the at least one user input; and
    ‐ recommend to the user, the at least one input content based on the corresponding emotional data for each segment of the at least one user input.
  16. The system [100] wherein the at least one user input includes at least one of: a text input, a speech input, a video input and an image input.
  17. The system [100] , wherein the at least one input content includes at least one of: an icon, an emoticon, a video, an audio, a graphic interchange format (GIF) content, and an image.
  18. The system [100] , wherein the at least one gesture includes but is not limited to facial expression and a behaviour pattern of the user.
  19. The system [100] , wherein the camera module [108] comprises a depth camera that is capable of capturing depth information of the real‐time image of the user.
  20. The system [100] , wherein the processing module [110] is further configured to provide, using the display module, an indication to the user whether or not the at least one portion of the real‐time image is being adequately captured by the camera module [108] .
  21. The system [100] , wherein the emotional data pertains to at least one type of human emotion, and wherein the at least one type of human emotion is having at least one degree.
  22. The system [100] , the processing module [110] is further configured to, filter the at least one content based on the emotional data.
  23. The system [100] , further comprising a messaging application module [112] configured to facilitate the communication session.
  24. The system [100] , further comprises a data managing module [102] configured to manage at least one on‐device data pertaining to the communication session.
  25. The system [100] , further comprising a profile managing module [104] configured to provide profile information of the user.
  26. The system [100] , further comprising a dynamic content module [106] configured for searching any relevant data wherein the relevant data is stored in at least one local database and at least one online database across a network.
PCT/CN2019/122695 2019-04-10 2019-12-03 System and method for dynamically recommending inputs based on identification of user emotions WO2020207041A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201980095244.3A CN113785539A (en) 2019-04-10 2019-12-03 System and method for dynamically recommending input based on recognition of user emotion

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN201911014498 2019-04-10
IN201911014498 2019-04-10

Publications (1)

Publication Number Publication Date
WO2020207041A1 true WO2020207041A1 (en) 2020-10-15

Family

ID=72750481

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122695 WO2020207041A1 (en) 2019-04-10 2019-12-03 System and method for dynamically recommending inputs based on identification of user emotions

Country Status (2)

Country Link
CN (1) CN113785539A (en)
WO (1) WO2020207041A1 (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170147202A1 (en) * 2015-11-24 2017-05-25 Facebook, Inc. Augmenting text messages with emotion information

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101072207A (en) * 2007-06-22 2007-11-14 腾讯科技(深圳)有限公司 Exchange method for instant messaging tool and instant messaging tool
CN102255827A (en) * 2011-06-16 2011-11-23 北京奥米特科技有限公司 Video chatting method, device and system
CN103297742A (en) * 2012-02-27 2013-09-11 联想(北京)有限公司 Data processing method, microprocessor, communication terminal and server
CN104753766A (en) * 2015-03-02 2015-07-01 小米科技有限责任公司 Expression sending method and device
US20180349686A1 (en) * 2017-05-31 2018-12-06 Guangdong Oppo Mobile Telecommunications Corp., Ltd. Method For Pushing Picture, Mobile Terminal, And Storage Medium
CN108062533A (en) * 2017-12-28 2018-05-22 北京达佳互联信息技术有限公司 Analytic method, system and the mobile terminal of user's limb action

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023186097A1 (en) * 2022-04-02 2023-10-05 维沃移动通信有限公司 Message output method and apparatus, and electronic device

Also Published As

Publication number Publication date
CN113785539A (en) 2021-12-10

Similar Documents

Publication Publication Date Title
US11983807B2 (en) Automatically generating motions of an avatar
US10311143B2 (en) Preventing frustration in online chat communication
CN110892395B (en) Virtual assistant providing enhanced communication session services
CN110869969B (en) Virtual assistant for generating personalized responses within a communication session
US11138207B2 (en) Integrated dynamic interface for expression-based retrieval of expressive media content
US10984226B2 (en) Method and apparatus for inputting emoticon
US10154071B2 (en) Group chat with dynamic background images and content from social media
CN111107392B (en) Video processing method and device and electronic equipment
US20210365749A1 (en) Image data processing method and apparatus, electronic device, and storage medium
CN110991427B (en) Emotion recognition method and device for video and computer equipment
US20190244427A1 (en) Switching realities for better task efficiency
US11055119B1 (en) Feedback responsive interface
US11928985B2 (en) Content pre-personalization using biometric data
CN107564526B (en) Processing method, apparatus and machine-readable medium
US11169667B2 (en) Profile picture management tool on social media platform
US20190325201A1 (en) Automated emotion detection and keyboard service
US20220092071A1 (en) Integrated Dynamic Interface for Expression-Based Retrieval of Expressive Media Content
JP2022088304A (en) Method for processing video, device, electronic device, medium, and computer program
CN105069013A (en) Control method and device for providing input interface in search interface
US10679042B2 (en) Method and apparatus to accurately interpret facial expressions in American Sign Language
WO2020207041A1 (en) System and method for dynamically recommending inputs based on identification of user emotions
KR20110012491A (en) System, management server, terminal and method for transmitting of message using image data and avatar
JP2018022479A (en) Method and system for managing electronic informed concent process in clinical trial
US20220222955A1 (en) Context-based shape extraction and interpretation from hand-drawn ink input
CN111400443A (en) Information processing method, device and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19924625

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19924625

Country of ref document: EP

Kind code of ref document: A1