WO2024047403A2

WO2024047403A2 - Empathic artificial intelligence platform

Info

Publication number: WO2024047403A2
Application number: PCT/IB2023/000538
Authority: WO
Inventors: Alan COWEN; Christopher Gregory; Janet Ho; Lauren KIM; Jacob METRICK; Michael OPARA; Panagiotis TZIRAKIS; Christopher GAGNE; Jun Hwee OH; Alice BAIRD; Garrett BOSECK
Original assignee: Hume AI Inc.
Priority date: 2022-09-02
Filing date: 2023-08-31
Publication date: 2024-03-07
Also published as: WO2024047403A3

Abstract

Embodiments of the present disclosure include systems and methods for producing a user interface based on identified changes in expressions over time in a media content. The methods can comprise receiving, from a user, the media content corresponding to one or more individuals and displaying the user interface, where the user interface comprises a media region that presents the media content and an expression tracking region. The method may further include predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content, updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content, and annotating the media region of the user interface based on the identified changes in the one or more expressions over time.

Description

EMPATHIC ARTIFICIAL INTELLIGENCE PLATFORM

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. Provisional Application No. 63/403,418, filed on September 2, 2022, the entire contents of which are incorporated herein by reference for all purposes.

FIELD OF DISCLOSURE

[0002] The present disclosure relates generally to a platform for predicting the meanings and social significances of human expressive behaviors (expressions), including facial or bodily movements and vocal or linguistic expressions, such as their associated emotions, sentiments, tonalities, toxicity, user experience, and well-being measures, and/or using artificial intelligence to respond in a manner that generates or reduces specific emotions, sentiments, tonalities, toxicity, user experience, and well-being measures; and using such predictions as feedback to train generative Al models.

BACKGROUND

[0003] Current artificial intelligence (“Al”) systems for measuring and responding to expressive behaviors (“empathic Al algorithms”) based on audio, video, and/or text have a number of applications. For example, application developers may want to build nuanced expression awareness and responsiveness into a wide variety of products, such as digital assistants and patient monitoring systems. In order to build such applications, these developers may measure facial, bodily, vocal, and linguistic nonverbal expressions to determine the meanings or social significances expressions (e.g., emotions, sentiments, tonalities, and/or toxicity measures) based on these measurements. For example, a developer may record the speech of an individual, where linguistic verbal and/or nonverbal expressions may be indicative of frustration, e.g., based on the language and prosody of spoken queries to a digital assistant. In one or more examples, the developer may capture or facial and auditory cues indicative of pain in recordings of patients.

[0004] However, the developers will need a system to take these recorded expressions (e.g., of facial, bodily, vocal, and linguistic nonverbal expressions) of an individual to determine the expressions and use the expressions. Additionally, to the extent that the measurements include sensitive data that cannot be transferred from a user’s device, such as recordings of patients that constitute protected health information (PHI), developers may desire a platform that can determine the expressions on a user’s device, locally. Accordingly, there is a need for a platform (e.g., an application programming interfaces (API) or software development kit (SDK)) that can determine measures of facial, bodily, vocal, and/or linguistic expressions based on recorded audio, video, photos, and/or text.

BRIEF SUMMARY

[0005] Embodiments according to the present disclosure provide application platform that allows developers to easily access and call expression recognition models through API endpoints and build them into downstream product applications like digital assistants. In one or more examples, the platform can include a user interface that allows a user/developer to upload a media content. The UI includes tracking the emotions, sentiments, tonalities, toxicities, user experiences, and well-being measures, associated with different aspects of the media content (e.g., facial expressions, vocal bursts, speech prosody, and language associated with an individual in the media content) and an annotated media content that indicates one or more emotions displayed in the media tracker. In one or more examples, the nonverbal behavior measures tracked by the API and/or associated emotions, sentiments, tonalities, toxicities, user experiences, and well-being measures may be inserted into a generative model, which may generate responses that take into account the inserted measures. In one or more examples, the generative model may use the inserted measures to update its weights or architecture in accordance with specific objectives, such as the reduction of negative emotions. As used herein, the term “developer” may refer to an individual or user that may integrate the APIs or SDKs associated with the platform into a downstream application.

[0006] An exemplary method for identifying changes in expressions over time in a media content comprises: receiving, from a user, the media content corresponding to one or more individuals; displaying the user interface comprising: a media region that presents the media content; and an expression tracking region; predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content; updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content; and annotating the media region of the user interface based on the identified changes in the one or more expressions over time. [0007] An exemplary system for producing a user interface based on identified changes in expressions over time in a media content, comprises: one or more processors; and memory communicatively coupled to the one or more processors and configured to store instructions that when executed by the one or more processors, cause the system to perform a method comprising: receiving, from a user, the media content corresponding to one or more individuals; displaying the user interface comprising: a media region that presents the media content; and an expression tracking region; predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content; updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content; and annotating the media region of the user interface based on the identified changes in the one or more expressions over time.

[0008] A non-transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of one or more electronic devices, cause the electronic devices to perform a method comprising: receiving, from a user, the media content corresponding to one or more individuals; displaying a user interface comprising: a media region that presents the media content; and an expression tracking region; predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content; updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content; and annotating the media region of the user interface based on the identified changes in the one or more expressions over time.

[0009] In some examples, the method further comprises: receiving, via the expression tracking region, a selection of an expression of the one or more expressions; and displaying, based on the selection of the expression, one or more graphical representations of the selected expression, the one or more graphical representations associated with one or more of facial expressions, vocal bursts, vocal prosodies, and language.

[0010] In some examples, the method further comprises: receiving an indication that the user has initiated playback of the media content; and while playing back the media content, overlaying a representation of the one or more expressions on the media content, wherein the representation is associated with a timestamp of the media content. [0011] In some examples, receiving the media content comprises receiving a live stream of data.

[0012] In some examples, the method further comprises: determining whether the media content includes privacy data associated with the one or more individuals; and applying one or more data transformations to anonymize the privacy data associated with the one or more individuals if the media content is determined to include the privacy data.

[0013] In some examples, applying the one or more data transformations is performed prior to receiving the media content.

[0014] In some examples, the method further comprises: estimating an amount of time associated with predicting the one or more expressions of the media content; determining an amount of available processing time associated with the user; and if the amount of available processing time associated with the user exceeds the amount of time associated with predicting the one or more expressions, predicting the one or more expressions.

[0015] In some examples, the method further comprises: estimating an amount of time associated with predicting the one or more expressions of the media content; determining an amount of available processing time associated with the user; and if the amount of available processing time associated with the user is less than the amount of time associated with predicting the one or more expressions, forgoing predicting the one or more expressions.

[0016] In some examples, the media content comprises one or more selected from image data, video data, text data, or audio data.

[0017] In some examples, the one or more expressions comprise one or more emotions, one or more sentiments, one or more tonalities, one or more toxicity measures, or a combination thereof.

[0018] In some examples, the one or more emotions comprise one or more of admiration, adoration, aesthetic appreciation, amusement, anger, annoyance, anxiety, approval, awe, awkwardness, boredom, calmness, compulsion, concentration, confusion, connectedness, contemplation, contempt, contentment, craving, curiosity, delusion, depression, determination, disappointment, disapproval, disgust, disorientation, distaste, distress, dizziness, doubt, dread, ecstasy, elation, embarrassment, empathic pain, entrancement, envy, excitement, fear, frustration, gratitude, grief, guilt, happiness, hopelessness, horror, humor, interest, intimacy, irritability, joy, love, mania, melancholy, mystery, nostalgia, obsession, pain, panic, pride, realization, relief, romance, sadness, sarcasm, satisfaction, self-worth, serenity, seriousness, sexual desire, shame, spirituality, surprise (negative), surprise (positive), sympathy, tension, tiredness, trauma, triumph, warmth, and wonder.

[0019] In some examples, the one or more sentiments comprise one or more of positivity, negativity, liking, disliking, preference, loyalty, customer satisfaction, and willingness to recommend.

[0020] In some examples, the one or more toxicity measures comprise one or more of bigotry, bullying, criminality, harassment, hate speech, inciting violence, insult, intimidation, microaggression, obscenity, profanity, threat, and trolling.

[0021] In some examples, the one or more tonalites comprise one or more of sarcasm and politeness.

[0022] In some examples, the user interface further comprises an expression visualizer.

[0023] In some examples, the expression visualizer comprises a graphical representation of an embedding space.

[0024] In some examples, the embedding space comprises a static background, wherein the static background comprises a plurality of regions corresponding to a plurality of different expressions that the one or more neural networks are trained to predict. In some examples, the method further comprises displaying a dynamic overlay on the static background, wherein the dynamic overlay comprises a visualization of a plurality of embeddings representing the predicted one or more expressions associated with the one or more individuals based on the media content.

[0025] In some examples, an embedding of the plurality of embeddings is displayed at a region of the plurality of regions of the static background based on a predicted expression the embedding represents.

[0026] In some examples, the method further comprises: generating, using a generative machine learning model, at least one of new media data and text data, based on the predicted one or more expressions associated with the one or more individuals based on the media content; and displaying the generated at least one of new media data and text data.

[0027] In some embodiments, any one or more of the characteristics of any one or more of the systems, methods, and/or computer-readable storage mediums recited above may be combined, in whole or in part, with one another and/or with any other features or characteristics described elsewhere herein.

DESCRIPTION OF THE FIGURES

[0028] FIG. 1 illustrates an exemplary process for obtaining training data for machinelearning algorithms, according to embodiments of this disclosure.

[0029] FIG. 2 illustrates an exemplary system for online experimental data collection, in accordance with some embodiments of this disclosure.

[0030] FIG. 3 illustrates an exemplary process for obtaining training data for machinelearning algorithms, in accordance with some embodiments of this disclosure.

[0031] FIG. 4 illustrates an exemplary consent form, in accordance with some embodiments of this disclosure.

[0032] FIG. 5 illustrates an exemplary consent form, in accordance with some embodiments of this disclosure.

[0033] FIG. 6 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0034] FIG. 7 illustrates an exemplary questionnaire, in accordance with some embodiments of this disclosure.

[0035] FIG. 8 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0036] FIG. 9 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0037] FIG. 10 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure. [0038] FIG. 11 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0039] FIG. 12 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0040] FIG. 13A illustrates an exemplary flowchart for predicting an emotional rating based on a stimulus input, in accordance with some embodiments of this disclosure.

[0041] FIG. 13B illustrates an exemplary flowchart for generating an expression based on a stimulus input, in accordance with some embodiments of this disclosure.

[0042] FIG. 14 illustrates an exemplary diagram for a process, in accordance with some embodiments of this disclosure.

[0043] FIG. 15A illustrates an exemplary plot that shows a distributions of emotion ratings, in accordance with some embodiments of this disclosure.

[0044] FIG. 15B illustrates an exemplary visualization of the dimensions of facial expression, in accordance with some embodiments of this disclosure.

[0045] FIG. 16 illustrates an exemplary plot that shows the loading correlations across countries and dimensions of facial expression, in accordance with some embodiments of this disclosure.

[0046] FIGs. 17A-17C illustrates the distribution of facial expressions along various dimensions found to have distinct meanings across cultures, in accordance with some embodiments of this disclosure.

[0047] FIG. 18 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0048] FIG. 19 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0049] FIG. 20A illustrates an exemplary process flow, in accordance with some embodiments of this disclosure. [0050] FIG. 20B illustrates exemplary inputs at a terminal window associated with the process flow of FIG. 20A, in accordance with some embodiments of this disclosure.

[0051] FIG. 20C illustrates exemplary inputs at a terminal window associated with the process flow of FIG. 20A, in accordance with some embodiments of this disclosure.

[0052] FIG. 20D illustrates exemplary inputs at a terminal window associated with the process flow of FIG. 20A, in accordance with some embodiments of this disclosure.

[0053] FIG. 20E illustrates exemplary predictions displayed at a terminal window associated with the process flow of FIG. 20A, in accordance with some embodiments of this disclosure.

[0054] FIG. 21 illustrates an exemplary process flow, in accordance with some embodiments of this disclosure.

[0055] FIG. 22 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0056] FIG. 23 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0057] FIG. 24 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0058] FIG. 25 illustrates an exemplary process flow, in accordance with some embodiments of this disclosure.

[0059] FIG. 26 illustrates an exemplary process flow, in accordance with some embodiments of this disclosure.

[0060] FIG. 27 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0061] FIG. 28 illustrates an exemplary process flow, in accordance with some embodiments of this disclosure.

[0062] FIG. 29A illustrates an exemplary process flow, in accordance with some embodiments of this disclosure. [0063] FIG. 29B illustrates exemplary inputs at a terminal window associated with the process flow of FIG. 29A, in accordance with some embodiments of this disclosure.

[0064] FIG. 29C illustrates exemplary inputs at a terminal window associated with the process flow of FIG. 29A, in accordance with some embodiments of this disclosure.

[0065] FIG. 29D illustrates exemplary predictions displayed at a terminal window associated with the process flow of FIG. 29A, in accordance with some embodiments of this disclosure.

[0066] FIG. 30 illustrates an exemplary electronic device, in accordance with embodiments of this disclosure.

[0067] FIG. 31 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0068] FIG. 32 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0069] FIG. 33 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0070] FIG. 34 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0071] FIG. 35 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0072] FIG. 36 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0073] FIG. 37 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0074] FIG. 38 illustrates an exemplary user interface, in accordance with some embodiments of this disclosure.

[0075] FIG. 39 illustrates an exemplary neural network architecture, in accordance with some embodiments of the disclosure. DETAILED DESCRIPTION

[0076] Systems in accordance with embodiments of this disclosure may include one or more platforms, and/or one or more portals that utilize an artificial intelligence (“Al”) system for measuring emotional expression or generating better outputs by using emotional expressions as forms of feedback (“empathic Al algorithms”).

[0077] Current Al systems for the measurement of emotional expression capture only a fraction of the information that is intentionally or unintentionally conveyed by emotional expressions to perceivers. One of the challenges to training better empathic Al algorithms (e.g., machine-learning algorithms that predict participant emotion-related behavioral responses) is the quality and quantity of training data. Due to data limitations, empathic Al algorithms currently fail to recognize many dimensions of emotional expression and suffer from perceptual biases. As a result, generative Al models have been unable to utilize emotional expression as feedback during training.

[0078] For example, conventional Al systems for measuring emotional expressions are limited by the scope and generalizability of the training data. In images drawn from public sources, expressions such as the neutral face or posed smile are dramatically overrepresented, while expressions of genuine fear, anger, and many other emotions are very sparse. In datasets drawn from academia, the focus is generally on posed stereotypical facial expressions of six emotions (anger, disgust fear, happiness, sadness, and surprise), which represent only a tiny fraction of the emotional expressions found in everyday life. Consequently, machine-learning algorithms (e.g., empathic ML/ Al algorithms) trained on these data do not generalize well to most real-life samples.

[0079] Additionally, conventional Al systems for measuring emotional expressions are limited by perceptual biases in training data. Empathic Al algorithms inherit perceptual biases from training data. For instance, in ratings of images drawn from public sources, people with sunglasses are typically perceived as expressing pride and labeled (sometimes incorrectly) as expressing pride. As a result, algorithms trained on perceptual ratings of natural images may label people wearing sunglasses as expressing pride. Moreover, algorithms are biased by demographic imbalances in the expression of specific emotions within public sources of data. For example, young men are more often found expressing triumph than women or older individuals, due to the disproportionate representation of young men playing sports. By contrast, academic datasets attempt to represent people of different demographics expressing the same emotions, but as noted above these datasets are generally very small and focus on a narrow range of emotional expressions.

[0080] In addition to the novel accuracy of the empathic Al models, the design of the platform disclosed herein overcomes a number of challenges in serving these models, including the challenge of generating many outputs in parallel, tracking individuals in the recordings over time, visualizing complex outputs, real-time inference, and adapting the outputs to custom metrics, such as mental health and wellness predictions, as described further throughout.

[0081] Embodiments of the present disclosure provide one or more platforms for providing a user (e.g., developer) access to empathic Al algorithms. The platform(s) may refer to a series of backend services, frontend applications and developer SDKs that will allow developers to submit various types of data (e.g., media data such as photos, videos, audio, and text). This data will be evaluated for content related to expressive behaviors and the results will be returned to the developer. The data may be evaluated using the empathic Al algorithms, which may include facial expression models, speech prosody models, vocal burst models, and/or language models. Speech prosody may include nonverbal vocal expressions such as rhythm, melody, intonation, loudness, tempo, etc. and may convey emotional context, sarcasm, or other meaning beyond the superficial meaning of words or structured language. Vocal bursts may include nonverbal expressions such as laughs, sighs, groans, or other expressions that convey various emotions or intentions without the use of words or structured language.

[0082] To build systems that learn from and respond to nonverbal expressions, the developer may desire a platform that integrates measures of nonverbal expressions into generative Al models (e.g., large language models), and/or that trains the generative Al models using the nonverbal expressions in any recorded data, such as conversations between speakers and listeners, or specifically based on expressions responding to the outputs of a generative Al model. For instance, in some embodiments, a developer may train an Al model, such as a large language model, to generate responses that reduce the rate of users’ (e.g., patients) negative expressions (e.g., frustration, pain) and increase the rate of positive expressions (e.g., satisfaction, contentment) over variable periods of time (e.g., seconds, minutes, hours, or days). [0083] In one or more examples, the one or more portals may refer to a web application that includes the frontend components and/or services that a developer may use. In one or more examples, the portal can include a user interface that includes playground functions, user management functions, and API management functions. The playground interface solves the challenge of visualizing complex, correlated emotional expression measures by generating novel timeline visualizations that overlay the most frequently occurring expression measures, as well as novel embedding visualizations that enable a large number of samples to be compared simultaneously across a large number of expression dimensions (for instance, as shown in Fig. 19).

[0084] The following description is presented to enable a person of ordinary skill in the art to make and use the various embodiments. Descriptions of specific devices, techniques, and applications are provided only as examples. Various modifications to the examples described herein will be readily apparent to those of ordinary skill in the art, and the general principles defined herein may be applied to other examples and applications without departing from the spirit and scope of the various embodiments. Thus, the various embodiments are not intended to be limited to the examples described herein and shown, but are to be accorded the scope consistent with the claims.

[0085] Although the following description uses terms “first,” “second,” etc. to describe various elements, these elements should not be limited by the terms. These terms are only used to distinguish one element from another. For example, a first graphical representation could be termed a second graphical representation, and, similarly, a second graphical representation could be termed a first graphical representation, without departing from the scope of the various described embodiments. The first graphical representation and the second graphical representation are both graphical representations, but they are not the same graphical representation.

[0086] The terminology used in the description of the various described embodiments herein is for the purpose of describing particular embodiments only and is not intended to be limiting. As used in the description of the various described embodiments and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.

[0087] The term “if’ is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event],” depending on the context.

EMPATHIC Al MODEL AND TRAINING

[0088] FIG. 1 illustrates an exemplary process 100 for obtaining training data for machine-learning algorithms according to embodiments of this disclosure. Process 100 is performed, for example, using one or more electronic devices implementing a software platform. In some examples, process 100 is performed using a client-server system, and the blocks of process 100 are divided up in any manner between the server and a client device. In other examples, the blocks of process 100 are divided up between the server and multiple client devices. In other examples, process 100 is performed using only a client device or only multiple client devices. In process 100, some blocks are, optionally, combined, the order of some blocks is, optionally, changed, and some blocks are, optionally, omitted. In some examples, additional steps may be performed in combination with the process 100. Accordingly, the operations as illustrated (and described in greater detail below) are exemplary by nature and, as such, should not be viewed as limiting.

[0089] At Block 101, an exemplary system (e.g., an application executed on one or more electronic devices) can obtain an experimental stimulus and/or task from a database of stimuli and/or tasks and present the experimental stimulus to a participant. In one or more examples, the task may be randomly selected form the database. The system can then, prompt the participant to behave in a manner that depends on how they perceive or make sense of the stimulus or task. In some embodiments, the stimulus may include a textual description of the emotion to be acted out rather than media data, e.g., a photo or video. For example, participants may be asked to behave based on a textual instruction (e.g., “produce a facial expression that shows surprise”).

[0090] In one or more examples, the participant may be reminded that their compensation will depend on a comparison between ratings or measures behavior and ratings or measures of the stimulus. In this manner the system can provide a motivation to the participant to provide an accurate response to the prompt.

[0091] At Block 103, the exemplary system can receive a mimicry response and rating of the experimental stimulus from the user. In some embodiments, the system can receive a participant response upon receiving an indication that the participant is ready, e.g., the system can receive the participant response while or after receiving an indication that the participant pressed a button.

[0092] In one or more examples, before or after receiving the behavioral response recorded by the participant using the recording devices (e.g., webcam and/or microphone), the system can receive a rating of the experimental stimulus from the participant. In one or more examples, the system can present the user with buttons, sliders, text input, and/or other forms of input such as a free response (e.g., a textual response describing the stimulus), for the participant provide a response corresponding to a rating of the experimental stimulus. In some embodiments, the system may not present a ratings survey to the participant to enable the participant to rate the stimulus.

[0093] In some embodiments, the system can perform validation of the recorded behavioral response of the user. In some embodiments, the validation can be based upon one or more predefined thresholds. For instance, if a predetermined parameter (e.g., file size, duration, and/or signal intensity) of the recording is not within a predefined threshold, the system can indicate that the recording is not accepted. In some embodiments, a user may not proceed with the data collection process if the recording is not validated.

[0094] In some embodiments, the validation can be performed based on other participants’ ratings. For instance, a participant’s recorded behavioral responses may be used as stimuli in surveys responded to by other participants. The other participants’ responses may be used to validate the original recordings for use as machine-learning training data as well as to determine the original participants’ compensation. [0095] As an example, the system can present a first stimulus to a first participant, receive a selection of a first set of emotion tags and corresponding emotion intensities associated with the first stimulus, and capture a first recording by the first participant mimicking the first stimulus. The first recording of the first participant may later be presented as a second stimulus to a second participant. Accordingly, the system can present the second stimulus to the second participant and receive a selection of a second set of emotion tags and corresponding emotion intensities associated with the second stimulus. If the second set of emotion tags and intensities associated with the first recording/second stimulus are identical or similar to the first set of emotion tags and intensities associated with the first stimulus, the first recording may be validated. Measures of similarity for purpose of validating the first recording may include a percentage overlap in emotion tags, the correlation (such as the Pearson product-moment correlation coefficient) between emotion intensities associated the first recording and those associated with the first stimulus, or the inverse of the distance (such as Euclidean distance) between emotion intensities associated the first recording and those associated with the first stimulus.

[0096] At Block 105, the system can activate the recording device located on a participant’s computer or personal device. In some embodiments, the recording device can be either activated immediately upon the execution of the experimental trial or activated through a button that the participant is told to press when ready to begin their response. The recording device may either continuously record during the full duration of the trial, record only one image or for a fixed duration of time, or record until some event occurs, such as the cessation of a video stimulus. As discussed above with respect to Block 103, the recording may capture a mimicry response of the participant to an experimental stimulus.

[0097] At Block 107, the system can transfer the recorded data and survey data to a server. In some embodiments, participants recorded behaviors, metadata such as current time and the filename of the stimulus, and/or ratings of the stimulus can be transferred through the Internet to a data storage server.

[0098] While FIG. 1 has been described with respect to emotions, a skilled artisan will understand that the stimulus can be associated with various expressive behaviors including, but not limited to, emotions, sentiments, tonalities, or toxicities. In one or more examples, sentiments can include, but is not limited to, liking, disliking, positivity, negativity, loyalty, customer satisfaction, and willingness to recommend. In one or more examples, tonalities can include, but is not limited to, sarcasm, politeness, and the like. In one or more examples, toxicity measures can include, but is not limited to bigotry, bullying, criminality, harassment, hate speech, inciting violence, insult, intimidation, microaggression, obscenity, profanity, threat, and trolling, and the like.

[0099] FIG. 2 illustrates an exemplary system 200 for obtaining training data for machine-learning algorithms according to embodiments of this disclosure. The exemplary system can include a personal computing device 203, a local recording device 201, a data collection application 209, the internet 205, and a data storage server 207. In one or more examples, the exemplary system for obtaining training data for machine-learning algorithms can optionally, omit and/or combine one or more of these elements.

[0100] In some embodiments, the personal computing device 203 can correspond to the device through which a participant accesses the data collection application 209 via the Internet 205. In one or more examples, the personal computing device 203 can include, but is not limited to a laptop or a smartphone.

[0101] In some embodiments, the personal computing device 203 can be equipped with or connected to local recordings devices 201. In one or more examples, the local recordings devices 201 can include, but are not limited to webcams, and/or microphones, that are accessed by the data collection application 209.

[0102] In some embodiments, the data collection application 209, can be in the form of a website, web application, desktop application, or mobile application. In one or more examples, the data collection application 209 can present participants with instructions, consent processes, and experimental trials. In one or more examples, the data collection application 209 can collect data based on participants’ responses, e.g., participant’s mimicry responses and survey responses.

[0103] In some embodiments, the data collected using the data collection application 209 can be transferred via the Internet 205 to a data storage server 207. The data may be stored in a file server or database with associations between participants’ self-reported demographic and personality data, recordings (media files), metadata, and labels. Self-reported demographic and personality data may include gender, age, race/ethnicity, country of origin, first and/or second language, and answers to psychological survey questions that indicate personality, well-being, mental health, and/or subjective socioeconomic status. Recordings may include images, audio, or video of participants’ behaviors in response to experimental stimuli and/or tasks. Metadata may include the participant identifier, upload time, survey identifier, stimulus filename and/or task description, country /location in which data collection application was accessed, experimental trial number, and/or trial duration. Labels may include emotions selected from a list by the participant and/or intensity ratings provided for each emotion, text input, and/or answers to questions provided in the form of Likert scale ratings.

[0104] FIG. 3 illustrates an exemplary process 300 for obtaining data training data to be used for training machine-learning algorithms according to embodiments of this disclosure. Process 300 is performed, for example, using one or more electronic devices implementing a software platform. In some examples, process 300 is performed using a client-server system, and the blocks of process 300 are divided up in any manner between the server and a client device. In other examples, the blocks of process 300 are divided up between the server and multiple client devices. In other examples, process 300 is performed using only a client device or only multiple client devices. In process 300, some blocks are, optionally, combined, the order of some blocks is, optionally, changed, and some blocks are, optionally, omitted. In some examples, additional steps may be performed in combination with the process 300. Accordingly, the operations as illustrated (and described in greater detail below) are exemplary by nature and, as such, should not be viewed as limiting.

[0105] At Block 301, the system can receive an indication from a participant to open an application corresponding to the data collection application used to obtain the training data from the participant. In one or more examples, the system can prompt the participant to download the application. Once the application is downloaded, the system can receive an indication from the participant to open the application. In one or more examples, the data collection application can be a website, a web application, a desktop application, or a mobile application.

[0106] At Block 303, the system can present instructions for completing the survey. In one or more examples, the system can prompt the participant to provide informed consent to participate. FIG. 4 illustrates an exemplary consent form 400 provided to a user. In some embodiments, the system receives a completed consent form prior to receiving a recording from the participant. [0107] In one or more examples, the system can further obtain consent from a participant regarding data to being collected using local recording devices, data transferred to a data storage server, and data shared with third parties and/or used to train empathic Al algorithms. FIG. 5 illustrates an exemplary audio release form 500 provided to a user. In some embodiments, the system receives a completed consent form prior to receiving a recording from the participant.

[0108] In some cases, the instructions explain that the participant will receive performance-based compensation in the form of bonuses that depend on a comparison between ratings or measures of participant recorded behaviors and ratings (e.g., measures) of the original stimulus.

[0109] At Block 305, the system can activate local recording devices, e.g., a webcam or microphone, of the participant’s device. In one or more examples, the local recording device can be activated either immediately after consent is obtained or after the participant provides subsequent permission to activate the device by clicking on a designated button. In some examples, the system receives a “test” recording captured by the local device, which is validated automatically by the data collection application (based on file size, duration, and/or signal intensity) and/or upon inspection by the participant.

[0110] FIG. 6 illustrates an exemplary recording user interface 600 according to one or more embodiments of this disclosure. The exemplary recording user interface can be presented on a user device, e.g., a mobile device, tablet, laptop, etc. The exemplary recording user interface can include instructions 601, a microphone control 603, a recording user interface control 605, and a progression control 607. The microphone control 603 can be activated via a user input, e.g., click, to turn on the microphone of the mobile device. The recording user interface control 605 can be activated via a user input to capture a recording of the user via the microphone. In some examples, this UI can be used to capture a “test” recording by the user, for example, to ensure that the capture device is functioning properly. The progression control 607 can be activated via a user input to indicate that the user has completed a recording and/or advance to a subsequent UI. Although the figure is shown in relation to a microphone control, a skilled artisan will understand the user interface can include other capture devices, e.g., camera, that can be activated to capture a recording, according to embodiments of this disclosure. Further, this particular configuration of the UI is merely exemplary and other configurations can be implemented without departing from the scope of this disclosure.

[0111] At Block 307, the system can obtain answers to a demographic and personality questionnaire completed by the participant. In some embodiments, a participant can respond to a questionnaire collecting demographic and personality data, such as gender, age, race/ethnicity, country of origin, first and/or second language, and answers to psychological survey questions that indicate personality, well-being, mental health, and/or subjective socioeconomic status. This data may later be used to test for bias in trained machine-learning models, to calibrate trained machine-learning models in order to remove bias, and/or to incorporate methods of personalization into trained machine-learning models.

[0112] FIG. 7 illustrates an exemplary demographic and personality questionnaire 700 according to one or more embodiments of this disclosure. As shown in the figure, the demographic and personality questionnaire can include one or more prompts and/or questions and a corresponding user-selection control. In some embodiments, the demographic and personality questionnaire can receive one or more selections, e.g., via the user-selection control, indicative of one or more characteristics of a user.

[0113] At Block 309, the system can present stimuli (e.g., an audio file, an image of a facial expression, etc.) and/or tasks (e.g., “imitate the facial expression”, “act like a person who just won the lottery”) to the participant in a pseudorandom order. The system can also prompt the participant to behave in a manner that depends on what they perceive in, or how they make sense of, the stimuli and/or tasks. As the participant responds to the stimuli and/or tasks, the system can activate local recording devices to record the participant’s behavior.

The stimuli and/or tasks may be selected to span dimensions of emotion identified in the psychology literature as well as newly hypothesized dimensions of emotion. For example, stimuli such as images, audio, and/or video have been used to evaluate emotions such as sexual desire, aesthetic appreciation, entrancement, disgust, amusement, fear, anxiety, interest, surprise, joy, horror, adoration, calmness, romance, awe, nostalgia, craving, empathic pain, relief, awkwardness, excitement, sadness, boredom, triumph, admiration, satisfaction, sympathy, anger, confusion, disappointment, pride, envy, contempt, and guilt. For additional examples please refer to Cowen & Keltner, 2020, Trends in Cognitive Sciences, which is herein incorporated by reference. [0114] In some embodiments, the system can present the participant with annotations of their recorded behaviors. In one or more examples a machine-learning model can be used to annotate the recorded behaviors. The system can also prompt the participant to generate behaviors that will meet a specific annotation criterion. For example, participants may be asked to behave such that the annotations include a specific emotion label (e.g., “produce a facial expression that is labeled surprise”), or participants may be asked to behave in such a way that the annotations produced by the machine-learning appear to be incorrect (“produce a facial expression that appears to be incorrectly labeled by the model”).

[0115] FIG. 8 illustrates an exemplary mimicry user interface (UI) 800 according to one or more embodiments of this disclosure. In one or more examples, the mimicry UI 800 can be presented to a participant at Block 309. The exemplary mimicry UI 800 can include instructions 801, a playback control 803, a recording user interface control 805, and a plurality of predefined emotion tags 807. The playback control 803 can be activated via a user input, e.g., click, to play a media content, e.g., audio recording, video recording, image, etc. The recording user interface control 805 can be activated via a user input to capture a recording of the user via a microphone and/or camera of the user’s device. The predefined emotion tags 807 can include a plurality of emotions that can be selected by a user and associated with the media content and/or recording as ratings. Although FIG. 8 is shown in relation to an audio content, a skilled artisan will understand the user interface can include other media content including, video content and one or more images, according to embodiments of this disclosure. Accordingly, the mimicry UI 800 can be be provided to enable a participant to playback an experimental stimulus, record an imitation of the experimental stimulus and select one or more predefined emotion tags to characterize the experimental stimulus.

[0116] FIG. 9 illustrates an exemplary mimicry user interface (UI) 900 according to one or more embodiments of this disclosure. As discussed with respect to the FIG. 8, the exemplary mimicry UI can include instructions 901, a media content playback control 903, a recording user interface control 905, and a plurality of predefined emotion tags 907. As shown in mimicry UI 900, the UI can further include a recording playback control 909 and an emotion scale 911 corresponding to the selected emotions. The recording playback control 909 can be activated via a participant input to play a recording captured by the participant. For example, the system can receive an indication from the participant to playback the recording to determine if the recording is satisfactory. If the participant determines the recording is not satisfactory, the participant may capture a second recording using the recording user interface control 905. The emotion scale 911 can be provided for a participant to indicate a level of intensity of the emotion corresponding to the selected emotion tags associated with the media content.

[0117] FIG. 10 illustrates an exemplary mimicry user interface (UI) 1000 according to one or more embodiments of this disclosure. The exemplary mimicry UI 1000 can include instructions 1001, a media content 1003, a recording user interface control 1005, a recording 1009, and a plurality of predefined emotion tags 1007. The media content 1003 can be provided to the participant as a stimuli, whereby the participant attempts to imitate the media content 1003. The recording user interface control 1005 can be activated via a user input to capture a recording of the participant via a microphone and/or camera of the participant’ s device. In some embodiments, the recording user interface control 1005 can provide a preview of the data captured by a capture device. The recording 1009 can correspond to an image captured by the image capture device of the participant’s device. As discussed above, the predefined emotion tags 1007 can include a plurality of emotions that can be selected by a user and associated with the media content and/or recording. Although exemplary mimicry UI 1000 is shown in relation to image content, a skilled artisan will understand the user interface is not limited to image capture and can include other media content including, audio and/or video.

[0118] FIG. 11 illustrates exemplary feedback UI 1100 according to one or more embodiments of this disclosure. The exemplary feedback UI 1100 can be presented to a participant to collect participant feedback regarding the data collection process.

[0119] FIG. 12 illustrates illustrates an exemplary user interface (UI) 1200 according to one or more embodiments of this disclosure. The exemplary user interface 1200 can be used to provide a completion code to a user for their personal records and for obtaining compensation.

[0120] At Block 311, the system can transfer the recorded data and survey data to a storage server. In one or more examples, the data can be transferred as participants undergo the experimental mimicry trials or upon completion of the survey. [0121] At Block 313, the system can determine supplementary compensation based on a subsequent comparison between ratings or measures of participants’ behavioral responses and ratings or measures of the original stimuli. Compensation can be provided to participants through a payment platform. For instance, a participant’s recorded behavioral responses may be used as stimuli in surveys responded to by other participants. The other participants’ responses may be used to determine the original participants’ compensation.

[0122] As an example, the system can present a first set of stimuli to a first participant, receive a selection of a first set of emotion tags and corresponding emotion intensities associated with the first set of stimuli, and capture a first set of recordings by the first participant mimicking the first set of stimuli. The first set of recordings of the first participant may later be presented as a second set of stimuli to a second participant.

[0123] Accordingly, the system can present the second set of stimuli to the second participant and receive a selection of a second set of emotion tags and corresponding emotion intensities associated with the second set of stimuli. To the extent that the second set of emotion tags and intensities associated with the first recording/second stimulus are similar to the first set of emotion tags and intensities associated with the first stimulus, the participant can be rewarded a higher performance-based compensation. Measures of similarity for purpose of determining the performance-based compensation may include the percentage overlap in emotion tags, the correlation (such as the Pearson product-moment correlation coefficient) between emotion intensities associated the first recording and those associated with the first stimulus, or the inverse of the distance (such as Euclidean distance) between emotion intensities associated the first recording and those associated with the first stimulus.

[0124] In some embodiments, the system can determine the performance-based compensation based on a competition between multiple participants. For instance, the participant whose recordings are determined to be the most similar to the original stimuli may receive a reward. The performance-based compensation may also be informed by automated measures of a participant’s responses by a machine-learning algorithm. For instance, participants who are presented with machine-leaming-based annotations of their recorded behaviors during the experimental mimicry trials, and who are asked to behave in such a way that the annotations “appear to be incorrect”, may be rewarded based on the degree to which the machine-learning-based annotations of their recorded behavioral response differ from other participants’ ratings of their recorded behavioral response. [0125] FIGs. 13A and 13B illustrates an exemplary processes 1300A and 1300B for training machine-learning algorithms according to embodiments of this disclosure. Processes 1300A and 1300B are performed, for example, using one or more electronic devices implementing a software platform. In some examples, processes 1300A and 1300B are performed using a client-server system, and the blocks of processes 1300A and 1300B are divided up in any manner between the server and a client device. In other examples, the blocks of processes 1300A and 1300B are divided up between the server and multiple client devices. In other examples, processes 1300A and 1300B are performed using only a client device or only multiple client devices. In processes 1300A and 1300B, some blocks are, optionally, combined, the order of some blocks is, optionally, changed, and some blocks are, optionally, omitted. In some examples, additional steps may be performed in combination with the processes 1300A and 1300B. Accordingly, the operations as illustrated (and described in greater detail below) are exemplary by nature and, as such, should not be viewed as limiting.

[0126] FIG. 13A illustrates a process for predicting an emotional rating based on a stimulus input. As shown in the figure, the stimulus input can include one or more of a recorded vocal expression, an image of a facial and/or body expression, a recorded speech sample, or a recorded multimodal video. In one or more examples, these stimulus inputs can be input into a trained machine-learning model to predict a reported emotional rating or a perceived emotional rating. In one or more examples, the stimulus inputs can be input into a trained machine-learning model to predict similarities or differences between emotional experiences.

[0127] For example, in some embodiments, the data collected using the data collection application, e.g., data collection application 309, can be used to train empathic Al algorithms that predict participant emotion-related behavioral responses (ratings and recorded responses) from the properties of an experimental stimulus or task, from participants’ other responses to the stimulus or task, or to make comparisons between different participants’ responses to the same experimental stimulus or task.

[0128] In some embodiments, an algorithm is trained to predict emotion(s) based on image data (e.g., an image, a video) in which a person is producing a facial expression. The training data can comprise a plurality of images. Each training image can be a participant’s imitation recording and is labeled with the participant’s ratings of emotion. The predicted emotion(s) are consequently the emotions that the person would attribute to the expression they have produced. Other algorithms may be similarly trained to predict the emotions a person would attribute to their own speech, nonverbal vocalization (e.g., laugh or sigh), or the combination of facial, bodily, and/or vocal expressive behaviors captured in video.

[0129] Such empathic Al algorithms can provide an unbiased measure of user-generated emotional expressions that are useful for a wide range of downstream applications. For instance, such Al algorithms can be used in mental health diagnosis or treatment applications, wherein it is critical to obtain unbiased measures of the emotions a person experiences or conveys during a therapy session or in other contexts. For example, a therapy session and/or a medical appointment can be recorded. The recorded media can be provided to the empathic Al algorithm. Based on the recorded media, including one or more patient emotional expressions, the empathic Al algorithm may predict one or more emotions expressed by the patient during the therapy session and/or medical appointment. The predicted emotions can be used to supplement the physician’s diagnosis and/or treatment.

[0130] Empathic Al algorithms that provide an unbiased measure of user-generated emotional expressions can also be used to build digital assistants that respond appropriately to the user queries provided through audio or video recording devices, wherein unbiased measures of emotional expression are critical to understanding the implicit meaning of the query and generating a satisfactory response. For example, a user may pose a spoken query to a digital assistant. The digital assistant may use an empathic Al algorithm to predict (infer) the user’s intended or implicit emotional intonation from the nonverbal aspects (speech prosody) of the recorded query. By considering the user’s inferred emotional intonation, the digital assistant may generate a more context- appropriate response than would be afforded by the lexical (language) content of the query alone. For instance, if the inferred emotional intonation is one of frustration, and the user’s query follows from a previous response by the digital assistant, then the digital assistant may deduce that the previous response was unsatisfactory, and it may consequently generate a new response that strongly differs from the previous response.

[0131] Empathic Al algorithms that provide an unbiased measure of user-generated emotional expressions can also be used in augmented or virtual reality applications, wherein emotional expressions are transferred onto virtual characters or used to create emotionally responsive interfaces or environments. For example, a model can be trained to predict emotion(s) based on a recording of a user and the system can create/modify virtual characters, or aspects of an AR/VR environment, based on the predicted emotion(s).

[0132] In some embodiments, the data collected using the data collection application 309 can also be used to train algorithms that predict the emotions a perceiver would attribute to an emotional expression of another individual, rather than the emotions someone would attribute to their own expression. The training data can comprise a plurality of images. Each training image can be an image that is presented to a participant (e.g., the stimuli) and is labeled with the participant’s ratings of emotion. In some embodiments, such algorithms can be implemented in application areas such as the development of digital assistants that produce facial or vocal expressions that are maximally understandable to a particular user. For example, a user may pose a query to a digital assistant. In response, the digital assistant may generate a response that includes one or more emotion-based expressions using an empathic Al algorithm that is trained to produce emotionally intoned speech from text (z.e., a generative model of emotional intoned speech). In order to enhance the user’s understanding of the digital assistant’s response, the generative model may incorporate background information on the user in order to predict the user’s individual propensity to perceive specific emotions in various patterns of speech prosody. The response provided to the user may then be calibrated based upon these predictions. For instance, if the user expresses frustration in their query, the digital assistant may elect to respond using an emotional intonation that the user would be likely perceived as apologetic.

[0133] In some embodiments, in addition to predicting ratings or labels of emotional expressions, the data collected using the data collection application can also be used to train empathic Al algorithms that compute similarities or differences between emotional experiences in a manner that does not rely on ratings or labels. For instance, an algorithm can be trained to predict whether a given facial expression is an imitation of another facial expression (e.g., a classification, regression, or clustering/embedding model that receives a given facial expression and determines whether it is an imitation of a predefined facial expression), or whether two different facial expressions are imitations of the same facial expression (e.g., a classification, regression, or clustering/embedding model that receives two facial expressions and determine if they are imitations of the same expression).

[0134] In some embodiments, algorithms could identify subtle qualities emotional expressions that cannot easily be verbalized or extracted from human ratings. This may be preferable in application areas where it is important to account for relatively subtle and difficult to verbalize distinctions between expressions, such as in digital coaching for actors, wherein a user may be tasked with reproducing a targeted emotional expression. For example, a user may be provided with a prompt configured to solicit an expressive response from the user. For instance, the user may be asked to imitate a short video or audio clip presenting a subtle emotional expression. The user may provide an expressive response based on the prompt that can be fed into the empathic Al algorithm. The Al algorithm may be configured to receive the user’s response and determine whether the user’s response is an accurate imitation of the original emotional expression. The system can provide feedback to the user accordingly.

[0135] FIG. 13B illustrates a process for generating an emotional expression based on a stimulus input and an emotional rating. As shown in the figure, the stimulus inputs can include one or more of a recorded vocal expression, an image of a facial and/or body expression, a recorded speech sample, a recorded multimodal video, an animation rig, or transcribed text. Additional inputs can include an emotional rating input selected from one or more of a reported emotion rating and a perceived emotional rating. The stimulus input and the emotional rating inputs can be input into a machine-learning model to generate one or more outputs corresponding to a generated emotional expression. As shown in the figure, the generated emotional expression can include one or more of a generated vocal expression, a generated facial and/or body expression, a generated speech sample, and a generated multimodal response.

[0136] In some embodiments, the data collected using the data collection application can also be used to train models that take emotion labels as input and generate facial or vocal expressions. This is useful, for example, in developing digital assistants that produce speech with contextually appropriate emotional intonations.

[0137] In some embodiments, the data collected using the data collection application can also be used to train semi-supervised models. As discussed above, the participant may be asked to provide a recording without providing emotion tags. Recordings collected based on the same stimulus can be used to train the semi-supervised model.

[0138] The machine-learning models described herein include any computer algorithms that improve automatically through experience and by the use of data. The machine-learning models can include supervised models, unsupervised models, semi- supervised models, selfsupervised models, etc. Exemplary machine-learning models include but are not limited to: linear regression, logistic regression, decision tree, SVM, naive Bayes, neural networks, K- Means, random forest, dimensionality reduction algorithms, gradient boosting algorithms, etc.

[0139] In some embodiments, the demographic and personality data collected using the data collection application can be used to test for bias in trained machine-learning models, to calibrate trained machine-learning models in order to remove bias, and/or to incorporate methods of personalization into trained machine-learning models. For example, to test for bias in a trained machine-learning model, the model may be evaluated on data from participants of differing demographic groups to determine the differential effects of emotional expressions within each group on the predictions of the model. More specifically, a model trained to select images to surface to a user in an application (for instance, to preview a video or curate content within a social media feed) may be evaluated on data from participants of different genders. The differential effects of dominant expressions (e.g., anger, pride) and submissive expressions (e.g., embarrassment, gratitude) on the behavior of the model may then be examined within each gender group. One possible outcome, as an illustrative example, may be that the model is 25% more likely to surface images of femalepresenting individuals with submissive expressions than with dominant expressions and 20% more likely to surface images of male -presenting individuals with dominant expressions than submissive expressions. Consequently, if the application is widely adopted, the model may reinforce harmful biases or stereotypes at a large scale, negatively affecting society. To remove this bias, the model may subsequently be calibrated to remove differences between gender groups in its relative treatment of dominant versus submissive expressions.

[0140] In some embodiments, models can be trained using supervised, semi- supervised, or unsupervised methods. As discussed above, when using supervised methods, participants’ ratings are used as labels for the stimuli and/or participants’ recorded behavioral responses. When using semi-supervised or unsupervised methods, links are drawn between stimuli and participants’ responses, or among different participants’ responses to the same stimuli, or among each participants’ responses to different stimuli. EMPATHIC Al MODEL - EXAMPLES

[0141] As discussed above, due to data limitations, empathic Al algorithms currently fail to recognize many dimensions of emotional expression and suffer from perceptual biases. For example, conventional Al systems for measuring emotional expressions are limited by the scope and generalizability of the training data. In images drawn from public sources, expressions such as the neutral face or posed smile are dramatically overrepresented, while expressions of genuine fear, anger, and many other emotions are very sparse. In datasets drawn from academia, the focus is generally on posed stereotypical facial expressions of six emotions (anger, disgust fear, happiness, sadness, and surprise), which represent only a tiny fraction of the emotional expressions found in everyday life. Consequently, machine-learning algorithms (e.g., empathic ML/ Al algorithms) trained on these data do not generalize well to most real-life samples. Additionally, conventional Al systems for measuring emotional expressions are limited by perceptual biases in training data. Moreover, algorithms are biased by demographic imbalances in the expression of specific emotions within public sources of data. For example, academic datasets attempt to represent people of different demographics expressing the same emotions, but as noted above these datasets are generally very small and focus on a narrow range of emotional expressions.

[0142] By using methods and techniques described herein for collecting training data for empathic Al overcomes, these challenges can be overcome by using experimental manipulation to collect data that represents a richer, more balanced, and more diverse set of expressions, avoiding perceptual biases and confounds by gauging participants’ own representations, beliefs, and/or ratings of the meanings of their expressions, and further avoiding perceptual biases and confounds by systematically collecting recordings of a balanced set of emotional expressions from different demographic or cultural groups.

[0143] As an example, according to embodiments of this disclosure, a study was performed using large-scale controlled mimicry-based data to determine the meaning of various facial expressions for tens of thousands of people across six countries. This generated data suitable for both machine-learning and psychological inference. A deep neural network was configured to predict the culture- specific meanings people attributed to their own facial movements while disregarding physical appearance and context discovered 28 dimensions of facial expression with distinct meanings. Based on the collected data (e.g., facial expressions and attributed meanings), the study determined that the dimensions of facial expression were 63% preserved in meaning across the six countries and four languages, with 21 dimensions showing a high degree of universality and the remainder showing subtle to moderate cultural specificity. This is not an exhaustive catalog or taxonomy of distinct emotion concepts or anatomically distinct facial expressions. However, these findings indicate that the distinct meanings that facial expressions can reliably convey in a wide range of countries.

[0144] This study employed an experimental approach to address the limitations of previous large-scale studies of human facial expression. Perceptual ratings of naturally occurring facial expressions are generally confounded by the physical appearance and context of the person making the expression. Here, experimental randomization and mimicry were used to decouple facial movements from these confounds. In addition, while previous algorithms for measuring facial expression have been trained on ratings in a single culture using predetermined taxonomies of expression, and have captured a more limited range of facial expressions. The present study found a broader set of dimensions of facial movement that reliably mapped to culture- specific meanings, and used a deeply inductive approach to explore how these meanings converge and diverge across cultures.

[0145] FIG. 14 is a diagram that illustrates the process 1400 for this experimental study. In the first phase of data collection (henceforth “mimicry phase”), a total of 5,833 participants from China (n = 602; 371 female), Ethiopia (n = 149; 26 female), India (n = 478, 74 female), South Africa (n = 2,131; 970 female), the United States (n = 2,576; 1,346 female), and Venezuela (n = 344; 110 female) completed a facial expression mimicry task (e.g., cross- cultural mimicry task), imitating subsets of 4,659 images of facial expressions (e.g., seed images) and rating what each expression meant to them as they were imitating it (e.g., selfreport judgments).

[0146] The six countries were selected due to being widely diverse in terms of culture- related values — e.g., individualism vs. collectivism, power distance, autonomy — of interest in cross-cultural comparisons. The seed images (e.g., experimental stimulus) included 4,659 images of individuals’ facial expressions, extracted from naturalistic datasets of emotional stimuli, expressive behavior found online using hundreds of search queries for emotion concepts and emotional contexts, and responses to 1,707 emotionally evocative films. Based on past estimates of the reliability of observer judgments of facial expressions, the study collected responses to each seed image from an average of 15.2 separate participants in each culture. This amounted to a total of 423,193 experimental trials with associated mimic images and judgments on the meaning of the expression.

[0147] During the mimicry phase of data collection, study participants were instructed to participate in experimental mimicry trials as discussed above in processes 100 and 300. For example, prior to engaging in the mimicry task, participants were instructed to use their computer webcam to photograph themselves on each trial. During each trial, the system presented the participants with a user interface (e.g., user interfaces 1000) that presented the user with a target facial expression and instructed the participants to mimic the expression in the image such that their imitation would be rated similarly to the original image. This paradigm leverages the ability of most humans to mimic facial expressions (facial mimicry), which is observed early in development and often occurs spontaneously during social interaction.

[0148] On the same survey page, the system prompted participants to determine what they thought the person was feeling by selecting from forty-eight terms for emotions and rating each selection from 1-100, with values reflecting the perceived intensity of the emotion. Participants were prompted to select a value on a rating scale for at least one category. English terms were used in the three out of six countries where English is an official language (India, South Africa, and the United States). In China, ratings were collected in Chinese; in Ethiopia, ratings were collected in Amharic; and in Venezuela, ratings were collected in Spanish. This mimicry phase of data collection resulted in a large number of participant-generated “mimic” images in each culture [China (n = 60,498), Ethiopia (n = 29,773), India (n = 58,054), South Africa (n = 107,364), the United States (n = 170,013), and Venezuela (n = 47,734)] and self-report ratings corresponding to each mimic image that can simultaneously be considered perceptual ratings of each seed image. The mimic images are facial expression stimuli of high psychological validity: while posed, they have corresponding granular self-report judgments of the emotions the person in the image believes others will perceive.

[0149] The study used these stimuli generated in the mimicry phase in the second phase of data collection (henceforth “perceptual judgment phase”), in which an independent set of participants from each culture [China (n = 542; 349 female), Ethiopia (n = 78; 18 female), India (n = 1,101; 507 female), South Africa (n = 2,465; 1,565 female), the United States (n = 3,419; 1,983 female), and Venezuela (n = 352; 118 female)] were recruited to complete an emotion perception task in which they rated mimic images from the first phase of data collection that were generated by participants within the same country. As in the mimicry phase of data collection, participants were asked to judge each face along 48 terms for emotions and provide intensity ratings ranging from 1-100. On average, participants in this phase of the experiment completed 77.1 trials. This amounted to a total of 534,459 judgments of all mimic images. FIG. 15A shows a distributions of emotion ratings for the mimic images from the perceptual judgment phase.

[0150] To identify the cross-cultural dimensions of perceived emotion using judgments of the seed images collected during the mimicry phase of data collection a generalized version (G-PPCA) of a principal preserved components analysis (PPCA) was applied to the collected data.

[0151] PPCA can be used to identify shared dimensions in the latent structure of two datasets measuring the same attributes. Like more established methods such as partial leastsquares correlation analysis (PLSC) and canonical correlation analysis (CCA), PPCA examines the cross-covariance between datasets rather than the variance-covariance matrix within a single dataset. However, whereas PLSC and CCA derive two sets of latent variables, a and p, maximizing Cov(Xai,Ypi] or Corr[Xai,Ypi], PPCA derives only one variable: a. The goal is to find dimensions of perceived emotion that reliably co-vary across both datasets X and Y.

[0152] Given the dataset measuring the same attributes comes from six different cultures in the study, a generalized version of the PPCA algorithm (G-PPCA) that extracts linear combinations of attributes that maximally co-vary across multiple datasets (in this case, emotion judgments from six countries) was developed. In particular, G-PPCA maximizes the objective function Sum(Cov(a*X,a*Y) for X,Y in S) where S is the set of all possible pairwise combinations of datasets. The resulting components are ordered in terms of their level of positive covariance across all six datasets.

[0153] This G-PPCA was applied in a leave-one-stimulus-out manner to extract components from the judgments of all but one stimulus, and then projected each country’s ratings of the left-out stimulus onto the extracted components, resulting in cross-validated component scores for each country and stimulus. [0154] For example, the G-PPCA was applied to judgments of the 4,659 seed images across the six countries. Based on this application, the system identified thirty-one semantic dimensions, or distinct kinds of emotion, preserved across all six cultures in emotion judgments of the seed images as shown in FIG. 15A. This extends prior work showing that perceivers reliably distinguish at least 28 dimensions of meaning in facial expression within a culture and that a high-dimensional semantic space organizes other components of emotional experience and perception across cultures. This work also converges with the high dimensional structure of emotion observed in more top-down studies of emotion production and recognition.

[0155] However, despite the scale of the dataset, the findings from this first analysis could be explained in part by contextual influences on emotion perception such as visual context and demographic factors like gender and culture. In particular, the ratings that went into this analysis could be influenced by the subtle contexts visible in the periphery of each image as well as the demographic characteristics of the individuals forming expressions in the images, rather than just the information conveyed by the underlying facial movements.

[0156] To control for the effects of visual information beyond facial movement, the study trained a deep neural network (DNN) to predict the meanings attributed to facial movements while ignoring physical appearance and context, as discussed above. This permitted a structural taxonomy of expressive facial movement within and across cultures to be derived.

[0157] Referring to process 1400, the data collected from the mimicry phase and the perceptual judgment phase were input into a deep neural network (DNN). The DNN was configured to predict the average emotion judgments of each seed image in each culture from the images of participants mimicking each seed image. Because the seed images were each shown to a random set of participants, this method forced the DNN to ignore the physical appearance and context of the person making the expression (factors such as gender, age, clothing, and lighting that were randomized relative to the expression being imitated). The average emotion judgments within each culture (made in four separate languages) were treated as separate outputs. Thus, the DNN was not provided any prior mapping between emotion concepts and their use across countries or attempted translations across languages (English, Amharic, Chinese, and Spanish). The study used MTCNN to extract the faces from each mimic image, so only the face was used as input to the model. [0158] After training, the study applied the DNN to the seed images (experimental stimulus to participants in the mimicry phase. The DNN was not exposed to these seed images during training. The study also applied a multidimensional reliability analysis method to distill the significant shared and culture- specific dimensions of facial expression uncovered by the DNN. For example, the study applied principal preserved components analysis (PPCA) between the DNN’s culture- specific annotations of the seed images and the emotions actually inferred from the seed images by participants in each culture. Given that no prior was built into the model linking the words from different languages to one another, any relationship uncovered between the emotion concepts across languages using this method implies that the concepts were used similarly to describe the same facial movements.

[0159] For example, to identify dimensions of facial expression captured by the DNN that were reliably associated with distinct meanings in one or more cultures, we applied PPCA between the 288 outputs of the DNN applied to the seed images, which directly measure facial movement in the seed images, and the 288 averaged perceptual judgments of the seed images (ratings of 48 concepts in each of six countries). This analysis captures the dimensions along which country-specific perceptual judgments of naturalistic facial expressions are influenced by facial movement.

[0160] To assess the significance of the dimensions extracted using PPCA, the study used a leave-one-out cross-validation method. Specifically, the study iteratively performed PPCA between the DNN outputs and the averaged perceptual judgments of all but one of the seed images, and computed the scores of each dimension extracted by PPCA on the DNN outputs and averaged perceptual judgments of the held-out images. Finally, the study concatenated and correlated the PPCA scores of the held-out DNN outputs and judgments. To control for non-linear monotonic dependencies between extracted dimensions, the study used partial Spearman correlations, where for each PPCA dimension we controlled for the PPCA scores on all previous dimensions. To determine the significance of each dimension, the study used a bootstrapping method, iteratively repeating the correlation procedure while randomly resampling the seed images (1000 iterations with replacement). P-values were taken as one minus the proportion of times that the correlation exceeded zero across resampling iterations.

[0161] After computing p-values, we used a conservative method of correction for false discovery rate (FDR) that combined multiple FDR-correction methods. Specifically, the study used Benjamini-Hochberg FDR correction across the first 48 PPCA dimensions (as we were interested in variations of 48 potentially distinct emotion concepts and their translations across countries) at an alpha of .05. The study also separately performed a ForwardStop sequential FDR correction procedure. Finally, the study determined the signal-to-noise ratio (SNR) of the correlations corresponding to each PCA dimension (the correlation divided by the standard deviation computed using bootstrapping), and applied a threshold of 3 to the SNR to extract more stable dimensions. Dimensions that met all three of these criteria were retained. The study applied factor rotation using the varimax criterion to these dimensions.

[0162] To assess the significance of the individual loadings of emotion concepts on the extracted dimensions, the study used a bootstrapping method. Specifically, the study performed the entire PPCA analysis repeatedly after resampling the seed images with replacement, extracting the significant dimensions, and performing factor analysis each time. For each dimension, the study then tested the significance of the top N loadings, with N varying from 1 to 288, by determining how often, across resampling the iterations, there existed a dimension with all of these top N loadings pointing in the same direction. This estimates the proportion of times a dimension with these co-loadings would be extracted if we repeated the entire study. The study took one minus this proportion as the p-value. As N varies from 1 to 288, the p-value can only increase because more loadings are included in the test (and therefore the probability of all loadings pointing in the same direction decreases monotonically). For each dimension, the study applied a ForwardStop FDR-correction procedure at an alpha of .05 to determine the number of significant loadings.

[0163] Using this method, (e.g., using both PPCA and DNN), the study identified twentyeight significant dimensions of facial expression that were reliably associated with distinct meanings, as shown in FIG. 15A. As shown in FIG. 15A, the meaning of the 28 dimensions of facial expression that were reliably predicted by the model is captured by loadings on the 48 predicted emotion concepts that people used to judge their own expressions (y-axis) in each of the 6 countries. Each rectangle is composed of 6 squares that represent the 6 countries (as indicated in the bottom left comer). Squares with dark outlines reflect statistically significant correlations between human judgments in that country and DNN model annotations. The model was trained to predict judgments in each country (and language) separately. Thus, when multiple countries share statistically significant loadings on similar concepts, it indicates that the dimension of facial expression has a similar meaning across the countries. [0164] The study determined that each of the twenty-eight dimensions corresponds to a pattern of facial movement that is reliably associated with a distinct set of emotion concepts in at least one country or language. Some facial expressions were found to have shared meanings across all six countries. For instance, the facial expressions corresponding to twelve different emotion concepts in English — “anger,” “boredom,” “concentration,” “disgust,” “fear,” “joy,” “pain,” “sadness,” “sexual desire,” “surprise (positive),” “tiredness,” and “triumph” — were categorized with what was previously determined were their most direct translations across all six countries and four languages. In four other cases, the correspondence was not exact, but very close: expressions that were associated with “contemplation” in some countries were associated with “doubt” in others, as was the case with “love” and “romance,” “satisfaction” and “contentment,” and “surprise (negative)” and the closest Amharic translation of “awe” (which is close in meaning to “surprise” in English). For another five dimensions — “calmness,” “confusion,” “disappointment,” “distress,” and “interest” — loadings were consistent in all six countries but not statistically significant in Ethiopia. Thus, a total of twenty-one dimensions of facial expression showed a high degree of cultural universality in meaning across the 6 countries.

[0165] For some dimensions of facial expression, the study identified subtle cultural differences in meaning. FIG. 15B illustrates the structural dimensions of facial expression that emerged as having distinct meanings within or across cultures. For example, the expression associated with “awkwardness” in three countries was associated with “determination” in Ethiopia and “craving” in Venezuela. The expression associated with “determination” in three countries was associated with “anger” and “joy” elsewhere. For example, a dimension associated with “calmness” and “satisfaction” in most countries (“Y” in FIGs. 15A and 15B) was associated with “realization” in Ethiopia. There were stronger cultural differences in the meaning of the remaining four dimensions (“A,” “G,” “J,” and “V”).

[0166] The study found that twenty-eight dimensions of facial expression — facial movements found to have reliable meanings in at least one country — were 63% preserved in both meaning and translation across the 6 countries (r = .80, r²= .63, countrywide dimension loadings explained by the average loading), leaving the remaining 37% to be accounted for by either differences in meaning across cultures, imperfect translation across languages, and sampling error. This is shown in FIG. 16, which depicts loading correlations for each country and dimension (e.g., emotional meaning). Where there appeared to be cultural differences, the facial movements were nonetheless imitated very similarly across cultures, confirming that the findings from this study reflected differences in the meanings attributed to the underlying facial movements rather than in the ability to perceive or produce a given set of facial movements.

[0167] Further, the study found that the emotions attributed to facial movements were not discrete in terms of clear boundaries between categories, but were heterogeneous and varied, reflecting continuous blends of meaning that were reliably associated with subtle combinations of facial movement. These results are reflected in FIGs. 17A-17C. For example, FIGs. 17A and illustrates a t-distributed stochastic neighbor embedding (t-SNE) to visualize the distribution of 17B emotion ratings along the 28 structural dimensions of facial expression that we found to have distinct shared or culture- specific meanings. Projected DNN annotations are shown to the left, and projected average human intensity ratings are shown to the right (for visualization purposes; note that our main analyses did not average across all countries). FIG. 17C illustrates a comparison between DNN predictions and mean human intensity ratings reveals continuity in the meaning of expressions. As the intensity of the expression, measured using the DNN, shifts from minimum to maximum along any given dimension (x-axis), the meaning perceived in the expression varied smoothly. Expressions lying in the middle of the dimension were not more ambiguous, perceived one way or the other, but were rather perceived reliably as blends or gradations of expressions (z.e., the standard deviation in intensity ratings is not significantly higher around the 50th percentile than around the 25th or 75th).

EMPATHIC Al PLATFORM

[0168] Embodiments of the present disclosure provide one or more platforms for providing a user (e.g., developer) access to these empathic Al algorithms. The platform(s) may refer to a series of backend services, frontend applications and developer SDKs that will allow developers to submit various types of data (e.g., media data such as photos, videos, audio, and text). This data will be evaluated for content related to language and expressive behaviors and the results will be returned to the developer. The results may include measures of expression, predictions of user behavior, and generated media data (such as photos, videos, audio, and text). For instance, the results may be generated by predictive models and/or generative models. The predictive models may be trained to generate expressive predictions (e.g., predicted expressions that may include emotion, sentiment, tonality, or toxicity). The predictions may be provided to generative models (such as large language models) to generate media. The generative models may be configured based on the predictive model outputs to respond appropriately (e.g., self-correct when the user appears frustrated), and guiding health and wellness recommendations, among many other uses.

[0169] In one or more examples, the one or more portals may refer to a web application that includes the frontend components and/or services that a developer may use. In one or more examples, the portal can include a user interface that includes playground functions, user management functions, and API management functions.

[0170] FIG. 18 illustrates an exemplary user interface associated with the portal. As shown in the figure, the user portal can include a documentation tab, an examples, tab, and a playground tab. In one or more examples, the documentation tab may provide a user interface that presents documentation related to the model that is used to predict the expressions based on data received from the user. In one or more examples, the examples tab can provide a user interface that includes examples of the model analyzing expressions based on exemplary media data. In one or more examples, the playground tab can provide a user interface that allows the user to upload media data to determine expressions associated with the media data.

[0171] In one or more examples, the playground may correspond to an API sandbox that allows the user to apply one or more machine learning models to media data to predict one or more expressions associated with the media data. In some examples, the playground may include a web page within the portal that provides an interactive visualization tool that allows users to interact with the models without having to install the models or other development tools on a local device (e.g., the user’s electronic device).

[0172] FIG. 19 illustrates an exemplary playground as presented to the user. For example, the playground may be presented to a user via a webpage associated with the portal. In one or more examples, the playground may correspond to an API sandbox that allows the user to apply one or more machine learning models to media data to predict one or more expressions associated with the media data. In one or more examples, the user or developer may use the playground to develop a downstream product application. In one or more examples the playground and/or APIs associated with the playground may be integrated into the user’s downstream application, e.g., digital assistant software.

[0173] In one or more examples, the platform may include a plurality of backend services. In one or more examples, the backend services may be run on remote servers (e.g., servers remote from the user device). In one or more examples, the remote servers may be used to evaluate the media data for the user, but otherwise may not otherwise be directly accessible to the user.

[0174] In one or more examples, the backend services can include: one or more ingress services, one or more authorization services, one or more pubsub services, one or more model worker services, and one or more persistence services.

[0175] The ingress services may correspond to the first servers and/or entry point that a developer would access on their way to the backend services. The ingress services can route the developer’s request to the appropriate backend service.

[0176] The authorization services can validate the identity of one or more users. The authorization services may also be used to disambiguate which user is making calls to the APIs.

[0177] The pubsub service may be used to publish and receive messages for data prediction requests for later processing. In one or more examples the pubsub service may be used by the batch processing API, which can be used when an application analyzes saved videos, audio, or image files.

[0178] The model worker service may use one or more models to analyze media data received from the user and conduct the data annotation. In one or more examples, the one or more models can include one or more trained algorithms that measures behaviors related to emotion, sentiment, tonality, or toxicity in the media data and/or predicts the meanings of the expressions of individuals associated with the media data.

[0179] The persistence service may include backend services that maintains state and data for the platform. Platform Developer User Flow

[0180] FIG. 20A illustrates a user flow for the platform. The user can first create an account on the platform, receive an API key, and send data to be analyzed along the API key to the servers. The platform can return predictions and allow developers to utilize those predictions in a downstream application, such as a digital assistant. Embodiments of the present disclosure provide a platform that analyzes expressive behavior for a user to use in a downstream application. Accordingly, the platform is designed to enable a user to analyze expressive behavior-based media data. In some embodiments, the present disclosure provides a platform that generates media data (e.g., audio, video, images, or text) for use in a downstream application, having been previously trained or being continuously trained to generate media data that evokes desired expressive behavior in users in response to the media data. Accordingly, the platform may be configured to enable a user to generate media data that evokes desired emotions.

[0181] The platform can automatically identify and measure multimodal expressive behavior, including facial or bodily movements and vocal or linguistic expressions. The platform provides a number of advantages over current methods for measuring expressive behavior, for example, (1) the platform generates a much wider range of measurements of expressive behavior than current methods; (2) the platform is compatible with a wide range of data classes, including images, audio recordings, or video recordings; and (3) the platform can process both files and streaming inputs. In this manner, the platform can provide a wide range of nuanced measures of expressive behavior, such as tones of sarcasm, subtle facial movements, cringes of empathic pain, laughter tinged with awkwardness, or sighs of relief. Additionally, the platform can categorize and measures both speech sounds and nonlinguistic vocal utterances, such as laughs, sighs, chuckles, “oohs,” and “ahhs.”

[0182] Referring to FIG. 20A, at the arrow Developer account created: An account is created for a developer, either by self-sign up or by platform engineers.

[0183] At the arrow Create API key: A developer requests that an API key is created which will later be used to identify a particular developer’s application and separate it from other developers’ applications.

[0184] At the arrow Develop an application: A developer makes an application which calls the platform’s APIs. [0185] At the arrow Send data to be analyzed (along with API key): The application sends the data it would like to be evaluated for content related to expressive behavior. The API key is required to identify the application, or the platform will reject the request.

[0186] At the arrow Use deep neural networks to analyze data: The platform’s servers will use their deep learning, neural network models to evaluate the data passed from the application.

[0187] At the arrow Return predictions: The predictions that the platform’s servers created using the application’s inputted data are returned back to the calling application.

[0188] FIGS. 20B-20E illustrate portions of the developer flow of FIG. 20A using inputs received from a developer at a terminal window. FIG. 20B illustrates exemplary inputs at a terminal window for sending data to be evaluated for content related to expressive behavior. In some embodiments, the developer provides the API key generated by the platform, the data to be evaluated, and the model the platform should run, for instance, the facial expression model as shown in the exemplary illustration of FIG. 20A, a vocal burst model, a language model, a speech prosody model, and/or combinations thereof. In return, the platform may cause a job ID to be displayed at the terminal window, as shown.

[0189] FIG. 20C illustrates inputs at the terminal window that the developer can use to request a status of the platform’s evaluation of the data provided by the developer (i.e., a “job status”). In some embodiments, the developer provides their API key and job ID along with a request for their job status, and the platform causes a job status to be displayed at the terminal window. If the job is complete, the platform may cause a job status of “COMPLETED” to be displayed. If the job is not complete, the platform may cause a job status of “QUEUED” or “IN PROGRESS” to be displayed. If the job is queued or in progress, a developer can re-run the request for the job status, as desired, until the job status is complete. In some embodiments, the developer must wait a certain time period before re-running the job status request.

[0190] FIG. 20D illustrates inputs at the terminal window that the developer can use to request the predictions resulting from the platform’s evaluation. In some embodiments, the developer provides their API key and job ID along with a request for the predictions and the platform may return the predictions. In some embodiments, the platform returns the predictions directly at the terminal window, as shown in FIG. 20E. Platform Sandbox/Playground

[0191] FIG. 21 illustrates an exemplary playground or API sandbox. The playground can be an interactive tool that will allow users to interact with the models without having to install the models or development tools associated with the models on their own devices. Once a user logs in to the platform/portal, the user can upload a video, audio, or image file and see predictive measures of the meaning or social significance of human behaviors captured in the media file visualized on a dashboard.

[0192] Referring to FIG. 21, at the arrow Developer logs into portal: Using developer account credentials, a developer logs into the portal.

[0193] At the arrow Allow developer access to portal: If the credentials are valid, the developer may be allowed to access the portal.

[0194] At the arrow Browse to API sandbox/Play ground: On the portal webapp, the user can navigate to a section or page that will evaluate data.

[0195] At the arrow Show fields and descriptions of API: The API sandbox/Playground can display fields and descriptions of individual APIs. There may be an UI element which can allow users to enter in example values for them, including allowing uploading video and audio.

[0196] FIG. 22 illustrates an exemplary user interface associated with the playground. The playground may include one or more APIs. As shown in FIG. 22, the exemplary playground can include a media player API, an expression tracking API, and an output visualizer API. The playground also includes one or more user affordances (e.g., icons) for allowing a user to upload media data (e.g., audio data, video data, image data, text data). In some embodiments, a user may upload data to the playground any conventional techniques (e.g., drag and drop, selection from a file manager, etc.). For instance, a user may drag and drop the file at a user affordance on the interface associated with the playground and/or a user may open a file manager by selecting one of the user affordances and select a file to upload to the playground. In some embodiments, a user may instead select an example file from a library of examples files using the interface associated with the playground. In some embodiments, as described above, a user may transmit data to the platform using a terminal window as an alternative to the playground. [0197] FIG. 23 illustrates an exemplary user interface associated with the playground that allows a user to select and upload media data to the platform via the playground of the portal.

[0198] Referring back to FIG. 21, once the user has uploaded media, at the arrow Send data and parameters along with API key: The data and parameters the developer selects can be sent back to the portal. The API key will be populated by the portal to send to the platform service, which requires a valid API key. A developer can select any of their API keys to be sent along with the sandbox/play ground portal’s request.

[0199] At the arrow Dispatch request to relevant service: The portal and platform can dispatch the request to the appropriate backend service to fulfill the request that the developer sent.

[0200] At the arrow Use deep neural networks to analyze data: The platform’s servers can use their deep learning, neural network models to evaluate the data passed from the application.

[0201] At the arrow Return predictions to portal: The predictions generated by the backend services can be returned to the portal.

[0202] At the arrow Display predictions to developer: In turn, the portal will return the results of the predictions to the developer who requested to see an example of this API’s execution.

[0203] FIG. 24 illustrates an exemplary playground displaying the results of the expressive (e.g., emotion, sentiment, tonality, or toxicity) predictions of the models associated with the platform. As shown in the figure, the playground may include a media region (named “Media Player” in FIG. 22 and FIG. 24, and named “File Review” in FIG. 37), an expression tracking region (named "Emotion Tracking” in FIG. 22 and FIG. 24, and named “Expression Timelines” in FIG. 37), and an output visualizer region (named “Output Visualizer” in FIG. 22 and FIG. 24, and named “Embedding Plot” in FIG. 37). In one or more examples, the expression tracking region may correspond to an expression tracking region that allows a user to track one or more emotions, sentiments, tonalities, or toxicity measures. The expression tracking region may illustrate how one or more expressions predicted based on the media data vary over time. In one or more examples, the expression tracking region may include one or more graphical representations illustrating the predicted expressions over time. As shown in the figure, the graphical representations can include, but are not limited to facial expressions, vocal bursts, language. In one or more examples, the graphical representations may also correspond to vocal prosodies. As shown in the figure, the expression tracking region may include a drop-down menu or other affordance that allows a user to select one or more expressions (emotions, sentiments, tonalities, and/or toxicity measures) to be displayed in the expression tracking region. As shown in FIG. 24, the dropdown menu indicates that “Happiness” has been selected.

[0204] In some embodiments, the output visualizer (also referred to herein as an expression visualizer, “embedding plot” and/or “embedding space plot”) may include a graphical representation of an embedding space. In some embodiments, the embedding space comprises a static background. In some embodiments, the static background comprises a plurality of regions corresponding to a plurality of different expressions that the one or more neural networks are trained to predict. In some embodiments, the static background comprises a visualization of embeddings representing all of the expressions that the one or more neural networks are trained to predict. In some embodiments, the embedding space comprises dynamic region. In some embodiments, the dynamic region comprises a visualization of embeddings representing one or more predicted expressions associated with one or more individuals based on media content. In some embodiments, the dynamic region comprises a dynamic overlay on the static background, wherein the dynamic overlay comprises a visualization of a plurality of embeddings representing the predicted one or more expressions associated with the one or more individuals based on the media content. In some embodiments, an embedding of the plurality of embeddings is displayed at a region of the plurality of regions of the static background based on a predicted expression the embedding represents. Further description of the embedding space is provided below.

[0205] As noted above, in some embodiments, a static background region of the output visualizer and may illustrate the emotions the emotions, sentiments, tonalities, toxicity, user experience, and well-being measures that the predictive models (e.g., facial expression model, speech prosody model, vocal burst model, language model) associated with the platform are trained to predict. In some embodiments, the static background region is a gradient plot that illustrates continuous transitions between different emotions, sentiments, tonalities, toxicity, user experience, and well-being measures. The gradient/continuous transitions may be illustrated, for instance, using gradual color transitions between different emotions, sentiments, tonalities, toxicity, user experience, and well-being measures, as shown. In some embodiments, colors represent expression dimensions, and the gradient transition between colors may represent how expression dimensions can be combined in different ways. For example, as shown, triumph and joy may be closely related emotions and are thus spaced near one another in the background region of the embedding plot. In the exemplary output visualizer shown, joy is represented by a first color (yellow), and triumph is represented by a similar color (orange). The gradient between joy and triumph is illustrated on the plot as a gradual transition from yellow to orange, representing various combinations of the dimensions associated with emotions joy and triumph.

[0206] In some embodiments, the dynamic region of the output visualizer includes a representation of expressions (emotions, sentiments, tonalities, toxicities, etc.) predicted based on the media data and displayed as a dynamic overlay on the static background. In some embodiments, the overlay region includes a visualization of a plurality of embeddings in the embedding space. In some embodiments, each embedding is respectively associated with a frame of the media data. For instance, one or more embeddings for each individual identified based on the media data may be generated, and each embedding may include a lower dimensional representation of predicted emotions, sentiments, tonalities, toxicities, etc. of the respective user for each frame in the media data. The embeddings may be overlaid on the static background region of the output visualizer at a location of the output visualizer that is determined based on the predicted emotions, sentiments, tonalities, toxicities, etc. of the respective user for that frame. For instance, an embedding generated based on an individual’s expression of joy may be overlayed, highlighted, or otherwise displayed on the static background region of the output visualizer that corresponds to the emotion joy. The embeddings overlayed on the static background region may illustrate a trajectory of a respective individual’s predicted emotions, sentiments, tonalities, toxicities, etc. across each frame of the media data. For instance, each generated embedding associated with a user may be displayed at a different location of the output visualizer, each location associated with a predicted expression based on the media data. In some embodiments embeddings generated for a selected individual may be displayed on the output visualizer, and in some embodiments, embeddings generated for all individuals may be displayed at once. [0207] In some embodiments, the output visualizer may include one or more dynamic icons each associated with a respective individual identified in the media data. The dynamic icons may traverse the embeddings illustrated on the output visualizer during playback of the media data using the media region. Accordingly, the dynamic icons may be displayed at different positions on the output visualizer at different times during playback of the media data corresponding to an embedding representing expressive predictions of an individual at each respective frame. In some embodiments, the embeddings may be user selectable on the output visualizer, and when selected may cause the dynamic icon associated with a respective individual to move to the selected embedding. In some embodiments, selection of an embedding on the output visualizer may cause the media data to begin playback at a frame associated with the selected embedding, or may cause the media data to transition forward to the selected frame without continuing or beginning playback of the media data. In some embodiments, selection of an embedding on the output visualizer associated with one individual’s expressive predictions at a frame of the media data may also cause the dynamic icons associated with other individuals identified in the media data to transition to positions on the output visualizer corresponding to embeddings representing their expressive predictions at that respective frame.

[0208] As noted above, in some embodiments, each embedding may include a lower dimensional representation of predicted emotions, sentiments, tonalities, toxicities, etc. of the respective individual for each frame in the media data. In some embodiments, the output visualizer may display a plurality of probabilistic indicators associated with respective embeddings indicative of expressive predictions associated with the embedding. For example, an embedding associated with a first individual at a respective frame of the media data may include a probabilistic indication of 0.69 for concentration, 0.62 for interest, and 0.49 for calmness. An embedding associated with a second individual at the same respective frame of the media data may include a probabilistic indication of 0.69 for calmness, 0.54 for interest, and 0.45 for amusement. Accordingly, the probabilistic indicators may provide an intuitive score-type indication for each individual that can be readily compared during playback of the media data. The probabilistic indicators associated with each embedding may be displayed when the embedding is selected and/or when playback of the media reaches a frame associated with the embedding. Accordingly, the displayed scores may also be associated with and change based on the position of the dynamic icons on the output visualizer. [0209] In some embodiments, the output visualizer includes a dynamic graphical representation of embeddings associated with the predictive expressions generated based on the media data. The output visualizer may dynamically plot embeddings associated with all predicted emotions, sentiments, tonalities, toxicities, etc. at each frame (or segment) of the media data as the media data is processed by one or more of the predictive models (e.g., the facial expression model, speech prosody model, vocal burst model, language model). Accordingly, the embedding plot may grow and change to include new expressive predictions predicted at each frame or segment of the media data. Embeddings may shift within the embedding space as new expressive predictions are generated and new relationships are illustrated in the embedding space between different expressive predictions, for instance, a close relationship between joy and triumph.

[0210] In some embodiments, the output visualizer includes a user selectable affordance that allows a user to select from any of the expressive prediction models (e.g., the facial expression model, speech prosody model, vocal burst model, and language model). As illustrated in FIG. 24, the affordance may be a drop down menu. In the exemplary illustration of FIG. 24, “Facial Expression” is selected on the drop down menu, indicating that the predictions displayed on the output visualizer are generated using a facial expression model. In response to detecting a user selection of any one of the models, the playground may cause the media region, the expression tracking region, and/or the output visualizer region to display information associated with expressive predictions generated by the selected model.

[0211] The output visualizer provides an intuitive interface that a user can interact with to monitor and track predicted emotions for each individual identified based on input media data. A user can utilize the output visualizer to discern how conversations and interactions impact the emotional state of various individuals differently and identify interactions that are generally produce positive or negative emotions across a majority of individuals, among a variety of additional analytical tasks.

[0212] In some embodiments, the media region includes a graphical representation of the media data received from the user. In one or more examples, the media region can permit playback of the media data, e.g., if the media data corresponds to video data or audio data. As shown in the figure, a graphical representation of the predicted emotions may be overlaid on the media data. For example, the graphical representation indicates that the woman on the left is predicted to be experiencing 35% happiness and 65% excitement, while the woman on the right is predicted to be experiencing 23% confusion and 77% happiness. As shown in the figure, third individual to the far right side of the figure is facing away from a camera such that the individual’s face is not visible for facial expression analysis. In examples where an individual’s facial expressions are not visible for analysis, the platform may still return and the expressive predictions based on voice data and the graphical representation may still display expressive predictions associated with the individual based on the voice data. In some embodiments, the system identifies individuals in the media data based on voice recognition techniques, facial recognition techniques, or a combination thereof. Accordingly, the system can discriminate between different individuals in the media data and return expressive predictions based on either or both of facial expression data and/or voice data.

[0213] In one or more examples, particularly for video data and/or audio data, the predicted emotions may vary based on a playback time associated with the data. For example, as shown in the figure the predicted emotion of the individual on the left is 35% happiness and 65% excitement at a playback time of 3:05. The predicted emotions may be different at a different time, e.g., the percentage of emotions associated with the individual may vary with time or the emotions may change altogether. In one or more examples, while the media data is played back, the graphical representations associated with the predicted emotions may change accordingly.

[0214] In some embodiments, the media data may be processed using one or more acoustic models and/or one or more object detection models to identify individuals based on the media data. Accordingly, media data can include labels associated with each visible individual and/or each individual that produces audio included in the media data. For instance, in some examples, the media data may be processed using facial detection and voice recognition models to discern between one or more individuals identifiable based on the media data. Specifically, in some examples the facial detection models may identify one or more individuals based on video data in which an individual’s face is visible and the voice recognition models may identify one or more individuals based on audio data that includes the individual’s voice (e.g., when the individual is speaking). In some examples, the facial detection models may be trained using labeled image data including labeled images of human faces. In some examples, the voice recognition models may be trained using any conventional technique for training an acoustic model, for instance, using training data including a sequence of speech feature vectors. It should be understood that the aforementioned training methods are meant to be exemplary and any suitable methods for training facial detection and/or voice recognition models may be used.

[0215] In some examples, after processing the media data to identify one or more individuals (e.g., based on the facial detection models and/or voice recognition models), the media data is input into one or more expressive prediction models (e.g., a facial expression model, a speech prosody model, a vocal burst model, a language model, and/or combinations thereof). In some examples, the expressive prediction models are trained to predict expressions (e.g., emotion, sentiment, tonality, or toxicity) for each of the individuals identified based on the medial data. In some examples, the media data is processed frame-by- frame (or segment-by-segment) by the one or more expressive prediction models. The one or more expressive prediction models may return predictions for each frame, or segment of the audio file. Accordingly, the expressive predictions returned by the expressive prediction models may vary for each identified individual on a frame-by-frame or segment-by-segment basis, thus allowing for time-variant analysis and generation of expressive predictions at each point in time in the media data by the one or more expressive prediction models.

Platform/Backend

[0216] FIG. 25 illustrates the sign up and authentication user flow as well as the interaction between the major components of the platform backend. Components include ingress, authentication module, persistence layer, and machine learning model.

[0217] User sign up: user submits sign up details: a web request is sent to create a new user account within the platform. Request a new user be created: the system receives a web request is made to register a new authentication record that can be used to authenticate one user. Response with successful new user: new authentication details are retrieved. Create new user: the authentication details are persisted for later use by the system. Respond with successful sign up: some authentication details are returned in a web response including whether the sign up was successful.

[0218] User request for session authentication: Request session auth token: a user requests authentication details for one session with which they may access personal or restricted platform data and services. As used herein, the auth token may also be referred to as an API key. Respond with session auth token: return session auth token to user. [0219] User request for emotion data analysis: Send emotion data with auth token: a web request is made containing or linking to existing data. Request auth with token: the platform checks that the session auth is valid for the current user. Respond with successful auth: the platform verifies that the session auth is valid for the current user. Check user quota: the platform checks that the currently authenticated user has sufficient quota to execute the request. The quota may correspond to a pre-determined amount of processing time for running the model associated with an account of the user. The quota may vary depending on an access tier associated with a user but can range from 100 minutes of processing a day to tens of thousands of minutes. Verify existing user quota: the platform verifies that the currently authenticated user has sufficient quota to execute the request. Analyze emotion data with deep neural network: an internal process analyzes data to produce a report of emotions contained within or expressed by data supplied to the system. This process may include but is not limited to application of machine learning techniques such as deep neural networks. Update user quota: persistence services register an update to user quota according to the cost of the operation requested. Respond with emotion analysis: a web response is returned containing an analysis of emotional content within the data from the request.

Streaming API Backend

[0220] FIG. 26 illustrates a user flow for streaming emotion analysis from the platform streaming API. Various technical obstacles were overcome to develop the platform streaming API. For instance, the platform streaming API is configured to ensure low latency while handling large volumes of data. The platform streaming API is further configured to scale to support varying throughput without over-provisioning resources, efficiently manage and store state data across distributed systems, ensure correct ordering and consistency of messages. The platform streaming API is also configured for real-time error detection and recovery.

[0221] Request websocket conn: a web request is made to establish a websocket connection. Respond with websocket conn: a web response is returned containing an active websocket connection. Send emotion data for analysis: a web request is made containing or linking to existing data. Analyze emotion data with deep neural network: an internal process analyzes data to produce a report of emotions contained within or expressed by data supplied to the system. This process may include but is not limited to application of machine learning techniques such as deep neural networks. Respond with emotion analysis: a web response is returned containing an analysis of emotional content within the data from the request. [0222] In one or more examples, the streaming API may be used when an application analyzes live expression outputs, such as a digital assistant that responds in real time to the user’s queries and takes into account signals of frustration, or a patient monitoring application that alerts the doctor when a patient is expressing pain. On the playground, there is an interface to test the streaming API by giving the web app access to one’s webcam and microphone. FIG. 27 illustrates an exemplary playground associated with a streaming API. The exemplary playground illustrated in FIG. 27 may include similar features to the playground illustrated in FIG. 24. For example, the exemplary playground illustrated in FIG. 27 may include a media player region, an emotion or expression tracking region, and an output visualizer region. In one or more embodiments, the playground illustrated in FIG. 27 may further include one or more user affordances for receiving an indication from a user to receive a live stream, e.g., from a local recording device associated with the user such as (but not limited to) a microphone or video camera. A developer can use the emotion tracking output and the output visualizer to test hypotheses about how emotional behaviors relate to important outcomes in their application. The insights derived from the emotion tracking output and the output visualizer can guide how the developer integrate the streaming API with their application. Within the application, streaming measurements of users’ expressive behaviors can be used to improve user experiences by presenting content in accordance with users’ preferences, inputting the expression data into generative Al models that respond appropriately (e.g., self-correct when the user appears frustrated), and guiding health and wellness recommendations, among many other uses.

Batch Processing API Backend

[0223] FIG. 28 illustrates how the user interacts with the batch processing API through the Playground. The batch processing API may be used when an application analyzes saved videos, audio, or image files. Various technical obstacles were overcome to create the batch processing API. For instance, the batch processing API is configured to handle extremely large data volumes in single batch operations, manage dependencies between various stages of processing, and provide graceful degradation (and/or fault tolerance) to avoid overall system failure. Further, the batch processing API is configured to schedule and manage large batch jobs to avoid conflicts and optimal resource usage.

[0224] Typical use cases for the batch processing API include the analysis of recorded calls by a call center analytics company or the analysis of patient videos by a telehealth company. For example, call center recordings may be analyzed using the expression API to determine common causes of frustration for users and identify the solutions that resulted in the greatest user satisfaction. As another example, recordings of telehealth calls may be used to analyze the effectiveness of treatments based on patient expressions of positive emotions such as happiness and contentment and negative emotions such as pain and sadness. The UI associated with the batch processing may be similar to the UI illustrated in FIG. 24.

[0225] The user authenticates through the authorization system and receives a session auth token in response. The developer then submits the media data via an API of the portal for prediction. The developer must submit their provided session auth token along with the prediction request in order for the platform to accept the request. The persistence backend service can then verify that the developer has not surpassed their usage quota on the platform.

[0226] Once the developer’s platform usage status is verified, the user’s prediction job request is then published to the pubsub service. Model workers are subscribed to the pubsub service, constantly listening for newly published jobs that they can start working on.

[0227] Once a model worker service receives a prediction job message from the pubsub service to which it is subscribed, it processes the job’s input data through Hume Al’s deep neural network model. The model worker then sends the model prediction results directly back to the developer who originally submitted the data for prediction.

SDK— >API Communication

[0228] FIG. 29A illustrates a communication pattern between an SDK (software development kit) and the web API (application programming interface). The SDK could be implemented by a user of the platform or supplied as part of the Hume platform software ecosystem. The SDK may preprocess user data for privacy. That processing may be differential and provide algorithmic guarantees of anonymity.

[0229] Preprocess data for differential privacy: data transformations may optionally be applied within the SDK. These transformations may provide privacy and anonymity guarantees or enhance performance through techniques including but not limited to data compression. For instance, a telehealth company may use an on-device SDK to extract expression measures from the live webcam footage of a patient. Prior to being transferred from the patient’s device, these measures may be compressed or adjusted to provide a mathematical guarantee that the patient cannot be reidentified based on the measures alone.

[0230] Request emotion analysis: a web request is made containing or linking to existing data. For instance, the web request may contain or link to the expression measures from the live webcam footage of a patient that were compressed or adjusted to provide a mathematical guarantee that the patient cannot be reidentified based on the measures alone. Expressive predictions may be generated using any of the models (e.g., speech prosody, language, vocal burst, facial expression) described herein.

[0231] Response with emotion analysis: a web response is returned containing an analysis of emotional content within the data from the request.

[0232] FIGS. 29B-29D illustrate communications between a developer and web API to initiate and use the SDK. FIG. 29B illustrates exemplary user commands for submitting a new batch job. FIG. 29C illustrates exemplary user commands for requesting predictions, and FIG. 29D illustrates exemplary predictions returned to the user.

[0233] FIG. 30 illustrates an example of a computing device in accordance with one embodiment. Device 3000 can be a host computer connected to a network. Device 3000 can be a client computer or a server. As shown in FIG. 30, device 3000 can be any suitable type of microprocessor-based device, such as a personal computer, workstation, server, or handheld computing device (portable electronic device) such as a phone or tablet. The device can include, for example, one or more of processor 3010, input device 3020, output device 3030, storage 3040, and communication device 3060. Input device 3020 and output device 3030 can generally correspond to those described above and can either be connectable or integrated with the computer.

[0234] Input device 3020 can be any suitable device that provides input, such as a touch screen, keyboard or keypad, mouse, or voice-recognition device. Output device 3030 can be any suitable device that provides output, such as a touch screen, haptics device, or speaker.

[0235] Storage 3040 can be any suitable device that provides storage, such as an electrical, magnetic, or optical memory including a RAM, cache, hard drive, or removable storage disk. Communication device 3060 can include any suitable device capable of transmitting and receiving signals over a network, such as a network interface chip or device. The components of the computer can be connected in any suitable manner, such as via a physical bus or wirelessly.

[0236] Software 3050, which can be stored in storage 3040 and executed by processor 3010, can include, for example, the programming that embodies the functionality of the present disclosure (e.g., as embodied in the devices as described above).

[0237] Software 3050 can also be stored and/or transported within any non-transitory computer-readable storage medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a computer-readable storage medium can be any medium, such as storage 3040, that can contain or store programming for use by or in connection with an instruction execution system, apparatus, or device.

[0238] Software 3050 can also be propagated within any transport medium for use by or in connection with an instruction execution system, apparatus, or device, such as those described above, that can fetch instructions associated with the software from the instruction execution system, apparatus, or device and execute the instructions. In the context of this disclosure, a transport medium can be any medium that can communicate, propagate or transport programming for use by or in connection with an instruction execution system, apparatus, or device. The transport readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic or infrared wired or wireless propagation medium.

[0239] Device 3000 may be connected to a network, which can be any suitable type of interconnected communication system. The network can implement any suitable communications protocol and can be secured by any suitable security protocol. The network can comprise network links of any suitable arrangement that can implement the transmission and reception of network signals, such as wireless network connections, T1 or T3 lines, cable networks, DSL, or telephone lines.

[0240] Device 3000 can implement any operating system suitable for operating on the network. Software 3050 can be written in any suitable programming language, such as C, C++, Java or Python. In various embodiments, application software embodying the functionality of the present disclosure can be deployed in different configurations, such as in a client/server arrangement or through a Web browser as a Web-based application or Web service, for example.

Platform Sandbox/Playground

[0241] FIG. 31 illustrates an exemplary homepage interface of the platform. As shown in the figure, the homepage includes a plurality of user selectable affordances that enable a user to navigate to different pages of the platform. For instance, the homepage may include affordances that, when selected, enable a user to navigate to the playground (affordance named “Visit Playground” in FIG. 31) (the playground is described in detail above with reference to FIG. 24 and described further below with reference to FIGS. 34-38), and/or to view documentation associated (affordance named “View Documentation” in FIG. 31) with, for instance, instructions for using the platform’s APIs, descriptions of the platform’s facial expression, vocal burst, speech prosody, and language models, and/or instructions for responsible usage of the platform. In some embodiments, a unique affordance is provided for selecting any one of the aforementioned instructions. In some embodiments, the homepage may include affordances associated with sample video (named “Video Example” in FIG. 31), audio (named “Audio Example” in FIG. 31), and/or image data (named “Image Example” in FIG. 31) that when selected, cause the platform to display expressive predictions associated with the sample data generated by the models described herein. In some embodiments, the homepage includes user selectable affordances that enable a user to connect to a webcam (named “Connect Your Webcam” in FIG. 31) and/or to upload a media file (named “Upload Your Own File” in FIG. 31). The homepage may also include user identifying information and an API key assigned to the user. A user selectable affordance may be provided enabling a user to copy the API key to a clipboard. Additionally, the homepage may include a user selectable affordance that when selected enables a user to add credits (i.e., funds) to their account associated with the platform.

[0242] FIG. 32 illustrates an exemplary file upload and analysis page of the playground. As shown in the figure, the file upload and analysis page may include a plurality of user selectable affordances that enable users to upload files for analysis using the models associated with the platform. The plurality of user selectable affordances may include a drag- and-drop field configured to receive a media file with a user “drags and drops” the file into the field (labeled “Drag a file here or click to browse” in FIG. 32). The plurality of user selectable affordances may include an affordance that, when selected, causes a file manager window to open enabling a user to select a media file for upload to the portal (labeled “Drag a file here or click to browse” in FIG. 32). The plurality of user selectable affordances may include affordances associated with sample (named “Video Example” in FIG. 32), audio (named “Audio Example” in FIG. 32), and/or image data (named “Image Example” in FIG. 32) that when selected, cause the portal to display expressive predictions associated with the sample data generated by the models described herein, for instance, using the playground.

[0243] FIG. 33 illustrates an exemplary user account page displaying usage data associated with a user account. As shown in the figure, the user account page may include a graphical illustration of a user’s API usage associated with a respective time period (e.g., on each day of the last month, 14-day period, week, etc., named “API Usage” in FIG. 33), a breakdown of categories of API usage (e.g., video and audio, audio only, video only, images, text, etc., named “API Usage Breakdown” in FIG. 33), and a credit balance of the user account (named “Credit Balance” in FIG. 33). In some embodiments, the account page includes a plurality of user selectable affordances that enable a user to navigate to various pages of the portal. For instance, respective affordances included on the user account page, when selected, may cause the platform to display, a user profile page, a user preferences page, a user API key, a jobs page, a payment information page, an add credits page, a pricing page, a playground page, a documentation page, a file upload page and/or a home page. According to some embodiments, each of the aforementioned pages may also include user selectable affordances that enable a user to navigate to any one or more of the aforementioned pages. As shown in FIG. 33, the aforementioned affordances may be named in accordance with the respective page that is displayed when the affordance is selected.

[0244] FIG. 34 illustrates an exemplary webcam page of the playground displaying the expressive predictions of a model generated based on a real-time analysis of streamed video captured by a webcam and transmitted to the platform using a streaming API. The webcam page may include an affordance that, when selected, causes the platform to connect or disconnect to a webcam and/or microphone at a user device (e.g., a laptop, mobile phone, etc.). When connected, the platform may continuously process the streaming video and/or audio data to generate expressive predictions using a selected model. In some embodiments, the webcam page includes a media region that displays a stream of video data captured by a webcam. In some embodiments, the media region includes an affordance that enables a user to stop the streaming video (shown as a square near a time indicator in the streaming webcam video shown in FIG. 34) In some embodiments, the webcam page may include affordances that, when selected, cause expressive predictions generated by the model to be displayed or hidden (labeled as

in FIG. 34). The expressive predictions may be displayed on the webcam page as, for instance, a list of predicted expressions predicted by the selected model based on the streaming video data (named “Top Emotions List” in FIG. 34), a confidence score associated with the prediction of each of the predicted expressions, and/or a predicted expression level associated any of a plurality of expressive predictions (provided in region labeled “Expression Levels” in FIG. 34). As illustrated in FIG. 34, the displayed expressive predictions are expressive predictions generated by a facial expression model. In some embodiments, the webcam page may include a plurality of user selectable affordances that when selected, respectively cause the webcam page to display expressive predictions based on any of a facial expression model, a vocal burst model, a language model, and/or a speech prosody model. In some embodiments, an affordance that when selected causes the webcam page to display expressive predictions a facial expression model is named “Facial Expression” in FIG. 34. In some embodiments, an affordance that when selected causes the webcam page to display expressive predictions a vocal burst model is named “Vocal Burst” in FIG. 34. In some embodiments, an affordance that when selected causes the webcam page to display expressive predictions a speech prosody model is named “Speech Prosody” in FIG. 34.

[0245] FIG. 35 illustrates the exemplary webcam page of the playground displaying the results of the expressive predictions of a speech prosody model based on a real-time analysis of streamed audio data transmitted to the platform using a streaming API. As illustrated, the displayed results of the expressive predictions may include a display of the top predicted expressions (named “Top Emotions List” in FIG. 35) and/or a display of all predicted expressions predicted by the speech prosody model based on the streaming audio data (named “Speech Prosody Activity” in FIG. 35). The displayed predictions may include a time indicator corresponding to a time the expression was predicted in the streamed media. The displayed predictions may include an expression selector that allows users to display the extent to which a selected predicted expression output of the speech prosody model was present over the course of the media file.

[0246] FIG. 36 illustrates an exemplary playground displaying the results of the expressive (e.g., emotion, sentiment, tonality, or toxicity) predictions of the vocal burst model associated with the platform using a streaming API. As illustrated, the displayed results of the expressive predictions may include a display of top predicted expressions predicted (named “Top Emotions List” in FIG. 36) and/or a display of all predicted expressions based on vocal burst activity predicted by the model based on the streaming audio data (named “Vocal Burst Activity” in FIG. 36). The displayed predictions may include a predicted expression (e.g., realization, calmness, anxiety, etc.) and time indicator corresponding to a time the expression was predicted in the streamed media. The displayed predictions may include an expression selector that allows users to display the extent to which a selected predicted expression output of the vocal burst model was present over the course of the media file.

[0247] Any of the webcam pages of FIGS. 34-36 may include an output visualizer (shown labeled in the figure as “Embedding Space Plot”). The output visualizer may illustrate a plot of embeddings generated by the selected model associated with each webcam page (i.e., the facial expression model, the vocal burst model, and/or the speech prosody model), as described above with reference to FIG. 24 and below with reference to FIG. 37. Any of the webcam pages of FIGS. 34-36 may include a user affordance associated with the embedding space plot that, when selected, causes the output visualizer to be hidden and/or displayed. Further detail regarding the output visualizer is provided below with reference to FIG. 37.

[0248] FIG. 37 illustrates an exemplary playground displaying the results of the expressive (e.g., emotion, sentiment, tonality, or toxicity) predictions of the models associated with the platform. As shown in the figure, the playground may include a media region (named “File Review” in FIG. 37), an expression tracking region (named “Expression Timelines” in FIG. 37, and including a facial expression plot, a speech prosody plot, a vocal burst plot, and a language plot), and an output visualizer region (named “Embedding Plot” in FIG. 37). In one or more examples, the expression tracking region may allow a user to track one or more emotions, sentiments, tonalities, or toxicity measures. The expression tracking region may include an affordance (for instance, a drop down menu) that allows a user to select from a plurality of expression tracking options. The expression tracking region may display a graphical representation of a selected emotion, sentiment, tonality, or toxicity measure at various points in time throughout the media file. The expression tracking region may further include a plurality of user selectable affordances that allow a user to track selected expressions predicted for a specific individual based on the media data (e.g., by selecting an icon associated with the respective individual). For instance, as shown, when a “top 5 expressions” is selected using the affordance that allows a user to select from a plurality of expression tracking options and “P3” is selected using the affordance that allows a user to track selected expressions predicted for a specific individual, as shown in FIG. 37, the expression tracking region may display a graphical representation of the top five expressions predicted at various times throughout the media file for an individual associated with the “P3” icon. In some embodiments, the expression tracking region includes a graphical representation of expressive predictions generated by each of a facial expression model, a speech prosody model, a vocal burst model, and/or a language model.

[0249] In some embodiments, the expression tracking region includes an affordance that when selected allows a developer to embed code to integrate external content into the playground (labeled as “< > Embed” in FIG. 37) . In some embodiments, the code may enable an external application to intake and visualize the information displayed in the expression tracking region. For instance, a telehealth application could display the extent to which a given expression or custom prediction was present over the course of a telehealth session, for example, to derive insights about important doctor-patient interactions during the session.

[0250] As shown, the media region includes a graphical representation of the media data received from the user. In one or more examples, the media region can permit playback of the media data, e.g., if the media data corresponds to video data or audio data. In some embodiments, a graphical representation of the predicted emotions may be overlaid on the media data, for instance, as shown above in FIG. 24. In some embodiments, the media region also includes a transcript of the audio in the media file. The transcript may be generated using a language model (e.g., a heuristic models such as a phonetic rule based model or a machine learning model such as a Hidden Markov Model, Deep Neural Network, Recurrent Neural Network, or other machine learning model trained for automatic speech recognition and transcription) and displayed in the media region. In some embodiments, the media region may display an indication of which individual is speaking at the current time point in the media file. Individuals may be identified using various techniques for speaker diarization and/or segmentation such as acoustic feature extraction, voice activity detection, clustering, and so on.

[0251] The output visualizer region (labeled in FIG. 37 as “Embedding Plot,” and also referred to herein as an expression visualizer and/or output visualizer) may include a graphical representation of an embedding space. In some embodiments, the embedding space comprises a static background. In some embodiments, the static background comprises a plurality of regions corresponding to a plurality of different expressions that the one or more neural networks are trained to predict. In some embodiments, the static background comprises a visualization of embeddings representing all of the expressions that the one or more neural networks are trained to predict. In some embodiments, the embedding space comprises dynamic region. In some embodiments, the dynamic region comprises a visualization of embeddings representing one or more predicted expressions associated with one or more individuals based on media content. In some embodiments, the dynamic region comprises a dynamic overlay displayed on the static background, wherein the dynamic overlay comprises a visualization of a plurality of embeddings representing the predicted one or more expressions associated with the one or more individuals based on the media content. In some embodiments, an embedding of the plurality of embeddings is displayed at a region of the plurality of regions of the static background based on a predicted expression the embedding represents. Further description of the embedding space is provided below.

[0252] As noted above, in some embodiments, a static background region of the output visualizer and may illustrate the emotions, sentiments, tonalities, toxicity, user experience, and well-being measures that the predictive models (e.g., facial expression model, speech prosody model, vocal burst model, language model) associated with the platform are trained to predict. In some embodiments, the static background region is a gradient plot that illustrates continuous transitions between different emotions, sentiments, tonalities, toxicity, user experience, and well-being measures. The gradient/continuous transitions may be illustrated, for instance, using gradual color transitions between different emotions, sentiments, tonalities, toxicity, user experience, and well-being measures, as shown. In some embodiments, colors represent emotion dimensions, and the gradient transition between colors may represent how emotion dimensions can be combined in different ways. For example, as shown, triumph and joy may be closely related emotions and are thus spaced near one another in the background region of the embedding plot. In the exemplary output visualizer shown, joy is represented by a first color (yellow), and triumph is represented by a similar color (orange). The gradient between joy and triumph is illustrated on the plot as a gradual transition from yellow to orange, representing various combinations of the dimensions associated with emotions joy and triumph. [0253] In some embodiments, the dynamic region of the output visualizer includes a representation of expressions (emotions, sentiments, tonalities, toxicities, etc.) predicted based on the media data and displayed as a dynamic overlay on the static background. In some embodiments, the dynamic overlay region includes a plurality of embeddings in the embedding space, each respectively associated with a frame of the media data. For instance, one or more embeddings for each individual identified in the media data may be generated, and each embedding may include a lower dimensional representation of predicted emotions sentiments, tonalities, toxicities, etc. of the respective user for each frame in the media data. The embeddings may be overlayed on the static background region of the output visualizer at a location of the output visualizer that is determined based on the predicted emotions, sentiments, tonalities, toxicities, etc. of the respective user for that frame. The embeddings overlayed on the static background region may illustrate a trajectory of a respective user’s predicted emotions, sentiments, tonalities, toxicities, etc. across each frame of the media data. For instance, each generated embedding associated with a user may be displayed at a different location of the output visualizer, each location associated with a predicted expression based on the media data. In some embodiments embeddings generated for a selected individual may be displayed on the output visualizer, and in some embodiments, embeddings generated for all individuals may be displayed at once.

[0254] In some embodiments, the output visualizer may include one or more dynamic icons (shown as the five relatively large circular icons in the embedding plot of FIG. 37) each associated with a respective individual identified in the media data. The dynamic icons may traverse the embeddings illustrated on the output visualizer during playback of the media data using the media region. Accordingly, the dynamic icons may be displayed at different positions on the output visualizer at different times during playback of the media data corresponding to an embedding representing expressive predictions of an individual at each respective frame. In some embodiments, the embeddings may be user selectable on the output visualizer, and when selected may cause the dynamic icon associated with a respective individual to move to the selected embedding. In some embodiments, selection of an embedding on the output visualizer may cause the media data to begin playback at a frame associated with the selected embedding, or may cause the media data to transition forward to the selected frame without continuing or beginning playback of the media data. In some embodiments, selecting of an embedding on the output visualizer associated with one individual’s expressive predictions at a frame of the media data may also cause the dynamic icons associated with other individuals identified in the media data to transition to positions on the output visualizer corresponding to embeddings representing their expressive predictions at that respective frame.

[0255] As noted above, in some embodiments, each embedding may include a lower dimensional representation of predicted emotions, sentiments, tonalities, toxicities, etc. of the respective user for each frame in the media data. In some embodiments, the output visualizer may display a plurality of probabilistic indicators associated with respective embeddings indicative of expressive predictions associated with the embedding. For example, an embedding associated with a first individual at a respective frame of the media data may include a probabilistic indication of 0.69 for concentration, 0.62 for interest, and 0.49 for calmness. An embedding associated with a second individual at the same respective frame of the media data may include a probabilistic indication of 0.69 for calmness, 0.54 for interest, and 0.45 for amusement. Accordingly, the probabilistic indicators may provide an intuitive score-type indication for each individual that can be readily compared during playback of the media data. The probabilistic indicators associated with each embedding may be displayed when the embedding is selected and/or when playback of the media reaches a frame associated with the embedding. Accordingly, the displayed scores may also be associated with and change based on the position of the dynamic icons on the output visualizer, described above.

[0256] In some embodiments, the output visualizer includes a dynamic graphical representation of embeddings associated with the predictive expressions generated based on the media data. The output visualizer may dynamically plot embeddings associated with all predicted emotions, sentiments, tonalities, toxicities, etc. at each frame (or segment) of the media data as the media data is processed by one or more of the predictive models (e.g., the facial expression model, speech prosody model, vocal burst model, language model). Accordingly, the embedding plot may grow and change to include new expressive predictions predicted at each frame or segment. Embeddings may shift within the embedding space as new expressive predictions are generated and new relationships are illustrated in the embedding space between different expressive predictions, for instance, a close relationship between joy and triumph.

[0257] In some embodiments, the output visualizer includes a user selectable affordance (shown as a drop-down menu in FIG. 37 with “Facial Expression” selected) that allows a user to select from any of the expressive prediction models (e.g., the facial expression model, speech prosody model, vocal burst model, and language model). In response to detecting a user selection of any one of the models, the playground may cause the media region, the expression tracking region, and/or the output visualizer region to display information associated with expressive predictions generated by the selected model.

[0258] FIG. 38 illustrates an exemplary text editor page of the playground displaying the results of the expressive (e.g., emotion, sentiment, tonality, or toxicity) predictions of generated by a language model associated with the platform. In some embodiments, the text editor page allows for text input directly into a text editor (shown with text “I’m confident we will get this patent” in FIG. 38) (e.g., using a keyboard, touchscreen, etc.), or uses speech-to- text transcription methods to populate the text editor based on user speech captured using a microphone. Transcription based on audio input may be accomplished using a heuristic model (e.g., phonetic rule-based models, grammar and syntax rule-based models, etc.) and/or machine learning models (e.g., hidden Markov models, recurrent neural networks, etc. trained for transcription tasks).

[0259] In some embodiments, expressive predictions are generated by a language model based on text input into the text editor. In some embodiments, expressive predictions are generated in real time as the text input is typed by a user and/or transcribed based on audio input from the user. In some embodiments, one or more expressive predictions are generated for each word, phrase, and/or sentence of text. In some embodiments, one or more overall expressive predictions indicating the most prominent sentiments, tonalities, and/or toxicities for all of the input text are generated. In some embodiments, individual words and/or phrases are highlighted in the text editor based on expressive predictions generated based on the words and/or phrases. For instance, the phrase “I’m confident” is highlighted/shaded to represent an expressive prediction most closely associated with calmness, and the phrase “we will get this patent” is highlighted to indicate an expressive prediction most closely associated with determination.

[0260] In some embodiments, the text editor page includes a plurality of user selectable affordances that allow a user to select between modes of analysis. For instance, selection of a first mode (named “Emotions” in FIG. 38) may cause the platform to process text input into the text editor using a language model trained to predict emotions. Selection of a second mode (named “Sentiment” in FIG. 38) may cause the platform to process text input into the text editor using a language model trained to predict sentiment. Selection of a third mode (named “Toxicity” in FIG. 38) may cause the platform to process text input into the text editor using a language model trained to predict toxicity. In some embodiments, the text editor page includes an affordance that allows a user to download data from the text editor page (e.g., including text input into the text editor and/or expressive predictions generated based on the text inputs).

Exemplary Neural Network Architecture

[0261] FIG. 39 illustrates an exemplary neural network architecture according to some embodiments. It should be understood that the architecture illustrated in FIG. 39 is provided only for illustrative purposes and the expressive prediction and/or generative models may be trained and configured in a variety of different manners without deviating from this disclosure, as described throughout. The exemplary neural network architecture illustrated includes a speech prosody model, a vocal burst model, and a facial expression model integrated with a language model. In some embodiments, the speech prosody model, a vocal burst model, and a facial expression model are trained using unsupervised learning methods to generate expressive predictions. The measures of nonverbal expressions generated by the speech prosody model, the vocal burst model, and the facial expression model may be integrated into the language model (e.g., a large language model) to, for instance, train the language model to generate responses that reduce the rate of users’ (e.g., patients) negative expressions (e.g., frustration, pain) and increase the rate of positive expressions (e.g., satisfaction, contentment) over variable periods of time.

Continuous Improvement of Models

[0262] According to some embodiments, the models described herein are trained and retrained using unsupervised learning methods. In accordance with receiving permission from a customer/user of the platform, data submitted to the APIs described herein can be used to improve/retrain the models (e.g., facial recognition models, speech prosody models, vocal burst models, language models). Specifically, from the data alone (no additional labels), the models can learn to better represent the interplay of language and expression by training on a next-word-prediction and next-expression-prediction task. The representations the model learns from this task can be used to make better downstream predictions of any outcome (emotions, a customer Net Promoter Score, mental health diagnoses, and more) and to learn to generate words and expressions that promote desired user expressions, such as signs of user satisfaction.

Custom Models

[0263] According to some embodiments, an API is provided that allows users to upload data labels (e.g., customer Net Promoter Scores) along with their files. A custom model may be trained to predict the provided labels by training additional layers on top of embedding s/outputs generated by the models described herein (e.g., facial recognition models, speech prosody models, vocal burst models, language models).

[0264] Although the disclosure and examples have been fully described with reference to the accompanying figures, it is to be noted that various changes and modifications will become apparent to those skilled in the art. Such changes and modifications are to be understood as being included within the scope of the disclosure and examples as defined by the claims.

[0265] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain the principles of the techniques and their practical applications. Others skilled in the art are thereby enabled to best utilize the techniques and various embodiments with various modifications as are suited to the particular use contemplated.

Claims

1. A method for identifying changes in expressions over time in a media content, the method comprising: receiving, from a user, the media content corresponding to one or more individuals; displaying the user interface comprising: a media region that presents the media content; and an expression tracking region; predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content; updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content; and annotating the media region of the user interface based on the identified changes in the one or more expressions over time.

2. The method of any of claim 1, further comprising: receiving, via the expression tracking region, a selection of an expression of the one or more expressions; and displaying, based on the selection of the expression, one or more graphical representations of the selected expression, the one or more graphical representations associated with one or more of facial expressions, vocal bursts, vocal prosodies, and language.

3. The method of any of claims 1-2, further comprising: receiving an indication that the user has initiated playback of the media content; and while playing back the media content, overlaying a representation of the one or more expressions on the media content, wherein the representation is associated with a timestamp of the media content.

4. The method of any of claims 1-3, wherein receiving the media content comprises receiving a live stream of data.

5. The method of any of claims 1-4, further comprising: determining whether the media content includes privacy data associated with the one or more individuals; and applying one or more data transformations to anonymize the privacy data associated with the one or more individuals if the media content is determined to include the privacy data.

6. The method of claim 5, wherein the applying the one or more data transformations is performed prior to receiving the media content.

7. The method of any of claims 1-6, further comprising: estimating an amount of time associated with predicting the one or more expressions of the media content; determining an amount of available processing time associated with the user; and if the amount of available processing time associated with the user exceeds the amount of time associated with predicting the one or more expressions, predicting the one or more expressions.

8. The method of any of claims 1-7, further comprising: estimating an amount of time associated with predicting the one or more expressions of the media content; determining an amount of available processing time associated with the user; and if the amount of available processing time associated with the user is less than the amount of time associated with predicting the one or more expressions, forgoing predicting the one or more expressions.

9. The method of any of claims 1-8, wherein the media content comprises one or more selected from image data, video data, text data, or audio data.

10. The method of any of claims 1-9, wherein the one or more expressions comprise one or more emotions, one or more sentiments, one or more tonalities, one or more toxicity measures, or a combination thereof.

11. The method of claim 10, wherein the one or more emotions comprise one or more of admiration, adoration, aesthetic appreciation, amusement, anger, annoyance, anxiety, approval, awe, awkwardness, boredom, calmness, compulsion, concentration, confusion, connectedness, contemplation, contempt, contentment, craving, curiosity, delusion, depression, determination, disappointment, disapproval, disgust, disorientation, distaste, distress, dizziness, doubt, dread, ecstasy, elation, embarrassment, empathic pain, entrancement, envy, excitement, fear, frustration, gratitude, grief, guilt, happiness, hopelessness, horror, humor, interest, intimacy, irritability, joy, love, mania, melancholy, mystery, nostalgia, obsession, pain, panic, pride, realization, relief, romance, sadness, sarcasm, satisfaction, self-worth, serenity, seriousness, sexual desire, shame, spirituality, surprise (negative), surprise (positive), sympathy, tension, tiredness, trauma, triumph, warmth, and wonder.

12. The method of any of claims 10-11, wherein the one or more sentiments comprise one or more of positivity, negativity, liking, disliking, preference, loyalty, customer satisfaction, and willingness to recommend.

13. The method of any of claims 10-12, wherein the one or more toxicity measures comprise one or more of bigotry, bullying, criminality, harassment, hate speech, inciting violence, insult, intimidation, microaggression, obscenity, profanity, threat, and trolling.

14. The method of any of claims 10-13, wherein the one or more tonalites comprise one or more of sarcasm and politeness.

15. The method of any of claims 1-14, wherein the user interface further comprises an expression visualizer, wherein the expression visualizer comprises a graphical representation of an embedding space.

16. The method of claim 15, wherein the embedding space comprises a static background, wherein the static background comprises a plurality of regions corresponding to a plurality of different expressions that the one or more neural networks are trained to predict.

17. The method of any of claims 15-16, wherein the method further comprises displaying a dynamic overlay on the static background, wherein the dynamic overlay comprises a visualization of a plurality of embeddings representing the predicted one or more expressions associated with the one or more individuals based on the media content.

18. The method of claim 17, wherein an embedding of the plurality of embeddings is displayed at a region of the plurality of regions of the static background based on a predicted expression the embedding represents.

19. The method of any of claims 1-18, further comprising: generating, using a generative machine learning model, at least one of new media data and text data, based on the predicted one or more expressions associated with the one or more individuals based on the media content; and displaying the generated at least one of new media data and text data.

20. A system for producing a user interface based on identified changes in expressions over time in a media content, the system comprising: one or more processors; and memory communicatively coupled to the one or more processors and configured to store instructions that when executed by the one or more processors, cause the system to perform a method comprising: receiving, from a user, the media content corresponding to one or more individuals; displaying the user interface comprising: a media region that presents the media content; and an expression tracking region; predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content; updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content; and annotating the media region of the user interface based on the identified changes in the one or more expressions over time.

21. A non- transitory computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by one or more processors of one or more electronic devices, cause the electronic devices to perform a method comprising: receiving, from a user, the media content corresponding to one or more individuals; displaying a user interface comprising: a media region that presents the media content; and an expression tracking region; predicting, using one or more neural networks, one or more expressions associated with the one or more individuals based on the media content; updating the expression tracking region based on the predicted one or more expressions to identify changes in the one or more expressions over time based on the media content; and annotating the media region of the user interface based on the identified changes in the one or more expressions over time.