WO2024026393A1 - Methods and apparatus for ensemble machine learning models and natural language processing for predicting persona based on input patterns - Google Patents


Info

Publication number
WO2024026393A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
persona
feature
machine learning
respondent
Application number
PCT/US2023/071100
Other languages
French (fr)
Inventor
Josh BLACKWELL
Israel ELLIS
Michael Reed
Pooja Kohli
Original Assignee
Indr, Inc.
Application filed by Indr, Inc. filed Critical Indr, Inc.
Publication of WO2024026393A1 publication Critical patent/WO2024026393A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0637 Strategic management or analysis, e.g. setting a goal or target of an organisation; Planning actions based on goals; Analysis or evaluation of effectiveness of goals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q 30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Development Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Marketing (AREA)
  • Computational Linguistics (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • General Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

An apparatus includes ensemble machine learning and natural language processing for prediction and generation of user personas. The apparatus includes a processor configured to transmit actionable data to a first respondent and receive respondent data from the first respondent, including natural language feature descriptions. The processor is configured to encode each natural language feature description to produce feature data based on a feature type that is associated with a natural language query for that natural language feature description and train, using the feature data, a set of persona machine learning models. The processor is configured to execute one or more trained persona machine learning models to produce a qualitative entity identifier representing a group of archetypal entities associated with a second respondent.

Description

METHODS AND APPARATUS FOR ENSEMBLE MACHINE LEARNING MODELS
AND NATURAL LANGUAGE PROCESSING FOR PREDICTING PERSONA BASED
ON INPUT PATTERNS
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/392,775, filed July 27, 2022 and titled “Methods And Apparatus For Persona Generation Using Ensemble Machine Learning For Cognition Of Crowd-Sourced Data”, which is incorporated herein by reference in its entirety.
FIELD
[0002] The present disclosure generally relates to the field of ensemble machine learning and natural language processing. In particular, the present disclosure is directed to methods and apparatus for ensemble machine learning models and natural language processing for future persona pattern forecasting and prediction based on input patterns.
BACKGROUND
[0003] Some known methods for generating, predicting, and/or forecasting future personas and/or patterns rely on human interaction, which poses several problems in speed, cost, accuracy, and bias. Some known methods relying on human-based research do not scale well and require researchers to identify individuals, schedule interviews, and accumulate sufficient data to derive insights. Moreover, some known methods may use data collected in non-standardized formats, in which human operators are used to analyze such non-standardized data.
[0004] Accordingly, a need exists for machine learning models and natural language processing that can effectively, accurately, and efficiently collect, cleanse, transform, and/or standardize data to predict, forecast, create and/or define personas based on input patterns.
SUMMARY
[0005] In some embodiments, an apparatus includes an ensemble of machine learning models for prediction, forecasting and/or generation of a persona using input patterns. The apparatus includes a processor and a memory operatively coupled to the processor, where the memory stores instructions for execution by the processor. The instructions cause the processor to transmit actionable digital data to a first respondent, where the actionable digital data includes a plurality of natural language queries. The memory further stores instructions to cause the processor to receive respondent data from the first respondent, where the respondent data includes a plurality of natural language feature descriptions. The memory stores instructions to further cause the processor to encode each natural language feature description from the plurality of natural language feature descriptions to produce feature data from a plurality of feature data based on a feature type that is associated with a natural language query from the plurality of natural language queries for that natural language feature description. The memory also stores instructions to cause the processor to train, using the plurality of feature data, each persona machine learning model from a plurality of persona machine learning models to produce a plurality of trained machine learning models. The memory further stores instructions to cause the processor to execute one or more trained persona machine learning models from the plurality of trained persona machine learning models to produce a qualitative entity identifier. The qualitative entity identifier represents a group of archetypal entities associated with a second respondent.
[0006] In some embodiments, a method includes execution of an ensemble of machine learning models for prediction, forecasting and/or generation of a persona using input patterns. The method includes transmitting, by a processor operatively coupled to a memory, actionable digital data to a respondent. The actionable digital data includes a plurality of natural language queries. The method further includes receiving respondent data from the respondent. The respondent data includes a plurality of natural language feature descriptions. The method further includes encoding each natural language feature description from the plurality of natural language feature descriptions to produce feature data from a plurality of feature data based on a feature type that is associated with a natural language query from the plurality of natural language queries for that natural language feature description. The method further includes executing one or more trained persona machine learning models from a plurality of trained persona machine learning models using the plurality of feature data as an input to produce a qualitative entity identifier. The qualitative entity identifier represents a group of archetypal entities associated with the respondent.
BRIEF DESCRIPTION OF THE DRAWINGS
[0007] The drawings show aspects of one or more embodiments. However, it should be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, wherein:
[0008] FIG. 1 is a diagrammatic illustration of a persona generator compute device using ensemble machine learning, according to an embodiment.
[0009] FIG. 2 is a flowchart of a system for persona generation using ensemble machine learning, according to an embodiment.
[0010] FIG. 3A is another flowchart of a system for persona generation simplified into separate system components, according to an embodiment.
[0011] FIG. 3B is a flowchart for a feature collector system component, according to an embodiment.
[0012] FIG. 3C is a flowchart for a clustering engine system component, according to an embodiment.
[0013] FIG. 4 is a screenshot of a webpage illustrating a description of a persona, according to an embodiment.
[0014] FIG. 5 is a screenshot of a webpage illustrating a survey invitation for a respondent based on a challenge statement, according to an embodiment.
[0015] FIGS. 6A-D are screenshots of webpages illustrating context questions for a survey based on a challenge statement, according to an embodiment.
[0016] FIG. 7 is a screenshot of a webpage illustrating a network externality to invite respondents, according to an embodiment.
[0017] FIG. 8 is a screenshot of a webpage illustrating a cluster grouping for generating a persona, according to an embodiment.
[0018] FIG. 9 is a screenshot of a webpage illustrating a dendrogram of the cluster groupings of FIG. 8, according to an embodiment.
[0019] FIG. 10 is a flow diagram of an example method for persona generation using ensemble machine learning, according to an embodiment.
[0020] FIG. 11 is a screenshot of a webpage illustrating an interface in which a user can set parameters for handling missing data, according to an embodiment.
DETAILED DESCRIPTION
[0021] In some implementations, ensemble machine learning models and natural language processing for future persona pattern forecasting and prediction based on input patterns can be used to generate and/or predict personas. User “personas” can be used to represent a target audience or a segment of a customer base. Personas can be archetypical users whose goals and characteristics represent the needs of a broader group of users.
[0022] In some implementations, a persona generator can use an ensemble of machine learning models to cleanse, normalize, and/or standardize data collected from context-specific interview questions from multiple users. For instance, the data collected can include labeled features that are binary (e.g., yes/no), continuous (e.g., 1-100), and/or categorical (e.g., strongly disagree, somewhat disagree, somewhat agree, strongly agree, etc.). Unlike some known machine learning algorithms, the persona generator can process data having different labeled features directly. For instance, machine learning models that are unable to process labeled data can be limited to input variables and/or output variables in a standard format such as, for example, a numerical format. The persona generator can process data that have different ranges and/or formats. For example, one feature can include a value from a numerical range between 0-100 while another feature can include a value from a numerical range of 0-1000. In some cases, the persona generator can rescale the values from variable ranges via normalization such that the values correspond to a value between a common range of, for example, 0-1. In some cases, the persona generator can scale the values from variable ranges via standardization such that the values are centered around a mean with a unit standard deviation.
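To make the rescaling concrete, the following is a minimal sketch (not the implementation described in this disclosure; the helper names and sample values are hypothetical) of min-max normalization into a common 0-1 range and of standardization to zero mean and unit standard deviation:

```python
# Illustrative sketch: rescaling survey features that arrive in different ranges.

def normalize(values):
    """Min-max normalization: map values into the common range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant feature: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization: center values around the mean with unit std deviation."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    std = var ** 0.5 or 1.0            # guard against zero variance
    return [(v - mean) / std for v in values]

# One feature scored 0-100, another 0-1000; after normalization both lie in [0, 1].
satisfaction = [10, 55, 100]           # hypothetical 0-100 feature
budget = [0, 250, 1000]                # hypothetical 0-1000 feature
print(normalize(satisfaction))         # [0.0, 0.5, 1.0]
print(normalize(budget))               # [0.0, 0.25, 1.0]
```

Either transform puts features with very different raw scales on a comparable footing before they are fed to downstream models.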
[0023] In some implementations, the persona generator can process natural language data. For example, Natural Language Processing (NLP) such as, for example, text extraction, key phrase determination, general sentiment analysis, and/or the like, can be used to transform natural language data (e.g., text) to numerical features for training and/or execution of the ensemble of machine learning models to enhance predictive performance. In some cases, the persona generator can perform word vectorization to map words or phrases from the natural language data to corresponding vectors of numerical values that can be used to predict words, word similarities/semantics, and/or user sentiment.
[0024] In some implementations, the persona generator can process missing data (e.g., unanswered interview questions) by imputing the missing data with the mean or median of the feature associated with that missing data. For example, if an interview included a scale-based question that a user left unanswered, the persona generator can impute a calculated mean or median in place of that missing data. In other words, instead of a human operator determining how to interpret unanswered questions or instead of a model indiscriminately processing missing data, the persona generator can efficiently and/or accurately impute data to fill missing data, or omit the missing data if the resulting imputed data (e.g., calculated mean/median) does not fall within a predefined threshold for completeness.
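As an illustration of this imputation strategy, the sketch below (a hypothetical helper, not the disclosed implementation; the completeness threshold and sample answers are assumptions) fills unanswered scale questions with the mean of the answered values and omits a feature whose completeness falls below a threshold:

```python
# Illustrative sketch: mean imputation for unanswered survey questions,
# with a completeness threshold below which the feature is omitted.

def impute_feature(answers, min_completeness=0.5):
    """Replace None (unanswered) entries with the mean of answered entries.

    Returns None when the fraction of answered entries falls below
    `min_completeness`, signalling that the feature should be omitted.
    """
    answered = [a for a in answers if a is not None]
    if len(answered) / len(answers) < min_completeness:
        return None                     # too incomplete: omit this feature
    mean = sum(answered) / len(answered)
    return [a if a is not None else mean for a in answers]

scores = [4, None, 2, 3, None, 3]       # hypothetical 1-5 scale answers
print(impute_feature(scores))           # [4, 3.0, 2, 3, 3.0, 3]
print(impute_feature([None, None, 5]))  # None (only 1/3 answered)
```

A median could be substituted for the mean in the same structure when outliers are a concern.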
[0025] In some known solutions, a researcher can take months to find and interview respondents, and throughput is capped by how many researchers are employed and how long interviews take. Human-based research is also typically conducted through high-priced research consulting engagements that can take years and have significant costs. Moreover, digital transformation consulting engagements tend to have a poor return on investment, with over a 70% failure rate. Human-based research also has no standardized interview questions, so accuracy depends on the skill of the researcher. If the researcher asks ineffective questions, interviews the wrong people, or interviews in an inconsistent way, the results are typically poor. Furthermore, human-based research can incorporate bias into the process of collecting information, because the questions asked as well as the classification techniques used are subjective to the researcher. A computer-based method using an ensemble of machine learning models for persona prediction and/or generation can be inherently faster, can be unbiased, can scale better, can be more accurate, and can be less expensive than known traditional human-based research or algorithmic methods.
[0026] In some embodiments, an apparatus can include a persona generator that can automate an interview process through various platforms and digital surveys and automate persona prediction and/or generation based on input patterns. In some implementations, the apparatus can be or include software that can be integrated with existing communication channels of users and/or organizations (e.g., a website, mobile device, Slack®, Microsoft Teams®, etc.). Instead of using multiple human operators (with varied methods of interview style and questions), the persona generator can use a standardized format for surveys to collect data from users via a network (e.g., the Internet), to identify patterns efficiently and accurately for classification of users. For instance, the persona generator can enable a user to define a context-specific statement (e.g., a "challenge statement") that describes challenges such as, for example, how to improve a process, how to implement new software, how to increase sales, and/or the like. In some cases, instead of a user curating context-specific interview questions, the persona generator can generate context-specific interview questions based on the challenge statement. These questions can be sent to various individuals relevant to the challenge statement, enabling an ensemble of machine learning models of the persona generator to cluster and classify data of the individuals associated with the challenge statement based on their answers to the context-specific interview questions. The persona generator can also enable individuals to invite other individuals, including individuals associated with the challenge statement, to participate and answer the context-specific interview questions. In other words, the persona generator can enable efficient collection of data via identification of users that can provide relevant user data instead of manually searching for interviewees.
In some implementations, the persona generator can enrich data collected from out-of-band resources such as, for example, customer relationship management (CRM) tools, Microsoft Active Directory®, Salesforce®, and/or the like.
[0027] In some implementations, the persona generator can select a predefined number of interview questions (and corresponding answers) to avoid the computational overhead of processing surveys with too many interview questions. For instance, surveys with more interview questions can have higher feature dimensionality than surveys with fewer interview questions (e.g., a longer survey can include a greater combination of different types of questions (binary, continuous, categorical, etc.) while a shorter survey can include a smaller combination). The persona generator can selectively reduce feature dimensionality, via principal component analysis (PCA), a missing value ratio, a high correlation filter, and/or the like, to produce accurate personas. Similarly stated, the persona generator can process surveys of various lengths and question types by selecting/reducing the dimensionality (or complexity) of the surveys to efficiently collect, cleanse, transform, and/or standardize data to generate and/or predict personas.
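The PCA-based reduction mentioned above can be sketched as follows, assuming NumPy and plain PCA via singular value decomposition (the function name and sample matrix are illustrative assumptions, not the disclosed implementation):

```python
# Illustrative sketch: reducing the dimensionality of encoded survey features
# with PCA so that long surveys with many question types stay tractable.
import numpy as np

def pca_reduce(features, n_components):
    """Project feature rows (one per respondent) onto the top principal components."""
    X = np.asarray(features, dtype=float)
    X_centered = X - X.mean(axis=0)                  # PCA requires centered data
    # SVD of the centered matrix yields the principal directions in Vt.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# Six encoded answers per respondent, reduced to two components.
survey_matrix = [[1, 0, 5, 0.2, 1, 3],
                 [0, 1, 2, 0.9, 0, 4],
                 [1, 0, 4, 0.1, 1, 2]]
reduced = pca_reduce(survey_matrix, n_components=2)
print(reduced.shape)                                 # (3, 2)
```

A missing value ratio or high correlation filter would instead drop whole columns before any projection; PCA keeps linear combinations of the original answers.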
[0028] The persona generator can be configured to send context-specific surveys to multiple (e.g., hundreds, thousands, etc.) users, enabling the users to further invite other users relevant to the challenge statement associated with each context-specific survey across a network. The persona generator can be further configured to identify labeled features from the context-specific surveys and/or label features from the context-specific surveys. The persona generator can be further configured to normalize and/or standardize labeled data via machine learning such that an ensemble of machine learning models can process missing data. The persona generator can be further configured to reduce dimensionality of features (e.g., data from surveys) to produce accurate personas for each specific user group and/or context-specific group in substantially real-time. The persona generator can train an ensemble of machine learning models such that the output of each machine learning model can be combined to improve predictive performance for execution of the ensemble of machine learning models.
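One common way to combine the outputs of an ensemble, consistent with the combination step described above but not specified by this disclosure, is soft voting over each model's per-persona probabilities; a minimal sketch with hypothetical model outputs:

```python
# Illustrative sketch: combining several trained persona models' outputs by
# soft voting, i.e. averaging each model's per-persona class probabilities.

def soft_vote(probability_sets):
    """Average per-persona probabilities across models and pick the argmax."""
    n_models = len(probability_sets)
    n_classes = len(probability_sets[0])
    averaged = [sum(p[i] for p in probability_sets) / n_models
                for i in range(n_classes)]
    return max(range(n_classes), key=lambda i: averaged[i]), averaged

# Three models score a respondent against three candidate personas.
model_outputs = [[0.2, 0.5, 0.3],
                 [0.1, 0.7, 0.2],
                 [0.3, 0.4, 0.3]]
persona_index, avg = soft_vote(model_outputs)
print(persona_index)    # 1: the persona with the highest averaged probability
```

Hard (majority) voting over each model's single predicted label is an equally simple alternative when models emit labels rather than probabilities.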
[0029] FIG. 1 is a diagrammatic illustration of a persona generator compute device 100 that uses ensemble machine learning, according to an embodiment. The persona generator compute device 100 can be or include a hardware-based computing device and/or a multimedia device, such as, for example, a computer, a server, a cluster of servers, a desktop, a laptop, a smartphone, and/or the like. The persona generator compute device 100 includes a memory 104 operatively coupled to a processor 108, where the memory 104 stores instructions to be executed by the processor 108. The memory 104 and the processor 108 can communicate with each other, and with other components, via a bus (not shown). The bus can include any of several types of bus structures including a memory bus, a memory controller, a peripheral bus, a local bus, and any combinations thereof, using any of a variety of bus architectures.
[0030] The processor 108 can be or include, for example, a hardware-based integrated circuit (IC), or any other suitable processing device configured to run and/or execute a set of instructions or code. For example, the processor 108 can be a general-purpose processor, a central processing unit (CPU), an accelerated processing unit (APU), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), a programmable logic array (PLA), a complex programmable logic device (CPLD), a programmable logic controller (PLC) and/or the like. In some implementations, the processor 108 can be configured to run any of the methods and/or portions of methods discussed herein.
[0031] The memory 104 can be or include, for example, a random-access memory (RAM), a memory buffer, a hard drive, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), and/or the like. In some instances, the memory 104 can store, for example, one or more software programs and/or code that can include instructions to cause the processor 108 to perform one or more processes, functions, and/or the like. In some implementations, the memory 104 can include extendable storage units that can be added and used incrementally. In some implementations, the memory 104 can be a portable memory (e.g., a flash drive, a portable hard disk, and/or the like) that can be operatively coupled to the processor 108. In some instances, the memory 104 can be remotely operatively coupled with the persona generator compute device 100, where the persona generator compute device 100 can include a remote database device.
[0032] The persona generator compute device 100 can also include a communication interface (not shown). The communication interface can be a hardware component of the persona generator compute device 100 to facilitate data communication between the persona generator compute device 100 and external devices (e.g., a network, a compute device, a server, etc.; not shown). The communication interface can be operatively coupled to and used by the processor 108 and/or the memory 104. In some instances, the communication interface can be used for connecting the persona generator compute device 100 to one or more of a variety of networks and/or one or more compute devices (e.g., user compute devices) used by users such as the respondent and/or changemaker (as described in further detail herein) connected thereto. The communication interface can be or include, for example, a network interface card (NIC) (e.g., a mobile network interface card, a LAN card), a Wi-Fi® module, a Bluetooth® module, an optical communication module, a modem, any combination thereof, and/or any other suitable wired and/or wireless communication interface. The network can include, for example, a wide area network (e.g., the Internet, an enterprise network), a local area network (e.g., a network associated with an office, a building, a campus or other relatively small geographic space), a telephone network, a data network associated with a telephone/voice provider (e.g., a mobile communications provider data and/or voice network), a direct connection between two computing devices, and any combinations thereof.
[0033] The persona generator compute device 100 can include a computer system which can also include a storage device (not shown). Examples of a storage device can include, for example, a hard disk drive, a magnetic disk drive, an optical disc drive in combination with an optical medium, a solid-state memory device, and any combinations thereof. In some implementations, the storage device can be removably interfaced with and/or connected to the persona generator compute device 100 (e.g., via an external port connector; not shown). In some implementations, the storage device can include an associated machine-readable medium which can provide nonvolatile and/or volatile storage of machine-readable instructions, data structures, program modules, and/or other data for the persona generator compute device 100. In some cases, the persona generator compute device 100 can include software that can reside, completely or partially, within the machine-readable medium and/or the processor 108.
[0034] In some cases, a user, such as a respondent, can also provide commands and/or other information to the persona generator compute device 100 via, for example, a user input device (e.g., a user compute device in communication with the persona generator compute device 100) and the input commands can be stored in the memory 104 and/or the storage device (e.g., a removable disk drive, a flash drive, and/or the like).
[0035] The memory 104 of the persona generator compute device 100 can store instructions to cause the processor 108 to transmit a survey(s) 112 to a respondent (e.g., to a compute device of a respondent via a network; not shown) for the respondent to complete. The survey(s) 112 can also be referred to as "interview(s)" or "questionnaire(s)," and the respondent can also be referred to as a "user." The respondent can include a human user, and the survey(s) 112 can be transmitted via the network to a compute device and/or a user interface used by the respondent. The survey(s) 112 can include a set of natural language queries. For example, the natural language queries can include general questions such as "do you work at home or in an office?", questions specific to a customer's domain such as "how much time is spent fetching documents from the corporate printer?", and/or the like. The persona generator compute device 100 can include software that automates the interview process of the respondent(s) through online surveys and/or actionable digital data. In some implementations, the survey(s) 112 can be conducted through one or more communication channels such as a website, a mobile device, Slack®, Microsoft Teams®, etc. The persona generator compute device 100 can create and/or define the survey(s) 112 and/or the questions for the survey(s) 112. In some cases, a changemaker (as described in further detail herein) can also create and/or update the survey(s) 112.
[0036] The memory 104 stores instructions to cause the processor 108 to receive respondent data, where the respondent data includes the results and/or answers to the survey(s) 112 and/or the natural language queries of the survey(s) 112. The memory 104 also stores instructions to cause the processor 108 to train persona machine learning models 116 to predict and/or determine one or more clusters of a plurality of clusters that will define a persona 120 based on the respondent data. The persona 120 can be stored in and/or retrieved from the memory 104. In some implementations, the persona generator compute device 100 can also include a database (e.g., a persona repository) that can store the persona 120 as well as a variety of survey(s) 112, questions, personas, respondent data, and/or the like.
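As an illustration of how clusters of respondents could define personas, the sketch below uses a deliberately small k-means loop (this disclosure's models are not limited to k-means; the feature vectors and naive initialization here are hypothetical):

```python
# Illustrative sketch: grouping respondents' encoded feature vectors into
# clusters, each of which could then back an archetypal persona.

def kmeans(points, k, iters=20):
    """Very small k-means: returns a cluster index per point."""
    centers = [list(p) for p in points[:k]]         # naive init: first k points
    assign = [0] * len(points)
    for _ in range(iters):
        for i, p in enumerate(points):              # assignment step
            assign[i] = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
        for c in range(k):                          # update step
            members = [p for i, p in enumerate(points) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

respondents = [[0.1, 0.2], [0.0, 0.1], [0.9, 1.0], [1.0, 0.8]]
print(kmeans(respondents, k=2))   # [1, 1, 0, 0]: low- vs high-valued respondents
```

A production clustering engine would typically use better initialization and a convergence test, but the assignment/update loop is the core idea.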
[0037] FIG. 2 is a flowchart of a system 200 for persona generation and/or prediction using ensemble machine learning, according to an embodiment. The system 200 includes a persona generator compute device 204. The persona generator compute device 204 can be structurally and/or functionally similar to the persona generator compute device 100 described in FIG. 1. The persona generator compute device 204 and its components can include software (e.g., executing in hardware such as the processor 108 of FIG. 1) to automate an interview process through an online survey. The survey can also be referred to as "actionable digital data." The actionable digital data 208 can include natural language queries 212. The natural language queries 212 can include questions where the answers (e.g., respondent data 232) can be in textual form (e.g., a short description), numerical form (e.g., a value between a range of 1 and 100), and/or a categorical form (e.g., strongly disagree, somewhat disagree, somewhat agree, strongly agree, etc.). For example, the natural language queries can include questions such as "do you work at home or in an office?" or "how much time is spent fetching documents from the corporate printer?" The actionable digital data 208 can be completed by a respondent 224 and/or multiple respondents 224 via a user interface 228 (e.g., of a computer, laptop, mobile device, tablet, and/or the like), where the actionable digital data 208 and the natural language queries 212 are presented to the respondent 224 via the user interface 228 (e.g., of a compute device of the respondent 224).
[0038] The actionable digital data 208 and the natural language queries 212 are based on and/or built around a challenge statement 216. A "challenge statement" can refer to a context used in curating natural language queries and/or in obtaining specific respondent data 232 from respondents (e.g., the respondent 224) in response to the natural language queries 212 for generating personas (e.g., a qualitative entity identifier 264). In some implementations, respondent data 232 from the actionable digital data 208 provided to the respondent 224 can be given within the context of the challenge statement 216 that has been defined by a changemaker 220. A "changemaker," as used in this disclosure, can be, for example, a person, a device (e.g., artificial intelligence), and/or an entity that wants to effect some change with which that person, device, and/or entity is familiar. For example, the changemaker 220 can define a problem and/or the challenge statement 216 and identify and/or invite a group of respondents to participate in responding to the actionable digital data 208 (e.g., participate in a survey). In some cases, the changemaker 220 can have certain privileges and responsibilities within the software of the persona generator compute device 204. The changemaker 220 can include an individual within an organization or company seeking to extract information via the natural language queries 212 (e.g., interview questions) regarding a specific challenge statement such as improving a process, implementing a new software solution, increasing sales, and/or the like. The actionable digital data 208 can be completed by the respondent 224 through one or more communication channels such as a website, Slack®, Microsoft Teams®, and/or the like (e.g., using a compute device and/or the user interface 228 of the respondent 224).
[0039] A memory (not shown in FIG. 2) of the persona generator compute device 204 includes instructions to cause a processor (not shown in FIG. 2) of the persona generator compute device 204 to transmit the actionable digital data 208 to the respondent 224, where the actionable digital data 208 can include, for example, the natural language queries 212. The respondent 224 can complete the actionable digital data 208 and the natural language queries 212 via the user interface 228 and transmit the respondent data 232 to the persona generator compute device 204. The respondent data 232 can include multiple natural language feature descriptions. The respondent data 232 can also be referred to as “natural language feature descriptions.” The natural language feature descriptions can include textual and/or numerical respondent data 232 responsive to the natural language queries 212 of the actionable digital data 208.
[0040] In some implementations, the memory of the persona generator compute device 204 can store instructions to cause the processor of the persona generator compute device 204 to receive additional data such as crowdsourced data 236 from data sources such as directory services or customer relationship management (CRM) systems that specify aspects of the respondent 224 that are helpful with generating personas such as job title, organization, geographic location, team structure, roles, and/or the like.
[0041] The memory of the persona generator compute device 204 stores instructions to cause the processor of the persona generator compute device 204 to receive the respondent data 232 and/or the crowdsourced data 236. In some implementations, the persona generator compute device 204 includes a system component such as a feature collector 240, which is further described in FIG. 3B. The feature collector 240 receives respondent data 232 and/or the crowdsourced data 236 and encodes the respondent data 232 and/or the crowdsourced data 236 to produce feature data 244. Feature data 244 can also be referred to as “feature(s).” Feature data 244 can include any salient data point based on the respondent data 232 to be used for machine learning. The feature data 244 can include multiple numerical values and/or salient data associated with the natural language feature descriptions of the respondent data 232 answered by the respondent 224.
[0042] The memory of the persona generator compute device 204 also stores instructions to cause the processor of the persona generator compute device 204 to train a persona machine learning model(s) 252. The persona machine learning model(s) 252 can also be referred to as a “set of persona machine learning models.” A persona machine learning model from the set of persona machine learning model(s) 252 can also be referred to as “machine learning model.” In some implementations, the persona machine learning model(s) 252 can include, for example, a logistic regression model, a neural network, a feedforward neural network language model, a hierarchical clustering model, a decision tree model, a k-nearest neighbor model, and/or any other suitable machine learning model. In some cases, the memory of the persona generator compute device 204 can store instructions to cause the processor to train the set of persona machine learning model(s) 252 using a persona training subset from a set of persona training subset(s) 260. For instance, the persona machine learning model(s) 252 can include supervised machine learning models, where the persona training subset(s) 260 can include input samples labeled with desired output values to train the persona machine learning model(s) 252 to correctly classify samples that do not occur in the training set. The samples can include respondent data 232 and/or the feature data 244 (e.g., encoded data). The persona machine learning model(s) 252 can serve to classify the respondents 224 to a specific user group making up a qualitative entity identifier 264 (e.g., a persona such as customers, suppliers, consultants, engineers, teachers, etc.). 
The persona training subset(s) 260 can be derived from a persona training set, where the persona training set can include criteria such as number of respondents, length of survey, amount of data, missing data, amount of enriched data, and/or the like, which can be used to form persona training subset(s) 260. In some implementations the persona training subset(s) 260 can be formed by the changemaker 220 based on the challenge statement 216. The persona generator compute device 204 can store data such as the respondent data 232, the feature data 244, the persona training subset(s) 260, and/or the like, in a data storage device such as a persona repository 256. The persona repository 256 can be any suitable memory and can include a local and/or cloud database.
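By way of a non-limiting illustration, the supervised classification described above can be sketched with a minimal k-nearest neighbor model (one of the model types listed for the persona machine learning model(s) 252). The function name, feature vectors, and persona labels below are hypothetical assumptions, not the claimed implementation:

```python
from collections import Counter
import math

def knn_classify(persona_training_subset, sample, k=3):
    """Classify a respondent's encoded feature vector by majority vote
    of the k nearest labeled samples in the training subset."""
    # persona_training_subset: list of (feature_vector, persona_label) pairs
    distances = sorted(
        (math.dist(features, sample), label)
        for features, label in persona_training_subset
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]
```

A sample whose encoded features lie near the labeled “engineer” samples would thus be classified to the “engineer” persona, including samples that do not occur in the training set.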
[0043] In some implementations, the persona generator compute device 204 includes a clustering engine 248 that receives the feature data 244 as an input to execute the persona machine learning models 252. The clustering engine 248 can include a system component of the persona generator compute device 204, which is further described in FIG. 3C. In some implementations, the persona machine learning model(s) 252 can be trained using unsupervised learning.
[0044] The memory of the persona generator compute device 204 can store further instructions to then cause the processor of the persona generator compute device 204 to execute one or more persona machine learning model(s) 252 using the feature data 244 as an input, to produce a qualitative entity identifier 264. A “qualitative entity identifier,” as used in this disclosure, can be a projection of respondent data 232 identifying and/or classifying a respondent (e.g., respondent 224) to an archetypal entity. The qualitative entity identifier 264 can also be referred to as a “persona.” The qualitative entity identifier 264 can be or include a representation (or fictional representation) of a user (e.g., respondent 224), in which that user can be associated with an archetypal entity and/or a group of archetypal entities 268. For example, the archetypal entities 268 can include a group of users who have similar goals and/or characteristics that represent that group of users or the needs of that group of users. In some cases, users can include, for example, customers, baristas, managers, analysts, manual laborers, and/or the like. In some cases, the archetypal entities 268 can include a group of users having similar goals and/or characteristics such as, for example, demographic, behavior, lifestyle and interests, challenges, and/or the like.
[0045] In some implementations, encoding the respondent data 232 can include encoding each natural language feature description from a set of natural language feature descriptions in the respondent data 232. Each natural language feature description can be a response/answer to a natural language query from the set of natural language queries 212. The memory of the persona generator compute device 204 can store instructions to cause the processor to identify and/or correct one or more errors in the respondent data 232 (e.g., natural language feature descriptions, unanswered natural language queries, etc.). In some cases, errors can include missing data (e.g., absent natural language feature descriptions for natural language queries 212). The memory can store instructions to further cause the processor to select one or more feature inputs for the persona machine learning model(s) 252. In some cases, a feature input can be or include selected natural language feature descriptions from the feature data 244 based on the respondent data 232. The selected feature inputs can be used as inputs for one or more persona machine learning model(s) 252. In some cases, the processor can select natural language feature descriptions (and associated natural language queries) or not select missing natural language feature descriptions (and associated natural language queries) when training and/or executing the persona machine learning model(s) 252.
[0046] The memory can store instructions to further cause the processor to transform one or more selected feature inputs for the persona machine learning model(s) 252. Transforming can include normalizing and/or standardizing as further described in FIG. 3B. Transforming the feature inputs allows for standardizing different types of natural language feature descriptions for efficient training and/or executing of the persona machine learning model(s) 252. The memory can store instructions to further cause the processor to derive new feature inputs based on crowdsourced data 236 (e.g., external data from external data systems) received from an external respondent. The external respondent can be or include another respondent invited by the respondent 224 to participate in submitting the crowdsourced data 236 in response to receiving the actionable digital data 208 via the invite. The feature collector 240 can be configured to produce the feature data 244 based on receiving both the crowdsourced data 236 and the respondent data 232.
[0047] The memory can store instructions to further cause the processor to generate a compact feature projection (not shown in FIG. 2), and encode the compact feature projection into the feature data 244. In some cases, the actionable digital data 208 can be configured, via, for example, the changemaker 220, to include additional natural language queries such that the produced feature data 244, in response to the respondent data 232 having more natural language feature descriptions from the additional natural language queries 212, can result in an increased feature dimension, which the persona machine learning models 252 are configured to process and/or interpret. In some cases, a higher feature dimension (e.g., more robust respondent data 232) does not necessarily produce better results, and/or not every natural language description can be encoded and/or transformed. As such, the processor can be configured to reduce the number of feature dimensions (e.g., reduce the size of respondent data 232, select a prioritized number of natural language descriptions in the respondent data 232, etc.) to produce the compact feature projection, which can be a reduced version of the feature data 244 such that the persona machine learning model(s) 252 can produce better and/or more accurate qualitative entity identifiers. In some cases, the compact feature projection can be generated (based on the additional natural language queries) via principal component analysis (PCA), missing value ratio, a high correlation filter, and/or the like.
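As a non-limiting sketch, a compact feature projection of the kind described above can be produced with principal component analysis. The sketch below assumes NumPy and a row-per-respondent feature matrix; the function name is hypothetical:

```python
import numpy as np

def compact_feature_projection(feature_matrix, n_components=2):
    """Project high-dimensional feature data onto its top principal
    components, producing a reduced (compact) feature projection."""
    X = np.asarray(feature_matrix, dtype=float)
    X_centered = X - X.mean(axis=0)              # center each feature dimension
    covariance = np.cov(X_centered, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)  # ascending order
    # Keep the eigenvectors with the largest eigenvalues (most variance).
    top = eigenvectors[:, np.argsort(eigenvalues)[::-1][:n_components]]
    return X_centered @ top
```

Each respondent's many encoded answers are thereby compressed into a handful of dimensions before clustering.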
[0048] FIG. 3A is a flowchart of a system 300 for persona generation, according to an embodiment. The system 300 can be structurally and/or functionally similar to the system 200 described in FIG. 2. The system 300 can include a feature collector 304, a clustering engine 308, a user interface 316, and a persona repository 312. The feature collector 304 can also be referred to as a “feature collector component” and/or a “data collector.” The feature collector 304 can include any hardware/software component (e.g., executing in hardware such as processor 108 of FIG. 1) or step for collecting and transforming respondent data. The feature collector 304 of the system 300 is further described in FIG. 3B. In some cases, the feature collector 304 can collect answers (e.g., respondent data) from the respondent(s) in the context of a specific challenge statement. In some implementations, answers from the surveys provided to the respondent are given within the context of the challenge statement that has been defined (e.g., by an individual within an organization). The challenge statement can include a context of a challenge such as improving a process, implementing a new software solution, increasing sales, and/or the like.
[0049] Within the system 300, an entity that defines the challenge statement can include a changemaker. The changemaker can refer to an authoritative figure with certain privileges and/or responsibilities within the system 300. In some cases, the changemaker can incorporate bias through the challenge statement. In a non-limiting example, the changemaker can create and/or define a challenge statement such as “I want to improve our company’s document security when my employees are working from home.” In another example, the changemaker and/or respondent can invite other individuals who are relevant to and/or have a relationship to the challenge statement to participate. For instance, in some implementations, such other individuals might also experience the challenge statement, are part of an organization, may know the process that underlies the challenge statement, and/or may work in a part of an organization that supports the challenge statement. In such implementations, this is so, at least in part, to enable the system 300 to engage with respondents that are relevant to the challenge statement. By excluding nonrelevant users in some implementations, the system 300 can avoid receiving data from the nonrelevant users that have no relation with the challenge statement and that may negatively affect the classification of respondent groups with such nonrelevant data. In some implementations, for example, the respondents can include individuals within an organization and the individuals who are relevant to the challenge statement can include, for example, coworkers, partners, suppliers, and/or the like. In some implementations, the challenge statement can be open ended. In other implementations, any other group and/or classification of users can be identified and/or invited. In still other implementations, the invited individuals are not limited and/or restricted.
[0050] In some implementations, the questions in the survey(s) can be general. The system 300 can generate questions using, for example, a decision tree model and/or algorithm. The decision tree model and/or algorithm can be trained based on criteria such as, for example, number of respondents, length of survey, amount of textual data, amount of enriched data, and/or the like, and can potentially change between different kinds of challenge statements. In an example, if a respondent specifies that they are part of a sales organization (or the system 300 has enriched data about that individual that indicates they are part of the sales organization), additional questions based on the challenge statement can be incorporated into the survey (and irrelevant questions removed). In some implementations, the survey(s) given to respondents can include, for example, three categories of questions: (1) behavioral, (2) organizational, and (3) context or domain specific. The decision tree model can be generated where a first question of the survey(s) is the vertex and/or node of the decision tree model, a link to a following question is an edge of the vertex and/or node, and the following linked question is a child of the vertex and/or node. In some implementations, the survey(s) can be further customized easily with context specific questions that can be injected at certain points in the survey(s) or upon the discretion of the changemaker.
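The vertex/edge structure described above can be sketched, in a non-limiting way, as a mapping from each question (node) to its answer-dependent follow-up questions (children). The questions and branching below are hypothetical:

```python
# Hypothetical survey tree: each key is a question (node); each answer is an
# edge leading to the follow-up question (child of that node).
survey_tree = {
    "Are you part of the sales organization?": {
        "yes": "How many deals do you close per month?",
        "no": "Which team do you work on?",
    },
    "How many deals do you close per month?": {},  # leaf: no follow-up
    "Which team do you work on?": {},              # leaf: no follow-up
}

def next_question(tree, question, answer):
    """Return the follow-up question for a given answer, or None if the
    branch of the survey ends here."""
    return tree.get(question, {}).get(answer)
```

Context-specific questions can be injected by attaching new children to existing nodes of such a structure.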
[0051] The respondent data collected in these survey(s) can be used to form the basis for an ensemble of machine learning models within the clustering engine 308 (described in further detail with respect to the clustering engine 308 of FIG. 3C). The clustering engine 308 can also be referred to as a “clustering engine component.” The clustering engine 308 can be hardware and/or software executing in hardware (e.g., such as processor 108 of FIG. 1). In some cases, the respondent data can include textual and/or numerical answers. For example, the questions can include “how many times a week do you work from home?” or “when was the last time you experienced this?”
[0052] In some implementations, the system 300 can include network externality capabilities which can be used for respondent(s) to invite other respondents that might also be familiar with the challenge statement (e.g., collecting crowd-sourced data from external sources such as, for example, invitees).
[0053] In some implementations, the system 300 can include a user interface 316 (e.g., presented on a display of a user device). The user interface 316 can be structurally and/or functionally similar to the user interface 228 in FIG. 2. In some cases, the user interface 316 can be used by the changemaker to make further refinements to the output of the clustering engine (where the output can include groupings), and/or to increase or decrease the number of clusters to generate the desired number of personas. In some cases, the changemaker can add additional descriptions and/or labels to a group.
[0054] The system 300 also includes a persona repository 312. The clustering engine 308 and/or the model executor and scorer 356 (see FIG. 3C) can generate a persona, where the persona can be similar to the qualitative entity identifier 264 in FIG. 2. In some implementations, the clustering engine 308 can produce a cluster of archetypal users forming a persona. The persona can be stored in the persona repository 312. In some implementations, the persona can be generated and/or retrieved from the persona repository 312.
[0055] FIG. 3B is a flowchart for a feature collector 304 system component, according to an embodiment. The feature collector 304 can be structurally and/or functionally similar to the feature collector 240 of the system 200 in FIG. 2. The respondent data acquired through the survey generates “features” for the ensemble machine learning models. A “feature” can refer to any salient data point based on the answers for the survey(s). Features can also be referred to as “feature data.” In some implementations, the feature collector 304 can include a process including data cleansing. Data cleansing can include identifying and correcting mistakes or errors in the respondent data. For instance, the respondent data can include typos, incorrect and/or inadequate answers, and/or the like. The mistakes or errors can also include unanswered questions.

[0056] Feature selection can include identifying input variables that are most relevant to the challenge statement. For instance, the challenge statement can focus on improving security, where the feature selection identifies input variables from the cleansed data to be transformed and/or encoded for machine learning. Data transformation can include changing the scale or distribution of those input variables. For instance, the data transformation can include using normalization, standardization, and/or word vectorization to map words, phrases, texts, and/or the like to a corresponding vector and/or number. Feature engineering can include deriving new variables from available data obtained from the respondents 320-332 and/or other respondents. New variables can include question answers that are new, unknown, and/or unexpected that can also be cleansed and transformed to salient data. Dimensionality reduction can include creating compact projections of the respondent data. Dimensionality reduction can include scaling the cleansed data using normalization and/or standardization.
[0057] In some implementations, for example, normalization can be used to rescale the respondent data from an original numerical range so that the values are within a new range of, for example, 0 and 1 or another suitable range (e.g., 0 and 100). For instance, the respondent data can include a variety of response scales such as ranges from 1 to 7, ranges from 0 to 100, and/or binaries such as yes (1) and no (0). In some implementations, standardization can include rescaling the distribution of values in the respondent data so that the mean of observed values is 0 and the standard deviation is 1. Feature elimination can include removing features from the respondent data deemed to have the least importance using various methods such as, for example, finding p-value and observing a correlation matrix (e.g., removing data that is not statistically related and/or significant). In some implementations, feature extraction can include using, for example, principal component analysis (PCA), which combines features in the respondent data in a specific way that excludes features of least importance while retaining as much data as possible.
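As a non-limiting sketch, the normalization and standardization steps described above can be expressed as follows (the function names are illustrative, and the values are assumed not to be all equal):

```python
def normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values from their original numerical range so they fall
    within [new_min, new_max] (e.g., 0 and 1, or 0 and 100)."""
    lo, hi = min(values), max(values)
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo) for v in values]

def standardize(values):
    """Rescale the distribution of values so the mean of observed values
    is 0 and the standard deviation is 1."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```

A 1-to-7 answer scale and a 0-to-100 answer scale, once normalized, can thus be compared on the same footing.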
[0058] In some implementations, as respondents 320-332 complete the survey, their data is collected into the feature collector 304. The respondent data can be enriched with additional metadata features about the respondents 320-332 retrieved from the data enricher 336. The additional data of the data enricher 336 can be received from out-of-band resources such as, for example, CRM systems, Microsoft Active Directory®, Salesforce®, and/or the like. In some cases, the enriched data can include data from other respondents invited by the respondents 320-332. The feature collector 304 can also transform and/or normalize the respondent data to be compatible with the ensemble of machine learning models in the clustering engine 308 by transforming and/or encoding the respondent data into feature data. The clustering engine 308 is further described with respect to FIG. 3C. The feature collector 304 can process respondent data into feature data using Natural Language Processing (NLP) and/or word vectorization, where the processed data is scaled using normalization, in which processed values are shifted and rescaled so that the values end up ranging between 0 and 1, and/or standardization, where the processed values are centered around the mean with a unit standard deviation. In some implementations, the feature collector 304 can use NLP and/or word vectorization using keyword detection for classification, summarization, and/or sentiment analysis of respondent data. In some implementations, features and/or feature types can be binary, continuous, or categorical in nature. Binary features can be in one of two states. Continuous features can represent a value in a range and can, for example, be numeric in nature. Categorical features can represent a value within a discrete set of possible labels. Some machine learning models use numeric input variables and output variables. Accordingly, features can be encoded to numbers before use.
For example, respondent data including an answer to the question “how many days do you work?” can include “seven days,” where the encoded data can include a value of 7 within a range of 0 to 7.
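Encoding features to numbers, as described above, can be sketched for the binary, categorical, and continuous cases. The helper names, agreement scale, and number-word table below are hypothetical assumptions for illustration:

```python
def encode_binary(answer):
    """Map a two-state (e.g., yes/no) answer to 1 or 0."""
    return 1 if answer.strip().lower() in {"yes", "true", "agree"} else 0

# Hypothetical ordinal scale matching the categorical answers in the survey.
AGREEMENT_SCALE = ["strongly disagree", "somewhat disagree",
                   "somewhat agree", "strongly agree"]

def encode_categorical(answer, scale=AGREEMENT_SCALE):
    """Map a categorical answer to its ordinal position on the scale."""
    return scale.index(answer.strip().lower())

# Hypothetical table for answers written as number words.
WORDS_TO_NUMBERS = {"zero": 0, "one": 1, "two": 2, "three": 3,
                    "four": 4, "five": 5, "six": 6, "seven": 7}

def encode_continuous(answer):
    """Map an answer such as '7' or 'seven days' to a numeric value."""
    for token in answer.strip().lower().split():
        if token in WORDS_TO_NUMBERS:
            return WORDS_TO_NUMBERS[token]
        if token.replace(".", "", 1).isdigit():
            return float(token)
    raise ValueError(f"no numeric value in {answer!r}")
```

The answer “seven days,” for example, encodes to the value 7 as in the paragraph above.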
[0059] In some implementations, some respondent data can exist in different scales. Accordingly, such respondent data can be adjusted and/or normalized. Without normalization, a variable that ranges between 0 and 1000 can outweigh a variable that ranges between 0 and 10 and can skew the model(s). Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between a common range (e.g., 0 and 1). In some cases, the encoded respondent data can be standardized where the values can be centered around the mean with a unit standard deviation. The feature collector 304 can also process textual data of the respondent data using Natural Language Processing (NLP) for text extraction, key phrase determination, and general sentiment analysis. These values are then transformed as potential numerical features to an ensemble of persona machine learning models (e.g., as described herein). The feature collector 304 can also implement word vectorization to map words or phrases from vocabulary to a corresponding vector of real numbers, which are then used to find word predictions, word similarities/semantics, and determine respondent sentiment. The real numbers can be the salient data of the feature data used as inputs for the clustering engine 308 of the system 300.
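A minimal, non-limiting sketch of word vectorization maps a phrase to a bag-of-words vector over a fixed vocabulary and compares vectors with cosine similarity; production systems would typically use learned embeddings instead, and the vocabulary below is hypothetical:

```python
from collections import Counter
import math

def bag_of_words(text, vocabulary):
    """Map a phrase to a vector of real numbers: word counts over a
    fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(u, v):
    """Similarity between two word vectors, usable to compare the
    semantics of respondent answers."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Such vectors of real numbers can serve as the salient data fed to the clustering engine 308.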
[0060] In some implementations, the respondent data can include missing data. For instance, a respondent can choose to not answer a question or leave a question without an answer. The respondent can also provide an incomplete and/or incompatible answer. Based on the feature associated with the questions from which missing data originates, the feature collector 304 can impute the respondent data and/or the missing data using the mean and/or median of that feature (e.g., from other users). In some cases, for example, the feature collector 304 can omit the specific entry of the questions with missing data if the feature data for those questions do not meet a specific threshold value for completeness (e.g., remove specific questions if a predefined number and/or percentage of respondents skip or do not answer such questions). In some cases, the feature collector 304 can impute the missing data to a value based on, for example, the mean and/or median for that feature (from data from other users) based on a predetermined threshold value.
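The missing-data handling described above can be sketched as follows; the function name and the 70% completeness threshold are illustrative assumptions:

```python
def impute_missing(answers, completeness_threshold=0.7, use_median=False):
    """Fill None entries for one question (feature) with the mean or median
    of the answered values, or drop the question entirely (return None)
    if too few respondents answered it."""
    present = [a for a in answers if a is not None]
    if len(present) / len(answers) < completeness_threshold:
        return None  # omit the question: completeness below threshold
    if use_median:
        ordered = sorted(present)
        mid = len(ordered) // 2
        fill = (ordered[mid] if len(ordered) % 2
                else (ordered[mid - 1] + ordered[mid]) / 2)
    else:
        fill = sum(present) / len(present)  # mean of other users' answers
    return [fill if a is None else a for a in answers]
```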
[0061] FIG. 11 is a screenshot of a webpage 1100 illustrating an interface in which a user can set parameters for handling missing data, according to an embodiment. In some implementations, a user (e.g., a changemaker) can be presented the webpage 1100 to define parameters of responses to a survey. Specifically, at 1105, the webpage 1100 allows a changemaker to set a number of personas (e.g., clusters or groups) to define. This allows a changemaker to customize the number of groups of respondents in which to classify the respondents. Moreover, at 1110, the changemaker can set a threshold for when to remove a question and its answers from the data. For example, if less than the threshold number and/or percentage of respondents answer a specific question, that question can be removed and not used to define personas. Further, at 1115, the changemaker can identify how to handle skipped questions in a survey. For example, the changemaker can set missing values to a predefined number (e.g., a negative number like -1 or -100) or use the median and/or mean of other answers for that question. In some implementations, the changemaker can also remove a respondent from participation in the survey if that respondent does not meet a threshold for survey completeness (e.g., that respondent did not complete a sufficient number of questions in the survey). For example, a respondent can be removed from participation if the respondent does not answer between 70-80% of the questions. These changes can alter the clusters and/or groups. This allows the changemaker to customize the persona generation.
[0062] In some implementations, the feature collector 304 can handle an increasing number of questions from surveys. For instance, the number of survey questions can increase the level of feature dimension that the machine learning model of the ensemble of machine learning models processes and/or interprets. More features are not necessarily better and not every answer will produce value. Therefore, the feature collector 304 can implement feature selection and/or feature reduction to reduce the number of feature dimensions to those that produce the best results. In some implementations, this can be done using techniques such as principal component analysis (PCA), missing value ratio, and high correlation filter.

[0063] In some implementations, the respondent data can be enriched with additional data sources. The data enricher 336 can obtain additional data to enrich the feature collector 304. The data enricher 336 can include additional data sources such as directory services or customer relationship management (CRM) systems that specify aspects of the respondent that are helpful with persona creation such as job title, organization, geographic location, team structure, roles, and/or the like. These can then be incorporated as features to the ensemble of machine learning models.
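Of the reduction techniques named above, the high correlation filter can be sketched in a non-limiting way: compute pairwise Pearson correlations between feature columns and drop each column that is nearly redundant with one already kept. The function names and threshold are illustrative:

```python
def pearson(x, y):
    """Pearson correlation coefficient between two feature columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def high_correlation_filter(columns, threshold=0.9):
    """Keep each (name, column) pair only if it is not highly correlated
    with a column that has already been kept."""
    kept = []
    for name, col in columns:
        if all(abs(pearson(col, kept_col)) < threshold for _, kept_col in kept):
            kept.append((name, col))
    return [name for name, _ in kept]
```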
[0064] FIG. 3C is a flowchart for a clustering engine 308 of the system 300, according to an embodiment. The clustering engine 308 includes one or more sets of persona machine learning models, such as a set of persona machine learning models A 344, a set of persona machine learning models B 348, and/or a set of persona machine learning models C 352. The persona machine learning models 344-352 can be trained persona machine learning models. The model selector 356 can select the specific machine learning models (e.g., a set of persona machine learning models 344-352) to use for the scoring and generating of the persona repository 312. The clustering engine 308 can be structurally and/or functionally similar to the clustering engine 248 of the system 200 as described in FIG. 2. The clustering engine 308 can receive the feature data from the feature collector 304 for an ensemble of machine learning models including multiple machine learning models.
[0065] In some cases, ensemble learning is an approach to machine learning that seeks better predictive performance by combining the predictions from multiple machine learning models. A machine learning model can include an algorithm such as, for example, a classification, regression, or a clustering algorithm. Some examples include logistic regression, neural network, a feedforward neural network language model, hierarchical clustering, decision tree, and k-nearest neighbor. In some implementations, some machine learning models can be trained using supervised learning, in which a training set of input samples labeled with the desired output values conditions the model to correctly classify samples that do not occur in the training set, or each machine learning model can be trained using unsupervised learning, in which an algorithm identifies structure, features and/or classifications in unlabeled data.
[0066] The clustering engine 308 can include software (e.g., stored in memory 104 and executed by processor 108 of FIG. 1) that uses the ensemble of persona machine learning models 344-352 (when determining the clusters that make up a persona). The clustering engine 308 can also include a model selector 356, where the model selector 356 can determine the persona machine learning models (e.g., persona machine learning models 344-352) selected based on criteria such as, for example, number of respondents, length of survey, amount of textual data, amount of enriched data, and/or the like, and can potentially change between different kinds of challenge statements. For instance, a first persona machine learning model can be used for generating personas from longer surveys, a second persona machine learning model can be used for generating personas from surveys that have a majority of a specific format or type of information (e.g., textual data, short descriptions, sentences, etc.), a third persona machine learning model can be used for generating personas associated with a specific industry and/or role, and/or the like. A persona machine learning model that is assigned to a specific criterion can allow for efficient and accurate training by using training data (e.g., surveys and answers) that are collectively consistent in dimension, size, and/or amount. The outputs from each persona machine learning model can be combined to improve accuracy of grouping users and/or generating personas for users, reduce overfitting, improve model diversity and robustness, and/or the like. The combined outputs can also be used as training data to further train the ensemble of persona machine learning models. The clustering engine 308 can also include a model executor and scorer 356 that can execute the ensemble of persona machine learning models 344-352 based on the persona machine learning models selected by the model selector 356.
The model executor and scorer 356 can generate one or more clusters from the respondent data and/or the feature data.

[0067] In some implementations, models to be used can be selected based on the task and/or the type of input data (e.g., certain models perform better on industry-specific natural language, while others perform better on yes/no answers). In some implementations, the surveys can include numeric responses (e.g., slider scale answers), audio responses, free-text responses, and/or the like. In some implementations, K-means clustering can be used to define a predefined number of clusters. In such implementations, the K-means clustering can be used to define an initial set of clusters and/or groups of users based on numeric responses.
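The K-means step described above can be sketched in plain Python; the seed, iteration count, and data layout below are illustrative assumptions, not the claimed implementation:

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Group numeric survey responses (points) into k clusters by
    repeatedly assigning each point to its nearest centroid and
    recomputing each centroid as the mean of its members."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct starting points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # recompute centroid as the mean of its members
                centroids[i] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters
```

Each resulting cluster can serve as an initial group of users to be refined into a persona.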
[0068] Once an initial set of groups is identified, one or more transformer natural language processing models (e.g., large language models (LLMs)) can be executed on the free-text responses and/or any follow-up questions. This can then be used to refine the initial set of clusters and/or groups and define a persona for each user.
[0069] In some implementations, different models can be used for different purposes and/or to analyze different data. For example, a first model can be used for key phrase detection (e.g., to highlight and/or identify interesting things users have said about a product or service). For another example, a second model can be used for named entity recognition (NER) to, for example, identify the tools, products and/or services users mention as part of their job. For yet another example, a third model can be used for sentiment analysis to identify positive and/or negative aspects of a user’s experience with a product, service and/or job. For still another example, generative AI and/or LLMs can be used to summarize, rephrase and/or normalize sentences and/or inputs by the users. For yet another example, another model can be used to generate information (e.g., a metric) about which questions were the most important to and/or distinguishing in a given cluster and/or group.
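As a toy illustration of the sentiment-analysis role described above — a real system would use a trained model, and the word lists here are hypothetical stand-ins:

```python
# Toy lexicon-based sentiment scorer standing in for a trained sentiment model.
# The word lists are illustrative only, not part of the claimed system.
POSITIVE = {"love", "great", "fast", "easy"}
NEGATIVE = {"hate", "slow", "clunky", "confusing"}

def sentiment(text):
    # Normalize tokens crudely and count lexicon hits.
    words = {w.strip(".,!?").lower() for w in text.split()}
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("I love how fast the new tool is"))   # positive
print(sentiment("The old CRM felt slow and clunky"))  # negative
```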
[0070] In some implementations, the model executor and scorer 356 can include a machine learning model that combines one or more outputs of the persona machine learning models 344-352 to define a persona. In some implementations, for example, multiple persona machine learning models can be used to generate a potential persona. The machine learning model in the model executor and scorer 356 can, for example, use a voting scheme in which the outputs of multiple persona machine learning models 344-352 vote on a potential persona. The final persona can be identified as the persona most often identified by the multiple persona machine learning models 344-352. In some cases, the model executor and scorer 356 can also form logical groups of users that ultimately form a persona. The groups can be presented through a visual interface associated with a particular challenge statement. The clusters can be formed using archetypal groups, where the archetypal groups can be previously generated and/or classified, to form the persona.
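The voting scheme can be sketched as a simple majority vote over the personas predicted by the ensemble members (the model outputs and persona labels below are hypothetical):

```python
from collections import Counter

# Each ensemble member votes for the persona it predicts for a respondent;
# the most frequently predicted persona wins.
def majority_persona(votes):
    # Ties are broken by first-seen order, mirroring Counter.most_common.
    return Counter(votes).most_common(1)[0][0]

votes = ["Deal Maker", "Analyst", "Deal Maker"]  # outputs of three models
print(majority_persona(votes))  # Deal Maker
```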
[0071] In some implementations, the outputs from the persona machine learning models 344-352 can be programmatically combined such that the model executor and scorer 356 can send back a list of clusters to the calling application (e.g., to a compute device of the changemaker). That list can identify which respondents were grouped in which clusters, a generated summary description of that cluster, as well as a list of questions that were deemed the most important for that cluster.
[0072] In some implementations, the calling application (e.g., on a compute device of the changemaker) can also receive text data for each cluster that has been marked up with span references that identify key words or sentences so that they can be rendered in a special way within the application. For example, the span references can indicate that the fourth word in the tenth sentence was a specific CRM tool and that this CRM tool had a negative sentiment.
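One possible representation of such span references is sketched below; the field names and the sample sentence are hypothetical, not those of the claimed system:

```python
# Hypothetical span-reference record for cluster text mark-up: character
# offsets identify a key phrase, plus the entity type and sentiment attached
# to it, so a client application can render the phrase specially.
text = "We switched to AcmeCRM last year and adoption has been painful."
spans = [
    {"start": 15, "end": 22, "label": "TOOL", "sentiment": "negative"},
]

for s in spans:
    print(text[s["start"]:s["end"]], s["label"], s["sentiment"])
```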
[0073] In some cases, the changemaker can make further refinements to the groupings, and can increase or decrease the number of clusters to generate the desired number of personas. The changemaker can also add additional description and labels to the group. For instance, the changemaker can be presented a user interface that allows the changemaker to refine the groups of archetypal entities of a persona associated with a respondent.
[0074] In some implementations, the persona and/or cluster of groups of users can be presented visually using a dendrogram to identify how the respondents were correlated based on their data, which is further described with respect to FIG. 9.
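As an illustrative sketch of the agglomerative merging that a dendrogram visualizes — repeatedly joining the two closest clusters and recording each merge — using hypothetical one-dimensional respondent scores:

```python
# Toy single-linkage agglomerative trace underlying a dendrogram: at each
# step, merge the two clusters whose closest members are nearest, and record
# the merged cluster together with the merge distance.
def merge_trace(points):
    clusters = [[p] for p in points]
    trace = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        trace.append((merged, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return trace

# Low merge distances join first; the final, largest distance joins everything.
print(merge_trace([1, 2, 10, 11, 50]))
```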
[0075] FIG. 4 is a screenshot of a webpage illustrating a description of a persona, according to an embodiment. The screenshot displays a persona of a “Deal Maker.” The persona can also include a projection of an archetypal user (e.g., a Deal Maker).
[0076] FIG. 5 is a screenshot of a webpage illustrating a survey invitation for a respondent based on a challenge statement, according to an embodiment. The survey invitation can be provided to respondents. In some cases, the survey invitation can be further shared by the respondents to other respondents.
[0077] FIGS. 6A-D are screenshots of a webpage illustrating context questions for a survey based on a challenge statement, according to an embodiment. The screenshot displays questions based around the challenge statement. The challenge statement can serve to influence the way the respondents answer the questions of the survey.
[0078] FIG. 7 is a screenshot of a webpage allowing a user to invite respondents, according to an embodiment. The screenshot includes an extension enabling the respondent to invite other respondents that might be familiar with the given challenge statement through software using network externality.
[0079] FIG. 8 is a screenshot of a webpage illustrating a cluster grouping for generating a persona, according to an embodiment. The cluster grouping can include multiple archetypal users and/or groups. In some cases, each archetypal user and/or group can represent the persona of the respondent.
[0080] FIG. 9 is a screenshot of a webpage illustrating a dendrogram of the cluster groupings of FIG. 8, according to an embodiment. The dendrogram illustrates the cluster of groups forming the persona. In some implementations, the dendrogram can be generated by a processor based on a persona that identifies a correlation between the respondent data and the feature data of FIG. 2.
[0081] FIG. 10 is a flow diagram of an example method 1000 for persona generation using ensemble machine learning, according to an embodiment. In some implementations, the method 1000 can be performed by a processor of a compute device (e.g., processor 108 of persona generator compute device 100 of FIG. 1) that executes instructions from a memory (e.g., memory 104 of FIG. 1) operatively coupled to the processor. At 1005, the method 1000 includes transmitting actionable data to a first respondent. The actionable data (e.g., survey, questionnaire, digital form, etc.) can include a set of natural language queries (e.g., interview questions, context-specific interview questions, etc.). The first respondent can be or include a user and/or individual at an organization that can provide respondent data in response to the actionable data that can be used to train machine learning models and/or algorithms to generate user personas.
[0082] At 1010, the method 1000 includes receiving respondent data from the first respondent. The respondent data (e.g., user data) can include a set of natural language feature descriptions (e.g., answers to interview questions in a survey). In some cases, the respondent data can include answers in various formats that are binary (e.g., yes/no), continuous (e.g., 1-100), and/or categorical (e.g., strongly disagree, somewhat disagree, somewhat agree, strongly agree, etc.). [0083] At 1015, the method 1000 includes encoding each natural language feature description from the respondent data to produce feature data based on a feature type. The feature type can be, for example, a format with which an answer to an interview question is associated (e.g., binary, continuous, categorical, etc.). In some cases, certain questions in the survey can be predefined to conform to a specific format such that answers submitted to those questions also conform to that specific format.
[0084] At 1020, the method 1000 includes training, using the feature data, each machine learning model from a set of machine learning models to produce trained persona machine learning models. In some cases, each machine learning model can be trained using the same user data. In such instances, outputs of the machine learning models can be combined to improve results such as, for example, predictability of a user being classified into an archetypal group for persona generation. In some cases, a machine learning model can be associated with a specific archetypal group, industry or persona such that user data from a user associated with the archetypal group, industry and/or persona for that machine learning model is used to train that machine learning model. In some cases, the method 1000 can include selecting a specific machine learning model from the set of machine learning models based on criteria such as, for example, number of respondents, length of survey, amount of textual data, amount of enriched data, and/or the like. For example, a first machine learning model can be used for generating personas from longer surveys, a second machine learning model can be used for generating personas from surveys that have a majority of a specific format or type of information (e.g., textual data, short descriptions, sentences, etc.), a third machine learning model can be used for generating personas associated with a specific industry and/or role, and/or the like. A machine learning model that is assigned to specific criteria can allow for efficient and accurate training by using training data (e.g., surveys and answers) that are collectively consistent in dimension, size, and/or amount.
[0085] At 1025, the method 1000 includes executing one or more trained machine learning models to produce a qualitative entity identifier (e.g., a persona) that represents archetypal entities associated with a second respondent. The persona can represent a group of archetypal entities in which subsequent users can be categorized. For instance, a second user can receive a survey and provide answers to questions in the survey such that the second user is accurately grouped in the correct archetypal group. The second user can be presented a persona reflecting the second user. In some implementations, surveys can be context-specific via an administrative user (e.g., a changemaker) that defines a specific challenge statement about, for example, improving a process, implementing a new software solution, increasing sales, etc. Context-specific surveys can enable the processor to generate personas for more nuanced groups of users.
[0086] It is to be noted that any one or more of the aspects and embodiments described herein can be conveniently implemented using one or more machines (e.g., one or more compute devices that are utilized as a user compute device for an electronic document, one or more server devices, such as a document server, etc.) programmed according to the teachings of the present specification. Aspects and implementations discussed above employing software and/or software modules can also include appropriate hardware for assisting in the implementation of the machine executable instructions of the software and/or software module.
[0087] Such software can be a computer program product that employs a machine-readable storage medium. A machine-readable storage medium can be any medium that is capable of storing and/or encoding a sequence of instructions for execution by a machine and that causes the machine to perform any one of the methodologies and/or embodiments described herein. Examples of a machine-readable storage medium include, but are not limited to, a magnetic disk, an optical disc (e.g., CD, CD-R, DVD, DVD-R, etc.), a magneto-optical disk, a read-only memory "ROM" device, a random-access memory "RAM" device, a magnetic card, an optical card, a solid-state memory device, an EPROM, an EEPROM, and any combinations thereof. A machine-readable medium, as used herein, is intended to include a single medium as well as a collection of physically separate media, such as, for example, a collection of compact discs or one or more hard disk drives in combination with a computer memory. As used herein, a machine-readable storage medium does not include transitory forms of signal transmission.
[0088] Examples of a compute device include, but are not limited to, an electronic book reading device, a computer workstation, a terminal computer, a server computer, a handheld device (e.g., a tablet computer, a smartphone, etc.), a web appliance, a network router, a network switch, a network bridge, any machine capable of executing a sequence of instructions that specify an action to be taken by that machine, and any combinations thereof. In one example, a compute device can include and/or be included in a kiosk. [0089] Combinations of the foregoing concepts and additional concepts discussed herewithin (provided such concepts are not mutually inconsistent) are contemplated as being part of the subject matter disclosed herein. The terminology explicitly employed herein that also can appear in any disclosure incorporated by reference should be accorded a meaning most consistent with the particular concepts disclosed herein.
[0090] The drawings are primarily for illustrative purposes, and are not intended to limit the scope of the subject matter described herein. The drawings are not necessarily to scale; in some instances, various aspects of the subject matter disclosed herein can be shown exaggerated or enlarged in the drawings to facilitate an understanding of different features. In the drawings, like reference characters generally refer to like features (e.g., functionally similar and/or structurally similar elements).
[0091] The entirety of this application (including the Cover Page, Title, Headings, Background, Summary, Brief Description of the Drawings, Detailed Description, Embodiments, Abstract, Figures, Appendices, and otherwise) shows, by way of illustration, various embodiments in which the embodiments can be practiced. The advantages and features of the application are of a representative sample of embodiments, and are not exhaustive and/or exclusive. Rather, they are presented to assist in understanding and teach the embodiments, and are not representative of all embodiments. As such, all examples and/or embodiments are deemed to be non-limiting throughout this disclosure.
[0092] It is to be understood that the logical and/or topological structure of any combination of any program components (a component collection), other components and/or any present feature sets as described in the figures and/or throughout are not limited to a fixed operating order and/or arrangement, but rather, any disclosed order is an example and all equivalents, regardless of order, are contemplated by the disclosure.
[0093] The term “automatically” is used herein to modify actions that occur without direct input or prompting by an external source such as a user. Automatically occurring actions can occur periodically, sporadically, in response to a detected event (e.g., a user logging in), or according to a predetermined schedule.
[0094] The term “determining” encompasses a wide variety of actions and, therefore, “determining” can include, for example, calculating, computing, processing, deriving, investigating, looking up (e.g., looking up in a table, a database or another data structure), ascertaining and the like. Also, “determining” can include receiving (e.g., receiving information), accessing (e.g., accessing data in a memory) and the like. Also, “determining” can include resolving, selecting, choosing, establishing and the like.
[0095] The phrase “based on” does not mean “based only on,” unless expressly specified otherwise. In other words, the phrase “based on” describes both “based only on” and “based at least on.”
[0096] The term “processor” should be interpreted broadly to encompass a general-purpose processor, a central processing unit (CPU), a microprocessor, a digital signal processor (DSP), a controller, a microcontroller, a state machine and so forth. Under some circumstances, a “processor” can refer to an application specific integrated circuit (ASIC), a programmable logic device (PLD), a field programmable gate array (FPGA), etc. The term “processor” can refer to a combination of processing devices, e.g., a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors in conjunction with a DSP core or any other such configuration.
[0097] The term “memory” should be interpreted broadly to encompass any electronic component capable of storing electronic information. The term memory can refer to various types of processor-readable media such as random-access memory (RAM), read-only memory (ROM), non-volatile random-access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, registers, etc. Memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. Memory that is integral to a processor is in electronic communication with the processor.
[0098] The terms “instructions” and “code” should be interpreted broadly to include any type of computer-readable statement(s). For example, the terms “instructions” and “code” can refer to one or more programs, routines, sub-routines, functions, procedures, etc. “Instructions” and “code” can comprise a single computer-readable statement or many computer-readable statements.
[0099] Some embodiments described herein relate to a computer storage product with a non-transitory computer-readable medium (also can be referred to as a non-transitory processor-readable medium) having instructions or computer code thereon for performing various computer-implemented operations. The computer-readable medium (or processor-readable medium) is non-transitory in the sense that it does not include transitory propagating signals per se (e.g., a propagating electromagnetic wave carrying information on a transmission medium such as space or a cable). The media and computer code (also can be referred to as code) can be those designed and constructed for the specific purpose or purposes. Examples of non-transitory computer-readable media include, but are not limited to, magnetic storage media such as hard disks, floppy disks, and magnetic tape; optical storage media such as Compact Disc/Digital Video Discs (CD/DVDs), Compact Disc-Read Only Memories (CD-ROMs), and holographic devices; magneto-optical storage media such as optical disks; carrier wave signal processing modules; and hardware devices that are specially configured to store and execute program code, such as Application-Specific Integrated Circuits (ASICs), Programmable Logic Devices (PLDs), Read-Only Memory (ROM) and Random-Access Memory (RAM) devices. Other embodiments described herein relate to a computer program product, which can include, for example, the instructions and/or computer code discussed herein.
[0100] Some embodiments and/or methods described herein can be performed by software (executed on hardware), hardware, or a combination thereof. Hardware modules can include, for example, a general-purpose processor, a field programmable gate array (FPGA), and/or an application specific integrated circuit (ASIC). Software modules (executed on hardware) can be expressed in a variety of software languages (e.g., computer code), including C, C++, Java™, Ruby, Visual Basic™, and/or other object-oriented, procedural, or other programming language and development tools. Examples of computer code include, but are not limited to, micro-code or micro-instructions, machine instructions, such as produced by a compiler, code used to produce a web service, and files containing higher-level instructions that are executed by a computer using an interpreter. For example, embodiments can be implemented using imperative programming languages (e.g., C, Fortran, etc.), functional programming languages (Haskell, Erlang, etc.), logical programming languages (e.g., Prolog), object-oriented programming languages (e.g., Java, C++, etc.) or other suitable programming languages and/or development tools. Additional examples of computer code include, but are not limited to, control signals, encrypted code, and compressed code.
[0101] Various concepts can be embodied as one or more methods, of which at least one example has been provided. The acts performed as part of the method can be ordered in any suitable way. Accordingly, embodiments can be constructed in which acts are performed in an order different than illustrated, which can include performing some acts simultaneously, even though shown as sequential acts in illustrative embodiments. Put differently, it is to be understood that such features can not necessarily be limited to a particular order of execution, but rather, any number of threads, processes, services, servers, and/or the like that can execute serially, asynchronously, concurrently, in parallel, simultaneously, synchronously, and/or the like in a manner consistent with the disclosure.
[0102] The indefinite articles “a” and “an,” as used herein in the specification and in the embodiments, unless clearly indicated to the contrary, should be understood to mean “at least one.”
[0103] The phrase “and/or,” as used herein in the specification and in the embodiments, should be understood to mean “either or both” of the elements so conjoined, i.e., elements that are conjunctively present in some cases and disjunctively present in other cases. Multiple elements listed with “and/or” should be construed in the same fashion, i.e., “one or more” of the elements so conjoined. Other elements can optionally be present other than the elements specifically identified by the “and/or” clause, whether related or unrelated to those elements specifically identified. Thus, as a non-limiting example, a reference to “A and/or B”, when used in conjunction with open-ended language such as “comprising” can refer, in one embodiment, to A only (optionally including elements other than B); in another embodiment, to B only (optionally including elements other than A); in yet another embodiment, to both A and B (optionally including other elements); etc.
[0104] As used herein in the specification and in the embodiments, “or” should be understood to have the same meaning as “and/or” as defined above. For example, when separating items in a list, “or” or “and/or” shall be interpreted as being inclusive, i.e., the inclusion of at least one, but also including more than one, of a number or list of elements, and, optionally, additional unlisted items. Only terms clearly indicated to the contrary, such as “only one of” or “exactly one of,” or, when used in the embodiments, “consisting of,” will refer to the inclusion of exactly one element of a number or list of elements. In general, the term “or” as used herein shall only be interpreted as indicating exclusive alternatives (i.e. “one or the other but not both”) when preceded by terms of exclusivity, such as “either,” “one of,” “only one of,” or “exactly one of.” “Consisting essentially of,” when used in the embodiments, shall have its ordinary meaning as used in the field of patent law. [0105] As used herein in the specification and in the embodiments, the phrase “at least one,” in reference to a list of one or more elements, should be understood to mean at least one element selected from any one or more of the elements in the list of elements, but not necessarily including at least one of each and every element specifically listed within the list of elements and not excluding any combinations of elements in the list of elements. This definition also allows that elements can optionally be present other than the elements specifically identified within the list of elements to which the phrase “at least one” refers, whether related or unrelated to those elements specifically identified.
Thus, as a non-limiting example, “at least one of A and B” (or, equivalently, “at least one of A or B,” or, equivalently “at least one of A and/or B”) can refer, in one embodiment, to at least one, optionally including more than one, A, with no B present (and optionally including elements other than B); in another embodiment, to at least one, optionally including more than one, B, with no A present (and optionally including elements other than A); in yet another embodiment, to at least one, optionally including more than one, A, and at least one, optionally including more than one, B (and optionally including other elements); etc.
[0106] In the embodiments, as well as in the specification above, all transitional phrases such as “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” “holding,” “composed of,” and the like are to be understood to be open-ended, i.e., to mean including but not limited to. Only the transitional phrases “consisting of” and “consisting essentially of” shall be closed or semi-closed transitional phrases, respectively, as set forth in the United States Patent Office Manual of Patent Examining Procedures, Section 2111.03.


CLAIMS What is claimed is:
1. An apparatus for persona generation using ensemble machine learning, comprising: a processor; and a memory operatively coupled to the processor, the memory storing instructions to cause the processor to: transmit actionable digital data to a first respondent, the actionable digital data including a plurality of natural language queries; receive respondent data from the first respondent, the respondent data including a plurality of natural language feature descriptions; encode each natural language feature description from the plurality of natural language feature descriptions to produce feature data from a plurality of feature data based on a feature type that is associated with a natural language query from the plurality of natural language queries for that natural language feature description; train, using the plurality of feature data, each persona machine learning model from a plurality of persona machine learning models to produce a plurality of trained persona machine learning models; and execute one or more trained persona machine learning models from the plurality of trained persona machine learning models, to produce a qualitative entity identifier, the qualitative entity identifier representing a group of archetypal entities associated with a second respondent.
2. The apparatus of claim 1, wherein the actionable digital data is based on a challenge statement, the challenge statement defined by a changemaker.
3. The apparatus of claim 2, wherein the memory stores instructions to further cause the processor to enable the second respondent to invite other respondents to receive the actionable digital data based on the challenge statement.
4. The apparatus of claim 2, wherein the memory stores instructions to further cause the processor to generate a plurality of persona training subsets based on the challenge statement.
5. The apparatus of claim 1, wherein the memory stores instructions to further cause the processor to generate the plurality of natural language queries using a decision tree model.
6. The apparatus of claim 1, wherein prior to encoding each natural language feature description from the plurality of natural language feature descriptions to produce the feature data from the plurality of feature data, the memory stores instructions to further cause the processor to: identify or correct one or more errors in the respondent data; select one or more feature inputs for the plurality of persona machine learning models based on the respondent data; transform the one or more selected feature inputs for the plurality of persona machine learning models; derive new feature inputs based on external data, the external data received from an external respondent; and generate a compact feature projection, to encode the compact feature projection into the feature data.
7. The apparatus of claim 6, wherein the memory stores instructions to further cause the processor to generate the compact feature projection based on one or more additional natural language queries included in the actionable digital data using a principal-component-analysis, a missing value ratio, and a high correlation filter.
8. The apparatus of claim 1, wherein the memory stores instructions to further cause the processor to receive crowdsourced data from an external data system.
9. The apparatus of claim 1, wherein the feature data includes binary, continuous, or categorical values.
10. The apparatus of claim 1, wherein the memory stores instructions to further cause the processor to encode the respondent data using normalization, standardization, or word vectorization, to produce the feature data.
11. The apparatus of claim 1, wherein the memory stores instructions to further cause the processor to encode missing data in the respondent data, the missing data including unanswered natural language queries.
12. The apparatus of claim 11, wherein the memory stores instructions to further cause the processor to impute the missing data to a numerical value based on the mean or median of the feature data associated with the missing data.
13. The apparatus of claim 11, wherein the memory stores instructions to further cause the processor to impute the missing data to a numerical value based on a predetermined threshold.
14. The apparatus of claim 1, wherein the plurality of persona machine learning models includes an ensemble machine learning model.
15. The apparatus of claim 1, wherein at least one persona machine learning model from the plurality of persona machine learning models includes a feedforward neural network language model.
16. The apparatus of claim 1, wherein each persona machine learning model from the plurality of persona machine learning models includes an unsupervised machine learning model.
17. The apparatus of claim 1, wherein the instructions to cause the processor to combine outputs of the one or more persona machine learning models use a voting scheme, to produce the qualitative entity identifier.
18. The apparatus of claim 1, wherein the memory stores instructions to further cause the processor to present a user interface to allow a user to refine the group of archetypal entities of the persona.
19. The apparatus of claim 1, wherein the memory stores instructions to cause the processor to generate a dendrogram based on the persona, the dendrogram identifying a correlation between the respondent data and the feature data.
20. A method, comprising: transmitting, by a processor operatively coupled to a memory, actionable digital data to a respondent, the actionable digital data including a plurality of natural language queries; receiving respondent data from the respondent, the respondent data including a plurality of natural language feature descriptions; encoding each natural language feature description from the plurality of natural language feature descriptions, to produce feature data from a plurality of feature data based on a feature type that is associated with a natural language query from the plurality of natural language queries for that natural language feature description; and executing one or more trained persona machine learning models from a plurality of trained persona machine learning models using the plurality of feature data as an input, to produce a qualitative entity identifier, the qualitative entity identifier representing a group of archetypal entities associated with the respondent.
PCT/US2023/071100 2022-07-27 2023-07-27 Methods and apparatus for ensemble machine learning models and natural language processing for predicting persona based on input patterns WO2024026393A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202263392775P 2022-07-27 2022-07-27
US63/392,775 2022-07-27

Publications (1)

Publication Number Publication Date
WO2024026393A1 true WO2024026393A1 (en) 2024-02-01

Family

ID=89707329

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2023/071100 WO2024026393A1 (en) 2022-07-27 2023-07-27 Methods and apparatus for ensemble machine learning models and natural language processing for predicting persona based on input patterns

Country Status (1)

Country Link
WO (1) WO2024026393A1 (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170365277A1 (en) * 2016-06-16 2017-12-21 The George Washington University Emotional interaction apparatus
US20180232752A1 (en) * 2017-02-15 2018-08-16 Qualtrics, Llc Administering a digital survey over voice-capable devices
US20190043618A1 (en) * 2016-11-14 2019-02-07 Cognoa, Inc. Methods and apparatus for evaluating developmental conditions and providing control over coverage and reliability
US20190065576A1 (en) * 2017-08-23 2019-02-28 Rsvp Technologies Inc. Single-entity-single-relation question answering systems, and methods
US20190294673A1 (en) * 2018-03-23 2019-09-26 Servicenow, Inc. Method and system for automated intent mining, classification and disposition
US20190335006A1 (en) * 2018-04-27 2019-10-31 Adobe Inc. Dynamic customization of structured interactive content on an interactive computing system
US20210073835A1 (en) * 2019-09-11 2021-03-11 International Business Machines Corporation Cognitive dynamic goal survey
US20210110894A1 (en) * 2018-06-19 2021-04-15 Ellipsis Health, Inc. Systems and methods for mental health assessment
US20220067541A1 (en) * 2020-08-25 2022-03-03 Alteryx, Inc. Hybrid machine learning

Similar Documents

Publication Publication Date Title
CN110175227B (en) Dialogue auxiliary system based on team learning and hierarchical reasoning
US20210272040A1 (en) Systems and methods for language and speech processing with artificial intelligence
US9262725B2 (en) Mental modeling for modifying expert model
CN112925911B (en) Complaint classification method based on multi-modal data and related equipment thereof
CN111462752A (en) Client intention identification method based on attention mechanism, feature embedding and BI-L STM
US20220027733A1 (en) Systems and methods using artificial intelligence to analyze natural language sources based on personally-developed intelligent agent models
Jackson et al. From natural language to simulations: Applying gpt-3 codex to automate simulation modeling of logistics systems
Lampridis et al. Explaining short text classification with diverse synthetic exemplars and counter-exemplars
BANU et al. An Intelligent Web App Chatbot
Eken et al. Predicting defects with latent and semantic features from commit logs in an industrial setting
Dixit et al. Student performance prediction using case based reasoning knowledge base system (CBR-KBS) based data mining
US20200174776A1 (en) Methods and systems for automated screen display generation and configuration
WO2024026393A1 (en) Methods and apparatus for ensemble machine learning models and natural language processing for predicting persona based on input patterns
Smacchia et al. Exploring Artificial Intelligence Bias, Fairness and Ethics in Organisation and Managerial Studies
AU2021444983A1 (en) System and method of automatic topic detection in text
Johansson et al. Customer segmentation using machine learning
Canitz Machine Learning in Supply Chain Planning--When Art & Science Converge.
Mahalle et al. Data-Centric AI
Saha et al. Sentiment Analysis to Review Products based on Machine Learning
Dominique et al. FactSheets for Hardware-Aware AI Models: A Case Study of Analog In Memory Computing AI Models
Tripathi et al. Election Results Prediction Using Twitter Data by Applying NLP
Starck Production Database Preprocessing: Transforming messy data into actionable insights
Yamani Exploration of Techniques for Working with Sparse Data when Applying Natural Language Processing to Assist a Qualitative Data Analysis of a COVID-19 Open Innovation Community
Vashisht et al. Prediction Analysis on Trending Twitter Hashtags Using Machine Learning
Hendriana et al. Approaches in Determining User Story Quality through Requirement Elicitation: A Systematic Literature Review

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23847556

Country of ref document: EP

Kind code of ref document: A1