CN117235354A - User personalized service strategy and system based on multi-modal large model - Google Patents

User personalized service strategy and system based on multi-modal large model

Info

Publication number: CN117235354A
Application number: CN202311116653.1A
Authority: CN (China)
Prior art keywords: user, model, data, mode, personalized
Priority date / Filing date: 2023-09-01
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 何传雯, 司玉景, 李全忠, 蒲瑶, 何国涛
Current Assignee (the listed assignee may be inaccurate): Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Original Assignee: Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Application filed by Puqiang Times Zhuhai Hengqin Information Technology Co ltd
Priority to CN202311116653.1A, published as CN117235354A

Abstract

The invention belongs to the technical field of multi-modal information processing, and in particular relates to a user personalized service strategy and system based on a multi-modal large model. The method comprises the following steps: step 1: multi-modal data acquisition and model adaptive training; step 2: multi-modal data fusion; step 3: user portrait construction; step 4: personalized strategy generation; step 5: user portrait optimization. The beneficial effects of the invention are as follows: compared with a traditional personalized strategy, the method and system extract the user's multi-modal information with a multi-modal large model, generate an all-round, three-dimensional user portrait and apply it in the subsequent interaction system, which improves the accuracy of the user portrait. Compared with existing single-modality approaches to user portrait construction (for example, relying on text or speech only), the method introduces a multi-modal large model that can fuse and process user data from multiple modalities such as text, speech, images and video.

Description

User personalized service strategy and system based on multi-modal large model
Technical Field
The invention belongs to the technical field of multi-modal information processing, and in particular relates to a user personalized service strategy and system based on a multi-modal large model.
Background
As technology spreads rapidly into every field, users' demand for experiences that match their intentions and are personalized continues to grow, and is even accelerating. Against this background, industries pay increasing attention to refined, user-centered service strategies, recognizing them as a core element of enterprise survival, development, competitive advantage and even market leadership. Establishing a user portrait therefore becomes an essential link. The user portrait is the core tool with which an enterprise understands its user groups in depth, designs products and delivers services according to personalized requirements, and it plays a vital role both in attracting new potential customers and in retaining existing loyal users.
Conventional methods for constructing user portraits fall broadly into two types. One relies on manual input and feedback from the user, such as user configuration, questionnaires, or information filled in at registration, to obtain the user's needs and interests. The other models user behavior with deep learning or complex machine learning models and mines the user's behavioral habits.
The existing user portrait construction methods have the following defects:
(1) Low user portrait precision
Traditional methods such as user configuration, questionnaires or registration forms often reduce users' complex demands and personalized behavior to a handful of numeric values or category features. They lack a deep and comprehensive understanding of the user, ignore the influence of the user's emotions, preferences, living habits and social environment, and ignore the temporal sequence of user behavior, for example the fact that a user's interests change over time. Moreover, out of privacy concerns and similar reasons, the information the user provides is not necessarily true and objective, so a user portrait generated from it is not accurate enough and may even be wrong.
(2) High labor cost
Whether information is collected by questionnaires and the like, or models are trained with deep learning or complex machine learning, a large amount of human effort is required. The former needs professional data analysts to design targeted questionnaires and to analyze and distill the information. The latter requires a large time investment from developers, including data labeling, model design, debugging and optimization. The labor cost of current user portrait construction methods is therefore relatively high.
(3) Non-persistent information storage in large-model-based interaction systems: existing multi-modal large-model interaction systems reset after each round of dialogue, which means the model cannot effectively remember previous user information, such as the preferences and habits the user displayed in earlier interactions. This limits the model's understanding of and adaptation to the user's personalized needs, and also reduces the accuracy and degree of personalization of the service.
Disclosure of Invention
The invention aims to provide a user personalized service strategy based on a multi-modal large model, so as to solve the prior-art problems of single-modality information extraction, high labor cost, and non-persistent information storage in large-model-based interaction systems.
The technical scheme of the invention is as follows: a user personalized service strategy based on a multi-modal large model comprises the following steps:
step 1: multi-modal data acquisition and model adaptive training;
step 2: multi-modal data fusion;
step 3: user portrait construction;
step 4: personalized strategy generation;
step 5: user portrait optimization.
The step 1 includes:
step 11: multi-modal data acquisition
Through interaction between the user and the multi-modal large model, multi-modal user data are collected, where the user data include data of multiple modality types such as text, speech, images and video;
step 12: model adaptive training.
The step 12 includes:
step 121: data preprocessing
Preprocessing the collected multi-modal data, including data cleaning and data normalization, and removing invalid, redundant and erroneous data from the dataset;
step 122: feature extraction
Performing feature extraction on the preprocessed data: extracting keywords or phrases from the text data, extracting multi-level, multi-scale feature representations from the speech and visual data with a deep neural network model, and storing the extracted features as the input of model training;
step 123: model training
Inputting the preprocessed and feature-extracted data into the multi-modal large model and training with algorithms such as supervised learning; the algorithm trains on the training data, and the difference between the model's predicted value and the true value is measured by a loss function;
step 124: model testing
Using part of the data for model testing and evaluating the accuracy of the model; if the model does not meet the expected result, adjusting the model and returning to step 123; if the performance of the model meets the expected result, proceeding to the next step.
Step 2 includes: after collecting and processing data of different modality types, using the multi-modal large model to extract features from the data of each modality, and fusing these features to establish associations between the modalities, including but not limited to the following associations:
(1) the association of language patterns with emotional expression;
(2) the association of language patterns with behavior patterns;
(3) the association of behavior patterns with visual information;
(4) associations of other heterogeneous modality information.
Step 3 includes inputting the result of the multi-modal data fusion into the multi-modal large model, from which the multi-modal large model further generates a detailed user portrait including, but not limited to, user preferences, behavior patterns, emotional state and language style, followed by the following operations:
step 31: feature labeling
Labeling each user characteristic according to the feature results obtained from the multi-modal data fusion;
step 32: portrait filling
Mapping the user information labeled by the multi-modal large model onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply and in a human-centered way.
Step 4 includes, after constructing the user portrait, inputting the user portrait into the multi-modal large model as a prompt; through self-learning of the multi-modal large model, personalized services customized on the basis of the user portrait are provided after interaction, including but not limited to the following:
(1) personalized replies;
(2) personalized recommendations;
(3) a personalized interface.
The step 5 includes: the multi-modal large model learns automatically from user feedback to optimize the user portrait; based on the user's real-time feedback and behavior, whenever new user data is input the multi-modal large model comprehensively analyzes the user's latest preferences, personality and other characteristics according to the above steps, and continuously optimizes the user portrait.
A user personalized service system based on a multi-modal large model comprises a multi-modal data acquisition and model adaptive training module, a multi-modal data fusion module, a user portrait construction module, a personalized strategy generation module and a user portrait optimization module.
The multi-modal data acquisition and model adaptive training module collects multi-modal user data through interaction between the user and the multi-modal large model, where the user data include data of multiple modality types such as text, speech, images and video;
it preprocesses the collected multi-modal data, including data cleaning and data normalization, and removes invalid, redundant and erroneous data from the dataset;
it performs feature extraction on the preprocessed data: extracting keywords or phrases from the text data, extracting multi-level, multi-scale feature representations from the speech and visual data with a deep neural network model, and storing the extracted features as the input of model training;
it inputs the preprocessed and feature-extracted data into the multi-modal large model and trains with algorithms such as supervised learning; the algorithm trains on the training data, the difference between the model's predicted value and the true value is measured by a loss function, and during training the model parameters are continuously adjusted by optimization algorithms such as gradient descent so as to minimize the value of the loss function;
it uses part of the data for model testing and evaluates the accuracy of the model; if the model does not meet the expected result, the model is adjusted and training is repeated until the model meets the expected result;
after the data of different modality types have been collected and processed, the multi-modal data fusion module uses the multi-modal large model to extract features from the data of each modality and fuses these features to establish associations between the modalities, including but not limited to the following associations:
(1) the association of language patterns with emotional expression;
(2) the association of language patterns with behavior patterns;
(3) the association of behavior patterns with visual information;
(4) associations of other heterogeneous modality information;
the user portrait construction module inputs the multi-modal data fusion result into the multi-modal large model, from which the multi-modal large model further generates a detailed user portrait including, but not limited to, user preferences, behavior patterns, emotional state and language style, and then performs the following operations:
feature labeling, namely labeling each user characteristic according to the feature results obtained from the multi-modal data fusion;
portrait filling, namely mapping the user information labeled by the multi-modal large model onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply and in a human-centered way;
after constructing the user portrait, the personalized strategy generation module inputs the user portrait into the multi-modal large model as a prompt; through self-learning of the multi-modal large model, personalized services customized on the basis of the user portrait are provided after interaction, including but not limited to the following:
(1) personalized replies;
(2) personalized recommendations;
(3) a personalized interface;
the user portrait optimization module uses the multi-modal large model to learn automatically from user feedback and optimize the user portrait: based on the user's real-time feedback and behavior, whenever new user data is input it comprehensively analyzes the user's latest preferences, personality and other characteristics and continuously optimizes the user portrait.
The beneficial effects of the invention are as follows: compared with a traditional personalized strategy, the method and system extract the user's multi-modal information with a multi-modal large model, generate an all-round, three-dimensional user portrait and apply it in the subsequent interaction system, and have the following advantages and positive effects:
(1) Accuracy of the user portrait: compared with existing single-modality approaches to user portrait construction (for example, relying on text or speech only), the method introduces a multi-modal large model that can fuse and process user data from multiple modalities such as text, speech, images and video. With this rich and comprehensive data, the multi-modal large model can generate a more accurate and fine-grained user portrait and achieve an all-round, in-depth understanding of each user's needs. The precision of the user portrait is greatly improved and its range of application greatly widened; a high-precision user portrait can be used for more accurate recommendation, discrimination and prediction, which effectively improves the user experience and raises user satisfaction;
(2) Continuity of the interaction system based on the multi-modal large model: in contrast to existing multi-modal large-model interaction systems, which are reset at every round of dialogue, the multi-modal large-model interaction system of this method is not reset after each interaction with the user; instead it accumulates and learns user information to generate an accurate user portrait, so that each user's information is stored more persistently and a lasting personalized service can be provided;
(3) Convenience of system maintenance: compared with the existing approach of introducing a reward mechanism to adjust or retrain the model, the multi-modal large model can optimize itself on new user data by virtue of its own learning and adjustment capability. Meanwhile, the multi-modal large model allows user portraits to be entered and managed through intuitive, simple prompt words. For example, if the portrait of a certain user needs to be updated or corrected, modifying the corresponding prompt word is enough for the change to be reflected immediately in the model's service strategy, which greatly improves the efficiency and convenience of system maintenance;
(4) Uniqueness of the user experience: based on a more accurate user portrait, a highly customized service experience can be provided to the user. Whether it is personalized replies in an interactive dialogue, personalized content recommendation based on user preferences and real-time needs, or a personalized interface automatically adjusted to the user's visual preferences, a personalization strategy targeted at each user can be generated, so that every user genuinely enjoys a unique, tailor-made service experience.
Drawings
Fig. 1 is a flowchart of the user personalized service strategy based on a multi-modal large model provided by the invention.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawing and specific examples.
Multi-modal large models (such as GPT-4) perform excellently on many tasks such as language and image processing, and offer a new approach to personalization strategies. A multi-modal large model has understanding and computing power far beyond earlier models: it can efficiently analyze and process large datasets and achieve autonomous learning and adjustment. Moreover, a multi-modal large model can analyze inputs in many forms such as language, images, sound and video; even when the environment is complex or the information is ambiguous, it can quickly find the important features and apply them effectively, so as to identify the user's basic preferences and behavior and further mine the fine-grained habits and subtle changes hidden behind the data. However complex the user's preferences and behavior patterns are, the model has the capability to understand and learn them and to provide a highly personalized service for each user. The multi-modal large model is also outstandingly convenient: it supports operation through prompt words, which greatly lowers the threshold for use and is more user-friendly, so that operators can use the model efficiently without detailed technical knowledge, improving the implementation of the personalized service.
The invention provides a user personalized service strategy based on a multi-modal large model. First, the information-processing capability of the multi-modal large model is used to obtain the user's personalized information, such as age, gender, region, occupation, interests and hobbies, by analyzing the user's behavior and preferences across modalities such as speech, text, images and video. Second, based on this information, the multi-modal large model is used to generate a more accurate and fine-grained user portrait. Finally, the generated user portrait is applied to the multi-modal large model or to other products and services, forming an accurate user-personalized service strategy and improving user experience and satisfaction.
As shown in fig. 1, a user personalized service strategy based on a multi-modal large model includes the following steps:
Step 1: multi-modal data acquisition and model adaptive training
Step 11: multi-modal data acquisition
Multi-modal user data are collected through interaction between the user and the multi-modal large model, on the premise of respecting and protecting the user's privacy. Specifically, the collected user data include data of multiple modality types such as text, speech, images and video.
Text modality: mainly the text content entered by the user, including but not limited to search records and text chat records. The multi-modal large model analyzes it along several dimensions such as vocabulary, sentence structure and contextual topic, and can reveal the user's language patterns, preferences, needs and emotional tendencies.
Speech modality: mainly the user's voice input, for example the audio signal captured by a microphone; speech recognition and voiceprint recognition extract information such as the user's vocal characteristics, emotional state and regional dialect.
Image modality: mainly static picture information provided by the user, such as personal photos uploaded by the user and emoticons in chat records; image-processing methods extract features from the images to supplement the user information.
Video modality: mainly video information such as video chats and user footage captured by the camera. Dynamic information in the video, such as facial expressions and body movements, can be analyzed by face recognition to infer the user's gender and age, behavioral habits, emotional expression, and so on.
Step 12: model adaptive training
Step 121: data preprocessing
Preprocessing the collected multi-modal data, including data cleaning, data normalization, and the like. Invalid, redundant and erroneous data is removed from the data set, guaranteeing the quality and integrity of the data.
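A minimal sketch of the kind of preprocessing described above, assuming the collected records arrive as Python dictionaries; the field names (`user_id`, `modality`, `payload`) and the deduplication rule are illustrative assumptions, not taken from the patent:

```python
from typing import Iterable

REQUIRED_FIELDS = {"user_id", "modality", "payload"}  # illustrative schema

def clean_records(records: Iterable[dict]) -> list[dict]:
    """Drop invalid, redundant and obviously erroneous records."""
    seen = set()
    cleaned = []
    for rec in records:
        # invalid: missing fields or empty payload
        if not REQUIRED_FIELDS.issubset(rec) or not rec["payload"]:
            continue
        # erroneous: modality outside the four types used in this strategy
        if rec["modality"] not in {"text", "speech", "image", "video"}:
            continue
        # redundant: exact duplicate of a record already kept
        key = (rec["user_id"], rec["modality"], str(rec["payload"]))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

def normalize_text(s: str) -> str:
    """Simple text normalization: collapse whitespace and unify case."""
    return " ".join(s.split()).lower()
```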
Step 122: feature extraction
Extracting features of the preprocessed data, extracting keywords or phrases of the text data, extracting high-dimensional features of emotion analysis and the like; and extracting multi-level and multi-scale characteristic representation from the voice and visual data by using a deep neural network model. The extracted features are stored as input to model training.
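Purely as an illustration of the text branch of such a feature extractor, the following sketch uses the Hugging Face `transformers` library and PyTorch; the chosen checkpoint and the mean-pooling strategy are assumptions, since the patent does not name a specific encoder:

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Assumed checkpoint; any text encoder could stand in here.
_tok = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
_enc = AutoModel.from_pretrained("bert-base-multilingual-cased")

@torch.no_grad()
def text_features(sentences: list[str]) -> torch.Tensor:
    """Return one pooled feature vector per sentence (mean over non-padding tokens)."""
    batch = _tok(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = _enc(**batch).last_hidden_state        # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)    # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)     # (batch, dim)
```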
Step 123: training model
The data after pretreatment and feature extraction is input into a multi-mode large model, and the algorithms such as supervised learning and the like are used for training. The algorithm will train on the training data and measure the difference between the predicted and actual values of the model by the loss function. In training, model parameters are continuously adjusted through optimization algorithms such as gradient descent and the like so as to minimize the value of the loss function, and therefore the optimal training effect is obtained.
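A generic supervised training loop of the kind described here (loss function plus gradient descent), shown with PyTorch; the small classification head, the random tensors and the 7-class target are placeholders for the patent's unspecified model and dataset:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder head mapping 768-d fused features to 7 emotion classes (assumed sizes).
model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 7))
loss_fn = nn.CrossEntropyLoss()                            # difference between prediction and truth
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)   # gradient descent

features = torch.randn(512, 768)                           # stand-in extracted features
labels = torch.randint(0, 7, (512,))                       # stand-in ground-truth labels
loader = DataLoader(TensorDataset(features, labels), batch_size=32, shuffle=True)

for epoch in range(5):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)    # measure predicted value against true value
        loss.backward()                # gradients of the loss
        optimizer.step()               # adjust parameters to minimize the loss
```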
Step 124: model testing
And (3) performing model test by using part of data, evaluating the accuracy of the model, and if the model does not meet the expected result, adjusting the model and returning to the step of training the model. If the performance of the model meets the expected result, then go to the next step.
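A sketch of the accept-or-retrain check, continuing the placeholder classifier from the training sketch above; the 0.9 accuracy threshold is an assumed "expected result", not a figure from the patent:

```python
import torch
from torch import nn

@torch.no_grad()
def accuracy(model: nn.Module, x: torch.Tensor, y: torch.Tensor) -> float:
    """Fraction of held-out samples classified correctly."""
    preds = model(x).argmax(dim=-1)
    return (preds == y).float().mean().item()

EXPECTED_ACCURACY = 0.9                                     # assumed acceptance threshold
test_x, test_y = torch.randn(128, 768), torch.randint(0, 7, (128,))

if accuracy(model, test_x, test_y) < EXPECTED_ACCURACY:
    print("below expectation: adjust the model and return to step 123")
else:
    print("meets expectation: proceed to step 2 (multi-modal data fusion)")
```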
Step 2: multimodal data fusion
Features are extracted from data of all modes by using a trained multi-mode large model, and are fused to establish relevance among all modes, including but not limited to the following relevance:
(1) Association of language patterns with emotion expressions: language patterns are habits and ways in which the user uses the language in communication, including selected words, etc., and emotional expressions can be classified into sad, anger, happy, surprise, fear, aversion, and neutral 7 states. The language mode and the emotion expression tend to show the same trend, and the specific vocabulary or expression mode frequently used by the user in the dialogue can reflect the specific emotional state or psychological tendency, and the specific association is as follows:
Sadness: the user may use some vocabulary in the communication that depicts sadness or loss, such as "hard to see", "hurt", which means that the user is experiencing sad or frustrated emotion, or that the user shows a feeling of being distracted by the appearance of something.
Anger: the user uses words representing anger, annoyance, such as "angry", "dying", in the communication, and then indicates that the user is experiencing anger emotion, or that the emotional tendency of anger and angry is exhibited for the occurrence of something.
Happy: users use words in communication that express pleasure, satisfaction, such as "happy", to indicate that the user is experiencing happy and happy emotions, or to show emotional tendency to anger and gas to the appearance of something.
Surprisingly: the user uses words representing surprise or surprise in the communication, such as "java", "true moral", to indicate that the user is experiencing surprisal emotion, or that the user is exhibiting surprisal emotion tendencies for something to appear.
Fear of: the user uses some words expressing fear and tension in communication, such as fear and fear, and the emotion tendency that the user is experiencing fear or showing fear to the appearance of something is indicated.
Aversion to: users use words in communication that represent aversion, dislikes, such as "nausea", "dislikes", that represent the user is experiencing a disliked emotion, or that represent a disliked emotional tendency for the occurrence of something.
And (3) neutral: the user uses some neutral vocabulary in the communication, e.g. "good", "understanding", to indicate that the user's current emotional state is smooth or that there is no significant emotional fluctuation in the appearance of something.
The language understanding capability is realized through the multi-mode large model, data of different mode types are input into the multi-mode large model, the corresponding relation between the vocabulary and emotion is automatically captured through the multi-mode large model, and related relations are established, so that emotion expression habits and object preferences of users can be more accurately positioned.
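Purely to illustrate the vocabulary-to-emotion correspondence described above (the patent itself relies on the large model to capture it automatically), a naive keyword lookup might look like this; the cue-word lists are invented examples:

```python
# Assumed, hand-picked cue words per emotion; a real system would let the
# multi-modal large model infer these correspondences instead.
EMOTION_CUES = {
    "sadness":   {"upset", "hurt", "miserable"},
    "anger":     {"furious", "infuriating", "angry"},
    "happiness": {"happy", "great", "delighted"},
    "surprise":  {"wow", "really", "unbelievable"},
    "fear":      {"afraid", "terrified", "nervous"},
    "disgust":   {"disgusting", "hate", "gross"},
}

def rough_emotion(utterance: str) -> str:
    """Return the first emotion whose cue words appear, else 'neutral'."""
    words = set(utterance.lower().split())
    for emotion, cues in EMOTION_CUES.items():
        if words & cues:
            return emotion
    return "neutral"

print(rough_emotion("wow I did not expect that"))   # -> surprise
```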
(2) Association of language patterns with behavior patterns: a user's behavior patterns are often reflected in language expression; by analyzing characteristics such as the user's speech rate, intonation and pauses, behavior patterns such as thinking habits and decision tendencies can be inferred. The specific associations are as follows:
Speech rate: generally the speed at which the user speaks, i.e. the number of words spoken per minute, which directly reflects the user's mental activity and decision speed. If a user speaks particularly quickly, this suggests an agile thinker who acts fast and is used to a high-pressure, fast-paced environment. Conversely, if a user speaks slowly, this suggests a calm, deliberate person.
Intonation: intonation is the variation in pitch when the user speaks, including how high the pitch is and how strong the voice is, and it is often directly related to the user's emotional and psychological state. For example, a user speaking with a high pitch and an excited tone may be experiencing anger or excitement, while a user with a low pitch and a calm voice may be in a calm, contented or passive emotional state.
Pauses: the length and frequency of pauses are an important reflection of the user's way of thinking, revealing the depth and rhythm of their thought. More frequent pauses indicate that the user spends considerable time thinking and judging; such users may pay more attention to the depth and completeness of things and lean toward a steady behavior pattern. A user with few pauses, i.e. hardly any breaks while speaking, may think faster and have a simple, efficient decision process; the behavior pattern of such users leans toward the proactive.
The multi-modal large model obtains the user's speech characteristics by processing the user's input speech data, and thereby understands the user's behavior patterns. With these associations, the user's thinking habits and decision tendencies can be understood better, so that services and suggestions that fit the user's needs can be provided.
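A sketch of how the three acoustic cues above could be measured with the `librosa` library; the silence threshold (`top_db=30`) and the use of an existing transcript to count words are assumptions, not details given in the patent:

```python
import librosa
import numpy as np

def speech_cues(audio_path: str, transcript: str) -> dict:
    """Estimate speech rate, pause behaviour and pitch level for one utterance."""
    y, sr = librosa.load(audio_path, sr=None)
    duration = len(y) / sr                                    # seconds

    # Speech rate: words per minute, counted from an existing transcript.
    wpm = len(transcript.split()) / (duration / 60.0)

    # Pauses: gaps between non-silent intervals (30 dB below peak is an assumption).
    voiced = librosa.effects.split(y, top_db=30)
    gaps = [(voiced[i + 1][0] - voiced[i][1]) / sr for i in range(len(voiced) - 1)]
    pause_ratio = sum(gaps) / duration if duration else 0.0

    # Intonation: median fundamental frequency over voiced frames.
    f0, voiced_flag, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
    )
    median_pitch = float(np.nanmedian(f0[voiced_flag])) if voiced_flag.any() else 0.0

    return {"words_per_minute": wpm, "pause_ratio": pause_ratio, "median_pitch_hz": median_pitch}
```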
(3) Association of behavior patterns with visual information: when a user interacts, whether reading a web page, watching a video or clicking a link, various behavior patterns are generated, and these are closely related to what the user sees. Specifically:
The user's browsing behavior: while looking at pictures or videos the user may show strong interest in or preference for a particular thing or topic, which is reflected in behavior such as gazing at a picture for a long time or clicking it repeatedly. For example, a user who, while watching a cooking video, repeatedly clicks on the detailed description of the ingredients or on the enlarged view of a preparation step is showing a strong interest in cooking.
The visual elements the user selects: the user makes a large number of choices during interaction, and visual preferences are reflected in clicking behavior; content matching the user's visual preferences receives more browsing time and deeper interaction. The multi-modal large model can treat these choices as behavior patterns of the user and analyze them against the visual elements involved (for example product pictures, campaign posters or interface styles). For example, if the pictures the user looks at or clicks on for a long time share a similar colour and style, it can be inferred that the user probably prefers that colour and style.
Through this association analysis of behavior patterns and visual information, the user's interest in or preference for a certain type of content can be captured, so that content and an interaction interface closer to the user's interests can be provided during interaction.
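A minimal sketch of turning raw interaction events into the kind of visual-preference signal described above; the event fields, the click/view weighting and the two-second dwell threshold are assumptions for illustration only:

```python
from collections import Counter

DWELL_THRESHOLD_S = 2.0   # assumed: shorter views are ignored

def visual_preferences(events: list[dict]) -> Counter:
    """Score engagement per visual tag from view/click events.

    Each event is assumed to look like:
      {"type": "view" | "click", "tags": ["dark", "minimalist"], "dwell_s": 3.4}
    """
    score = Counter()
    for ev in events:
        if ev["type"] == "click":
            weight = 2                                              # clicks weigh more than views
        elif ev["type"] == "view" and ev.get("dwell_s", 0) >= DWELL_THRESHOLD_S:
            weight = 1
        else:
            continue
        for tag in ev.get("tags", []):
            score[tag] += weight
    return score

prefs = visual_preferences([
    {"type": "view", "tags": ["cooking", "close-up"], "dwell_s": 8.0},
    {"type": "click", "tags": ["cooking", "ingredients"]},
])
print(prefs.most_common(2))   # [('cooking', 3), ('ingredients', 2)]
```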
(4) Associations of other heterogeneous modality information: in some cases the same user may exhibit different behavior patterns or preferences in different modalities, including text, speech, images and video. For example, a user who often discusses technology news in text conversations may prefer sports content in videos; these two seemingly contradictory pieces of information together reflect the user's multiple interests. Associating heterogeneous modality information makes it possible to mine the user's complex, multi-dimensional preferences, which is essential for constructing a comprehensive and deep user portrait.
Through this multi-modal data fusion, the data of each modality not only cover the user's basic information but also allow the user's behavior patterns, emotional states, preferences and habits to be fully understood; the data of the modalities are not isolated but complement one another, forming a comprehensive and deep user portrait. This greatly improves the accuracy and practicality of the scheme and provides a more precise and effective basis for the subsequent personalized service strategy.
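One simple way to realize the fusion described above is to project each modality's feature vector to a common size and combine them; the dimensions below are arbitrary placeholders, and the patent itself leaves the fusion mechanism to the multi-modal large model, so this is only an illustrative sketch:

```python
import torch
from torch import nn

class SimpleFusion(nn.Module):
    """Project per-modality features to a shared size and fuse by concatenation."""

    def __init__(self, dims: dict[str, int], shared: int = 128):
        super().__init__()
        self.proj = nn.ModuleDict({m: nn.Linear(d, shared) for m, d in dims.items()})
        self.mix = nn.Linear(shared * len(dims), shared)   # learns cross-modal associations

    def forward(self, feats: dict[str, torch.Tensor]) -> torch.Tensor:
        parts = [torch.relu(self.proj[m](x)) for m, x in feats.items()]
        return self.mix(torch.cat(parts, dim=-1))          # fused user representation

fusion = SimpleFusion({"text": 768, "speech": 64, "vision": 512})
fused = fusion({
    "text": torch.randn(1, 768),
    "speech": torch.randn(1, 64),
    "vision": torch.randn(1, 512),
})
print(fused.shape)   # torch.Size([1, 128])
```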
Step 3: user portrayal construction
The multi-modal data fusion result, namely the user basic information extracted from the multi-modal and four associated results, are processed and output into text content by means of understanding capability and language capability of the multi-modal large model, and then are input into the multi-modal large model, and the multi-modal large model further generates detailed user portraits according to the text content, wherein the portraits comprise, but are not limited to, preference, behavior mode, emotional state, language style and the like of the user. The following operations are then performed:
step 31: feature labeling
And labeling each user characteristic, such as age, gender, behavior habit, consumption habit and the like, according to the characteristic result obtained by multi-mode data fusion.
Step 32: image filling
After the feature labeling in step 31 is performed, the labeled user information is mapped to the user portrait, so that the user information from different modes can be comprehensively, deeply and humanizedly expressed. For example, the language pattern of the user in the dialog is converted into an expression of his favorites, or the audio features of the user are converted into their possible emotional states, etc.
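A sketch of what the labeled portrait might look like as a data structure before it is rendered into prompt text; every field name and example value below is an invented placeholder, not data from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class UserPortrait:
    """Labeled user features gathered from the fused multi-modal data."""
    user_id: str
    demographics: dict = field(default_factory=dict)        # e.g. age range, region
    preferences: list[str] = field(default_factory=list)
    behavior_patterns: list[str] = field(default_factory=list)
    emotional_state: str = "neutral"
    language_style: str = "neutral"

    def to_prompt(self) -> str:
        """Render the portrait as the prompt text fed to the large model."""
        return (
            f"User {self.user_id}: demographics={self.demographics}; "
            f"preferences={', '.join(self.preferences)}; "
            f"behavior={', '.join(self.behavior_patterns)}; "
            f"emotion={self.emotional_state}; style={self.language_style}."
        )

portrait = UserPortrait(
    user_id="u42",
    demographics={"age_range": "25-34"},
    preferences=["outdoor activities", "spicy food", "dark UI themes"],
    behavior_patterns=["fast speech", "few pauses", "clicks cooking content"],
    emotional_state="happy",
    language_style="casual, uses slang",
)
print(portrait.to_prompt())
```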
Step 4: personalized policy generation
After the user representation is built, the user representation is input into the multi-modal large model as a prompt, and through self-learning of the multi-modal large model, the multi-modal large model provides personalized services customized based on the user representation after interaction, including but not limited to the following:
(1) Personalized reply: in the process of talking with the user, the reply content conforming to each user style is automatically generated according to the language mode and preference in the user portrait.
For example, if a user uses a large number of terms in a conversation or likes to reply with a phrase, the language style will be learned and imitated so that the reply returned can be closer to the user's habit.
(2) Personalized recommendation: and generating more accurate personalized recommendation strategies according to preference information in the user portrait.
For example, if the user portraits that someone likes an outdoor activity, then the interactive system will actively recommend an outdoor activity that is appropriate for the weather of the day while providing weather information; if the user prefers to be spicy in his usual diet, then the spicy series will be recommended for the user when providing nearby restaurant information. Meanwhile, the preference of the user and the special requirement of a certain time period can be captured, for example, the user pays attention to a certain hot event recently, and information related to the hot event can be given preferentially.
(3) Personalized interface: a personalized user interface is customized for the user.
For example, the tone, contrast, font size, style, etc. can be automatically adjusted based on the visual preference information of the user representation. For example, if the user prefers a dark-tinted design, the user may be provided with a dark background interface to provide the most personalized visual experience.
Based on the personalized strategies, the user can obtain more specific services, so that the satisfaction degree and the loyalty degree of the user to the services are greatly improved.
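To make the "portrait as prompt" idea concrete, a system message could be assembled from the portrait text (reusing the hypothetical `UserPortrait` from the earlier sketch) and sent ahead of each user turn. The chat-message structure below follows the common role/content chat format as an assumption; the patent does not specify an API:

```python
def build_messages(portrait_prompt: str, user_turn: str) -> list[dict]:
    """Prepend the user portrait to every conversation turn as a system prompt."""
    system = (
        "You are a personalized assistant. Tailor the reply style, "
        "recommendations and interface suggestions to this user portrait: "
        + portrait_prompt
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_turn},
    ]

messages = build_messages(portrait.to_prompt(), "Any dinner ideas near me tonight?")
# `messages` can then be passed to whichever multi-modal large model backs the system.
```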
Step 5: user portrayal optimization
The multi-mode large model automatically learns through feedback of the user, and continuously optimizes the user portrait. According to the real-time feedback and behavior of the user, when new user data is input, the multi-mode large model can comprehensively analyze the latest preference, individuality and other characteristics of the user according to the steps, namely, the step 1, the step 2, the step 3 and the step 4, and continuously optimize the user portrait, so that more accurate individualization service is provided.
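A sketch of the feedback loop described in step 5, again reusing the hypothetical `UserPortrait`; the merge rule (new observations are appended, the latest emotional state and language style win) is an assumption about one reasonable way to fold new analysis results back into the portrait:

```python
def refresh_portrait(portrait: UserPortrait, new_observations: dict) -> UserPortrait:
    """Fold freshly analyzed user data (steps 1-4 rerun on new input) into the portrait."""
    for pref in new_observations.get("preferences", []):
        if pref not in portrait.preferences:
            portrait.preferences.append(pref)
    for pattern in new_observations.get("behavior_patterns", []):
        if pattern not in portrait.behavior_patterns:
            portrait.behavior_patterns.append(pattern)
    # Most recent emotional state and language style replace the old ones.
    portrait.emotional_state = new_observations.get("emotional_state", portrait.emotional_state)
    portrait.language_style = new_observations.get("language_style", portrait.language_style)
    return portrait

portrait = refresh_portrait(portrait, {"preferences": ["hot pot"], "emotional_state": "excited"})
```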
A user personalized service system based on a multi-modal large model comprises a multi-modal data acquisition and model adaptive training module, a multi-modal data fusion module, a user portrait construction module, a personalized strategy generation module and a user portrait optimization module.
The multi-modal data acquisition and model adaptive training module collects multi-modal user data through interaction between the user and the multi-modal large model. Specifically, the collected user data include data of multiple modality types such as text, speech, images and video, as follows:
Text modality: mainly the text content entered by the user, including but not limited to search records and text chat records. The multi-modal large model analyzes it along several dimensions such as vocabulary, sentence structure and contextual topic, and can reveal the user's language patterns, preferences, needs and emotional tendencies.
Speech modality: mainly the user's voice input, for example the audio signal captured by a microphone; speech recognition and voiceprint recognition extract information such as the user's vocal characteristics, emotional state and regional dialect.
Image modality: mainly static picture information provided by the user, such as personal photos uploaded by the user and emoticons in chat records; image-processing methods extract features from the images to supplement the user information.
Video modality: mainly video information such as video chats and user footage captured by the camera. Dynamic information in the video, such as facial expressions and body movements, can be analyzed by face recognition to infer the user's gender and age, behavioral habits, emotional expression, and so on.
The model adaptive training of the multi-modal data acquisition and model adaptive training module comprises the following steps:
Data preprocessing: the collected multi-modal data are preprocessed, including data cleaning and data normalization; invalid, redundant and erroneous data are removed from the dataset to guarantee its quality and integrity.
Feature extraction: feature extraction is performed on the preprocessed data; keywords or phrases and high-dimensional features such as sentiment are extracted from the text data, and multi-level, multi-scale feature representations are extracted from the speech and visual data with a deep neural network model. The extracted features are stored as the input of model training.
Model training: the preprocessed and feature-extracted data are input into the multi-modal large model and trained with algorithms such as supervised learning. The algorithm trains on the training data, and the difference between the model's predicted value and the true value is measured by a loss function. During training, the model parameters are continuously adjusted by optimization algorithms such as gradient descent so as to minimize the value of the loss function and obtain the best training effect.
Model testing: part of the data is used for model testing to evaluate the accuracy of the model; if the model does not meet the expected result, the model is adjusted and training is repeated until the model meets the expected result.
After the data of different modality types have been collected and processed, the multi-modal data fusion module uses the multi-modal large model to extract features from the data of each modality and fuses these features to establish associations between the modalities, including but not limited to the following associations:
(1) the association of language patterns with emotional expression;
(2) the association of language patterns with behavior patterns;
(3) the association of behavior patterns with visual information;
(4) associations of other heterogeneous modality information.
Specifically:
(1) Association of language patterns with emotional expression: the language pattern is the habit and manner in which the user uses language in communication, including the words chosen, and emotional expression can be classified into seven states: sadness, anger, happiness, surprise, fear, disgust and neutral. Language patterns and emotional expression tend to move together, and the specific vocabulary or expressions a user frequently uses in dialogue can reflect a specific emotional state or psychological tendency. The specific associations are as follows:
Sadness: the user uses words in communication that depict sadness or loss, such as "upset" or "hurt", indicating that the user is experiencing sadness or frustration, or is showing a dejected reaction to something that has happened.
Anger: the user uses words expressing anger or annoyance, such as "furious" or "infuriating", indicating that the user is experiencing anger, or is showing an angry, irritated emotional tendency toward something that has happened.
Happiness: the user uses words expressing pleasure or satisfaction, such as "happy", indicating that the user is experiencing joy, or is showing a pleased emotional tendency toward something that has happened.
Surprise: the user uses words expressing surprise or astonishment, such as "wow" or "really?", indicating that the user is experiencing surprise, or is showing a surprised emotional tendency toward something that has happened.
Fear: the user uses words expressing fear or tension, such as "afraid" or "terrified", indicating that the user is experiencing fear, or is showing a fearful emotional tendency toward something that has happened.
Disgust: the user uses words expressing disgust or dislike, such as "disgusting" or "hate", indicating that the user is experiencing disgust, or is showing a disgusted emotional tendency toward something that has happened.
Neutral: the user uses neutral words, such as "okay" or "understood", indicating that the user's current emotional state is calm or that something that has happened causes no significant emotional fluctuation.
The language-understanding capability of the multi-modal large model is used here: data of different modality types are input into the multi-modal large model, which automatically captures the correspondence between vocabulary and emotion and establishes the related associations, so that the user's habits of emotional expression and preferences toward objects can be located more accurately.
(2) Association of language patterns with behavior patterns: a user's behavior patterns are often reflected in language expression; by analyzing characteristics such as the user's speech rate, intonation and pauses, behavior patterns such as thinking habits and decision tendencies can be inferred. The specific associations are as follows:
Speech rate: generally the speed at which the user speaks, i.e. the number of words spoken per minute, which directly reflects the user's mental activity and decision speed. If a user speaks particularly quickly, this suggests an agile thinker who acts fast and is used to a high-pressure, fast-paced environment. Conversely, if a user speaks slowly, this suggests a calm, deliberate person.
Intonation: intonation is the variation in pitch when the user speaks, including how high the pitch is and how strong the voice is, and it is often directly related to the user's emotional and psychological state. For example, a user speaking with a high pitch and an excited tone may be experiencing anger or excitement, while a user with a low pitch and a calm voice may be in a calm, contented or passive emotional state.
Pauses: the length and frequency of pauses are an important reflection of the user's way of thinking, revealing the depth and rhythm of their thought. More frequent pauses indicate that the user spends considerable time thinking and judging; such users may pay more attention to the depth and completeness of things and lean toward a steady behavior pattern. A user with few pauses, i.e. hardly any breaks while speaking, may think faster and have a simple, efficient decision process; the behavior pattern of such users leans toward the proactive.
The multi-modal large model obtains the user's speech characteristics by processing the user's input speech data, and thereby understands the user's behavior patterns. With these associations, the user's thinking habits and decision tendencies can be understood better, so that services and suggestions that fit the user's needs can be provided.
(3) Association of behavior patterns with visual information: when a user interacts, whether reading a web page, watching a video or clicking a link, various behavior patterns are generated, and these are closely related to what the user sees. Specifically:
The user's browsing behavior: while looking at pictures or videos the user may show strong interest in or preference for a particular thing or topic, which is reflected in behavior such as gazing at a picture for a long time or clicking it repeatedly. For example, a user who, while watching a cooking video, repeatedly clicks on the detailed description of the ingredients or on the enlarged view of a preparation step is showing a strong interest in cooking.
The visual elements the user selects: the user makes a large number of choices during interaction, and visual preferences are reflected in clicking behavior; content matching the user's visual preferences receives more browsing time and deeper interaction. The multi-modal large model can treat these choices as behavior patterns of the user and analyze them against the visual elements involved (for example product pictures, campaign posters or interface styles). For example, if the pictures the user looks at or clicks on for a long time share a similar colour and style, it can be inferred that the user probably prefers that colour and style.
Through this association analysis of behavior patterns and visual information, the user's interest in or preference for a certain type of content can be captured, so that content and an interaction interface closer to the user's interests can be provided during interaction.
(4) Associations of other heterogeneous modality information: in some cases the same user may exhibit different behavior patterns or preferences in different modalities, including text, speech, images and video. For example, a user who often discusses technology news in text conversations may prefer sports content in videos; these two seemingly contradictory pieces of information together reflect the user's multiple interests. Associating heterogeneous modality information makes it possible to mine the user's complex, multi-dimensional preferences, which is essential for constructing a comprehensive and deep user portrait.
These four kinds of associations are obtained after the large model has autonomously learned from the user data. Through this multi-modal data fusion, the data of each modality not only cover the user's basic information but also allow the user's behavior patterns, emotional states, preferences and habits to be fully understood; the data of the modalities are not isolated but complement one another, forming a comprehensive and deep user portrait. This greatly improves the accuracy and practicality of the scheme and provides a more precise and effective basis for the subsequent personalized service strategy.
The user portrait construction module inputs the multi-modal data fusion result into the multi-modal large model, from which the multi-modal large model further generates a detailed user portrait including, but not limited to, user preferences, behavior patterns, emotional state and language style, and then performs the following operations:
Feature labeling: each user characteristic, such as age, gender, behavioral habits and consumption habits, is labeled according to the feature results obtained from the multi-modal data fusion.
Portrait filling: after feature labeling, the labeled user information is mapped onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply and in a human-centered way. For example, the user's language patterns in dialogue are converted into a statement of their preferences, and the user's audio characteristics are converted into their likely emotional states.
After the user portrait is constructed, the personalized strategy generation module inputs it into the multi-modal large model as a prompt; through self-learning of the multi-modal large model, personalized services customized on the basis of the user portrait are provided after interaction, including but not limited to the following:
(1) personalized replies;
(2) personalized recommendations;
(3) a personalized interface.
Specifically:
(1) Personalized replies: during dialogue with the user, reply content matching each user's style is automatically generated according to the language patterns and preferences in the user portrait.
For example, if a user uses a large amount of jargon in conversation or likes to reply with short phrases, this language style is learned and imitated so that the replies returned are closer to the user's habits.
(2) Personalized recommendation: a more accurate personalized recommendation strategy is generated according to the preference information in the user portrait.
For example, if the user portrait indicates that someone likes outdoor activities, the interaction system will actively recommend an outdoor activity suited to the day's weather when providing weather information; if the user's usual diet leans toward spicy food, spicy dishes will be recommended when providing information about nearby restaurants. The user's preferences and the special needs of a given period can also be captured; for example, if the user has recently been following a trending event, information related to that event is given priority.
(3) Personalized interface: a personalized user interface is customized for the user.
For example, the colour tone, contrast, font size and style can be adjusted automatically on the basis of the visual preference information in the user portrait. If the user prefers a dark-toned design, a dark-background interface can be provided, giving the most personalized visual experience.
On the basis of these personalized strategies, the user receives a more targeted service, which greatly improves the user's satisfaction with and loyalty to the service.
The user portrait optimization module uses the multi-modal large model to learn automatically from user feedback and optimize the user portrait: based on the user's real-time feedback and behavior, whenever new user data is input it comprehensively analyzes the user's latest preferences, personality and other characteristics and continuously optimizes the user portrait.

Claims (9)

1. A user personalized service policy based on a multi-modal large model, comprising the steps of:
step 1: multi-mode data acquisition and model self-adaptive training;
step 2: multi-mode data fusion;
step 3: constructing a user portrait;
step 4: generating a personalized strategy;
step 5: user portrayal optimization.
2. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: the step 1 includes the steps of,
step 11: multimodal data acquisition
Through interaction of a user and a multi-mode large model, multi-mode user data are collected, wherein the user data comprise data of various mode types including characters, voice, images, videos and the like;
step 12: and (5) model self-adaptive training.
3. The user-personalized service policy based on a multimodal big model according to claim 2, wherein said step 12 comprises:
step 121: data preprocessing
Preprocessing the collected multi-modal data, including data cleansing, data normalization, and removing invalid, redundant and erroneous data from the dataset;
step 122: feature extraction
Extracting features of the preprocessed data, extracting keywords or phrases from the text data, extracting multi-level and multi-scale feature representations from the voice and visual data by using a deep neural network model, and storing the extracted features as input of model training;
step 123: training model
Inputting the preprocessed and characteristic extracted data into a multi-mode large model, training by using algorithms such as supervised learning, wherein the algorithms can train according to training data, and the difference between a predicted value and a true value of the model is measured through a loss function;
Step 124: model testing
Using part of the data to perform model test, evaluating the accuracy of the model, and if the model does not meet the expected result, adjusting the model, and returning to step 123; if the performance of the model meets the expected result, then go to the next step.
4. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: step 2 is to collect and process data of different modes, then to use multi-mode big model to extract features from data of each mode, and to fuse these features to establish relevance among modes, including but not limited to the following relevance:
(1) The association of language patterns with emotion expressions;
(2) The association of language patterns with behavior patterns;
(3) Correlation of behavior patterns with visual information;
(4) Association of other heterogeneous modality information.
5. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: step 3 includes inputting the result of the multimodal data fusion into a multimodal big model, from which the multimodal big model further generates detailed user portraits including, but not limited to, user preferences, behavioral patterns, emotional states, and language styles, followed by the following operations:
Step 31: feature labeling
Labeling each user characteristic according to the characteristic result obtained by multi-mode data fusion;
step 32: image filling
The user information marked by the multi-mode large model is mapped to the user portrait, so that the user information from different modes can be expressed comprehensively, deeply and humanizedly.
6. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: step 4 includes inputting the user portrait as a prompt into the multi-modal large model after constructing the user portrait, and providing personalized services customized based on the user portrait after interaction through self-learning of the multi-modal large model, including but not limited to the following:
(1) Personalized reply;
(2) Personalized recommendation;
(3) And personalizing the interface.
7. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: the step 5 comprises that the multi-mode large model can automatically learn through feedback of the user to optimize the user portrait, and according to real-time feedback and behaviors of the user, when new user data is input, the multi-mode large model comprehensively analyzes the latest preference, individuality and other characteristics of the user according to the steps, and continuously optimizes the user portrait.
8. A user personalized service system based on a multi-mode large model is characterized in that: the system comprises a multi-mode data acquisition and model self-adaptive training module, a multi-mode data fusion module, a user portrait construction module, a personalized strategy generation module and a user portrait optimization module.
9. The multi-modal large model based user personalized service system of claim 8 wherein: the multi-modal data acquisition model and the self-adaptive training module collect multi-modal user data through interaction between a user and a multi-modal large model, wherein the user data comprises data of various modal types including characters, voice, images, videos and the like;
preprocessing the collected multi-modal data, including data cleansing, data normalization, and removing invalid, redundant and erroneous data from the dataset;
extracting features from the preprocessed data: extracting keywords or phrases from the text data, extracting multi-level, multi-scale feature representations from the voice and visual data using a deep neural network model, and storing the extracted features as the input for model training;
inputting the preprocessed, feature-extracted data into the multi-modal large model and training it with supervised learning or similar algorithms, which learn from the training data; a loss function measures the difference between the model's predicted value and the true value, and during training the model parameters are continuously adjusted by optimization algorithms such as gradient descent so as to minimize the value of the loss function (a minimal training-loop sketch follows claim 9);
using a held-out portion of the data to test the model and evaluate its accuracy; if the model does not meet the expected result, adjusting the model and returning to model training until the expected result is met;
after the data of the different modality types have been collected and processed, the multi-modal data fusion module uses the multi-modal large model to extract features from the data of each modality and fuses these features to establish associations among the modalities, including but not limited to the following associations:
(1) The association of language patterns with emotional expressions;
(2) The association of language patterns with behavior patterns;
(3) The association of behavior patterns with visual information;
(4) Associations of other heterogeneous modality information;
the user portrait construction module inputs the result of the multi-modal data fusion into the multi-modal large model, which further generates a detailed user portrait covering, but not limited to, user preferences, behavior patterns, emotional state and language style, after which the following operations are carried out:
feature labeling, namely labeling each user feature according to the feature results obtained through the multi-modal data fusion;
portrait filling, namely mapping the user information labeled by the multi-modal large model onto the user portrait, so that user information from the different modalities is expressed comprehensively, deeply and in a humanized manner;
the personalized policy generation module, after the user portrait has been constructed, inputs the user portrait into the multi-modal large model as a prompt and, through self-learning of the multi-modal large model during interaction, provides personalized services customized on the basis of the user portrait, including but not limited to the following:
(1) Personalized reply;
(2) Personalized recommendation;
(3) Personalized interface;
the user portrait optimization module uses the multi-modal large model to learn automatically from user feedback in order to optimize the user portrait; based on the user's real-time feedback and behavior, whenever new user data is input it comprehensively analyzes the user's latest preferences, personality and other characteristics and continuously optimizes the user portrait.
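Referring back to the supervised-training clause of claim 9, the following is a minimal training-loop sketch in PyTorch: a loss function measures the gap between prediction and ground truth, and gradient descent adjusts the parameters to minimize it. The FusionClassifier head, feature dimension, class count and learning rate are illustrative assumptions; the data loader is assumed to yield pre-extracted, fused multi-modal feature vectors and integer labels.

```python
# Hedged sketch of the supervised-training clause: only the loss-measures-error /
# gradient-descent-minimizes-it pattern is illustrated; dimensions are assumptions.
import torch
from torch import nn

class FusionClassifier(nn.Module):
    """Small head over pre-extracted, fused multi-modal features (assumed 128-dim)."""
    def __init__(self, feature_dim: int = 128, num_classes: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 64), nn.ReLU(), nn.Linear(64, num_classes)
        )

    def forward(self, x):
        return self.net(x)

def train(model, loader, epochs: int = 3, lr: float = 1e-3):
    criterion = nn.CrossEntropyLoss()                      # difference between prediction and truth
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for features, labels in loader:
            optimizer.zero_grad()
            loss = criterion(model(features), labels)
            loss.backward()
            optimizer.step()                               # gradient-descent parameter update
    return model
```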
CN202311116653.1A 2023-09-01 2023-09-01 User personalized service strategy and system based on multi-mode large model Pending CN117235354A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311116653.1A CN117235354A (en) 2023-09-01 2023-09-01 User personalized service strategy and system based on multi-mode large model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311116653.1A CN117235354A (en) 2023-09-01 2023-09-01 User personalized service strategy and system based on multi-mode large model

Publications (1)

Publication Number Publication Date
CN117235354A true CN117235354A (en) 2023-12-15

Family

ID=89085354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311116653.1A Pending CN117235354A (en) 2023-09-01 2023-09-01 User personalized service strategy and system based on multi-mode large model

Country Status (1)

Country Link
CN (1) CN117235354A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688621A (en) * 2024-02-02 2024-03-12 新疆七色花信息科技有限公司 Traceability adhesive tape, traceability system and traceability method

Similar Documents

Publication Publication Date Title
US10977452B2 (en) Multi-lingual virtual personal assistant
US20210081056A1 (en) Vpa with integrated object recognition and facial expression recognition
US11226673B2 (en) Affective interaction systems, devices, and methods based on affective computing user interface
CN111459290B (en) Interactive intention determining method and device, computer equipment and storage medium
CN109416816B (en) Artificial intelligence system supporting communication
CN105895087B (en) Voice recognition method and device
CN109918650B (en) Interview intelligent robot device capable of automatically generating interview draft and intelligent interview method
US9213558B2 (en) Method and apparatus for tailoring the output of an intelligent automated assistant to a user
US20180352091A1 (en) Recommendations based on feature usage in applications
CN110110169A (en) Man-machine interaction method and human-computer interaction device
US20180129647A1 (en) Systems and methods for dynamically collecting and evaluating potential imprecise characteristics for creating precise characteristics
CN113380271B (en) Emotion recognition method, system, device and medium
Shen et al. Kwickchat: A multi-turn dialogue system for aac using context-aware sentence generation by bag-of-keywords
US10770072B2 (en) Cognitive triggering of human interaction strategies to facilitate collaboration, productivity, and learning
CN117235354A (en) User personalized service strategy and system based on multi-mode large model
Yordanova et al. Automatic detection of everyday social behaviours and environments from verbatim transcripts of daily conversations
Lee et al. A temporal community contexts based funny joke generation
Karpouzis et al. Induction, recording and recognition of natural emotions from facial expressions and speech prosody
Bianchi-Berthouze Kansei-mining: Identifying visual impressions as patterns in images
Hernández et al. User-centric Recommendation Model for AAC based on Multi-criteria Planning
KR20230099936A (en) A dialogue friends porviding system based on ai dialogue model
Wattearachchi et al. Emotional Keyboard: To Provide Adaptive Functionalities Based on the Current User Emotion and the Context.
Xueming et al. Application of Emotional Voice user Interface in Securities Industry
CN117786079A (en) Dialogue method and system based on self-adaptive learning and context awareness AI proxy
Stappen Multimodal sentiment analysis in real-life videos

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination