CN117235354A - User personalized service strategy and system based on multi-mode large model - Google Patents
- Publication number: CN117235354A
- Application number: CN202311116653.1A
- Authority: CN (China)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention belongs to the technical field of multimodal information processing, and specifically relates to a user personalized service strategy and system based on a multimodal large model. The method comprises the following steps: step 1: multimodal data acquisition and model adaptive training; step 2: multimodal data fusion; step 3: user portrait construction; step 4: personalized strategy generation; step 5: user portrait optimization. The beneficial effects of the invention are as follows: compared with traditional personalized strategies, the invention uses a multimodal large model to extract multimodal user information, generating a comprehensive, multi-dimensional user portrait that is applied in the subsequent interaction system, which improves the accuracy of the user portrait. Compared with existing single-modality portrait construction methods (such as relying on text or voice only), the invention introduces a multimodal large model that can fuse and process user data from multiple modalities, including text, voice, images, and video.
Description
Technical Field
The invention belongs to the technical field of multi-mode information processing, and particularly relates to a user personalized service strategy and system based on a multi-mode large model.
Background
As technology develops rapidly and penetrates ever more fields, users' demands for experiences that match their expectations and are personalized continue to grow, even accelerating. In this context, industries pay increasing attention to refined user service strategies, recognizing them as a core element of enterprise survival, development, competitive advantage, and even market leadership. Establishing a user portrait therefore becomes an essential step. User portraits are the core tool with which enterprises understand their user base in depth, design products, and provide services tailored to personalized requirements, and they play a vital role in attracting new potential customers and retaining existing loyal users.
Conventional methods for constructing user portraits fall largely into two types. One relies on manual input and feedback from the user, such as user configuration, questionnaires, and information filled in at registration, to obtain the user's needs and interests. The other models user behavior with deep learning or complex machine learning models and mines the user's behavioral habits from those models.
The existing user portrait construction method has the following defects:
(1) The user portrait precision is low
Traditional methods such as user configuration, questionnaires, or registration forms often reduce users' complex demands and personalized traits to a handful of numeric values or categorical features. They lack a deep, comprehensive understanding of the user; they neglect the influence of emotion, preference, living habits, and social environment, and they ignore the temporal dimension of user behavior (for example, users' interests change over time). Moreover, out of privacy concerns and for other reasons, users do not always answer truthfully or objectively, so portraits generated from such information are insufficiently accurate and may even be wrong.
(2) High manpower cost input
Whether information is collected through questionnaires or models are trained with deep learning or complex machine learning, large amounts of human resources are required. The former needs professional data analysts to design targeted questionnaires and to analyze and distill the results. The latter requires substantial developer time, including data labeling, model design, and debugging and optimization. The labor cost of current user portrait construction methods is therefore relatively high.
(3) Information storage in large-model-based interaction systems is not persistent: existing multimodal large-model interaction systems reset after each round of dialogue, which means the model cannot effectively remember earlier user information, such as the preferences and habits the user displayed in previous interactions. This limits the model's understanding of and adaptation to the user's personalized needs, and it also reduces the accuracy and degree of personalization of the service.
Disclosure of Invention
The invention aims to provide a user personalized service strategy based on a multimodal large model, to solve the prior-art problems of single-modality information extraction, high labor cost, and non-persistent information storage in large-model-based interactive systems.
The technical scheme of the invention is as follows: a user-personalized service policy based on a multimodal big model, comprising the steps of:
step 1: multi-mode data acquisition and model self-adaptive training;
step 2: multi-mode data fusion;
step 3: constructing a user portrait;
step 4: generating a personalized strategy;
step 5: user portrayal optimization.
Step 1 includes the following steps:
step 11: multimodal data acquisition
Through interaction between the user and the multimodal large model, multimodal user data are collected, where the user data include data of multiple modality types, including text, voice, images, and video;
Step 12: model adaptive training.
The step 12 includes:
step 121: data preprocessing
Preprocessing the collected multi-modal data, including data cleansing, data normalization, and removing invalid, redundant and erroneous data from the dataset;
step 122: feature extraction
Extracting features of the preprocessed data, extracting keywords or phrases from the text data, extracting multi-level and multi-scale feature representations from the voice and visual data by using a deep neural network model, and storing the extracted features as input of model training;
step 123: training model
Inputting the preprocessed, feature-extracted data into the multimodal large model and training with supervised learning algorithms; these algorithms train on the training data, and the difference between the model's predicted value and the true value is measured by a loss function;
step 124: model testing
Using part of the data to perform model test, evaluating the accuracy of the model, and if the model does not meet the expected result, adjusting the model, and returning to step 123; if the performance of the model meets the expected result, then go to the next step.
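Steps 123 and 124 can be sketched in miniature. The toy logistic model below is an illustrative assumption, not the patent's multimodal large model: a loss function measures the gap between predictions and true labels, gradient descent adjusts the parameters, and a held-out portion of the data tests accuracy.

```python
import math
import random

random.seed(0)

# Toy stand-in for fused multimodal features: two numbers per sample
# (a real system would consume text/voice/image/video features).
data = []
for _ in range(200):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    label = 1.0 if 1.5 * x1 - 2.0 * x2 > 0 else 0.0
    data.append(((x1, x2), label))

def predict(w, x):
    """Logistic prediction; the exponent is clamped to avoid overflow."""
    z = w[0] * x[0] + w[1] * x[1]
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))

def train(samples, lr=0.5, steps=300):
    """Step 123: gradient descent minimizes the logistic loss, i.e. the
    difference between the model's predictions and the true labels."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = [0.0, 0.0]
        for x, y in samples:
            err = predict(w, x) - y        # gradient of the loss
            g[0] += err * x[0]
            g[1] += err * x[1]
        w[0] -= lr * g[0] / len(samples)
        w[1] -= lr * g[1] / len(samples)
    return w

# Step 124: hold out part of the data and evaluate accuracy.
train_set, test_set = data[:150], data[150:]
w = train(train_set)
accuracy = sum((predict(w, x) > 0.5) == (y > 0.5)
               for x, y in test_set) / len(test_set)
```

If the measured accuracy fell short of the expected result, the loop back to step 123 would mean retraining with adjusted settings.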
Step 2: after collecting and processing data of different modality types, the trained multimodal large model is used to extract features from the data of each modality, and these features are fused to establish associations among modalities, including but not limited to the following:
(1) The association of language patterns with emotion expressions;
(2) The association of language patterns with behavior patterns;
(3) Correlation of behavior patterns with visual information;
(4) Association of other heterogeneous modality information.
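The fusion step above can be sketched under a simple early-fusion assumption (concatenation); the feature names are invented for illustration, and a real system would learn the cross-modal associations from the joint representation:

```python
# Hypothetical per-modality feature vectors from the extractors in step 2.
text_features = [0.8, 0.1]     # e.g. sentiment score, formality
voice_features = [0.7, 0.3]    # e.g. pitch-derived arousal, speech rate
visual_features = [0.9]        # e.g. smile intensity from video frames

def fuse(*modalities):
    """Early fusion: concatenate per-modality features into one joint
    vector so a downstream model can learn cross-modal associations."""
    joint = []
    for features in modalities:
        joint.extend(features)
    return joint

joint = fuse(text_features, voice_features, visual_features)
```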
Step 3 includes inputting the result of the multimodal data fusion into the multimodal large model, which further generates a detailed user portrait, including but not limited to user preferences, behavior patterns, emotional state, and language style, followed by these operations:
step 31: feature labeling
Labeling each user characteristic according to the characteristic result obtained by multi-mode data fusion;
step 32: image filling
The user information labeled by the multimodal large model is mapped onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply, and in a human-centered way.
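The feature labeling and portrait filling of steps 31 and 32 can be sketched as below; the portrait fields and the `fill` merge rule are illustrative assumptions, not the patent's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class UserPortrait:
    """Illustrative portrait fields from step 3; names are assumptions."""
    preferences: dict = field(default_factory=dict)
    behavior_patterns: list = field(default_factory=list)
    emotional_state: str = "neutral"
    language_style: str = "unknown"

    def fill(self, labeled: dict) -> None:
        """Step 32: map labeled multimodal features onto the portrait."""
        self.preferences.update(labeled.get("preferences", {}))
        self.behavior_patterns.extend(labeled.get("behaviors", []))
        self.emotional_state = labeled.get("emotion", self.emotional_state)
        self.language_style = labeled.get("style", self.language_style)

portrait = UserPortrait()
portrait.fill({"preferences": {"video_topic": "cooking"},
               "behaviors": ["fast speech, few pauses"],
               "emotion": "happy"})
```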
Step 4 includes: after the user portrait is constructed, inputting it into the multimodal large model as a prompt; through the self-learning of the multimodal large model during interaction, personalized services customized to the user portrait are provided, including but not limited to the following:
(1) Personalized reply;
(2) Personalized recommendation;
(3) And personalizing the interface.
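Feeding the portrait to the model as a prompt, as step 4 describes, can be sketched as simple serialization; the field names and wording are illustrative, not taken from the patent:

```python
def portrait_to_prompt(portrait: dict) -> str:
    """Serialize the user portrait as a prompt prefix (step 4), so the
    model conditions its replies, recommendations, and interface hints
    on the user's known traits."""
    lines = ["You are serving a user with this profile:"]
    for key, value in sorted(portrait.items()):
        lines.append(f"- {key}: {value}")
    lines.append("Tailor replies, recommendations, and interface hints accordingly.")
    return "\n".join(lines)

prompt = portrait_to_prompt({"preference": "cooking videos",
                             "emotion": "happy",
                             "language_style": "casual"})
```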
Step 5 comprises: the multimodal large model automatically learns from user feedback to optimize the user portrait; based on the user's real-time feedback and behavior, whenever new user data are input, the model comprehensively analyzes the user's latest preferences, personality, and other characteristics according to the steps above, continuously optimizing the user portrait.
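The continuous portrait optimization of step 5 can be sketched as an incremental merge; the merge rule here (newer values overwrite older ones, unseen keys are added) is an illustrative assumption:

```python
def refine_portrait(portrait: dict, feedback: dict) -> dict:
    """Step 5 sketch: merge fresh user signals into the stored portrait
    so later interactions use the latest preferences."""
    updated = dict(portrait)   # leave the previous portrait unchanged
    updated.update(feedback)   # newer signals win; new traits are added
    return updated

old = {"preference": "tech news", "emotion": "neutral"}
new = refine_portrait(old, {"preference": "sports videos"})
```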
A user personalized service system based on a multi-mode large model comprises a multi-mode data acquisition and model self-adaptive training module, a multi-mode data fusion module, a user portrait construction module, a personalized strategy generation module and a user portrait optimization module.
The multimodal data acquisition and model adaptive training module collects multimodal user data through interaction between the user and the multimodal large model, where the user data include data of multiple modality types, including text, voice, images, and video;
preprocessing the collected multi-modal data, including data cleansing, data normalization, and removing invalid, redundant and erroneous data from the dataset;
extracting features of the preprocessed data, extracting keywords or phrases from the text data, extracting multi-level and multi-scale feature representations from the voice and visual data by using a deep neural network model, and storing the extracted features as input of model training;
inputting the preprocessed, feature-extracted data into the multimodal large model and training with supervised learning algorithms; the algorithms train on the training data, the difference between the model's predicted value and the true value is measured by a loss function, and during training the model parameters are continuously adjusted by optimization algorithms such as gradient descent to minimize the value of the loss function;
using part of data to perform model test, evaluating the accuracy of the model, if the model does not meet the expected result, adjusting the model, and returning to perform model training again until the model meets the expected result;
after the multi-mode data fusion module collects and processes the data of different mode types, the multi-mode large model is used for extracting features from the data of each mode, and the features are fused to establish the relevance among the modes, including but not limited to the following relevance:
(1) The association of language patterns with emotion expressions;
(2) The association of language patterns with behavior patterns;
(3) Correlation of behavior patterns with visual information;
(4) Correlation of other heterogeneous modality information;
the user portrait construction module inputs the multimodal data fusion result into the multimodal large model, which further generates a detailed user portrait, including but not limited to user preferences, behavior patterns, emotional state, and language style, and then performs the following operations:
Feature labeling, namely labeling each user feature according to a feature result obtained through multi-mode data fusion;
portrait filling, namely mapping the user information labeled by the multimodal large model onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply, and in a human-centered way;
the personalized strategy generation module, after the user portrait is constructed, inputs it into the multimodal large model as a prompt; through the self-learning of the multimodal large model during interaction, personalized services customized to the user portrait are provided, including but not limited to:
(1) Personalized reply;
(2) Personalized recommendation;
(3) A personalized interface;
the user portrait optimization module uses the multimodal large model to learn automatically from user feedback and optimize the user portrait; based on the user's real-time feedback and behavior, whenever new user data are input it comprehensively analyzes the user's latest preferences, personality, and other characteristics, continuously optimizing the user portrait.
The beneficial effects of the invention are as follows: compared with traditional personalized strategies, the method and system disclosed by the invention use the multimodal large model to extract the user's multimodal information, generate a comprehensive, multi-dimensional user portrait, and apply it in the subsequent interaction system, with the following advantages and positive effects:
(1) Accuracy of the user portrait: compared with existing single-modality portrait construction (such as relying only on text or voice), the method introduces a multimodal large model that can fuse and process user data from multiple modalities, including text, voice, images, and video. With such rich, comprehensive data, the multimodal large model can generate more accurate, fine-grained user portraits and achieve an all-round, deep understanding of each user's needs. Portrait accuracy is greatly improved and the range of applications widened: high-precision user portraits support more accurate recommendation, classification, and prediction, effectively improving user experience and satisfaction;
(2) Continuity of the interaction system based on the multimodal large model: unlike existing multimodal large-model interaction systems that reset in every round of dialogue, the interaction system in this method does not reset after each interaction with the user; instead it accumulates and learns user information to generate an accurate user portrait, so that each user's information is stored more permanently and a durable personalized service is provided;
(3) Convenience of system maintenance: compared with the existing approach of introducing a reward mechanism for model adjustment or retraining, the multimodal large model can optimize itself from new user data through its own learning and adjustment capability. Meanwhile, user portraits can be entered and managed through intuitive, simple prompts. For example, if a user's portrait needs to be updated or corrected, modifying the corresponding prompt is enough for the change to be reflected immediately in the model's service strategy, greatly improving the efficiency and convenience of system maintenance;
(4) Uniqueness of the user experience: based on a more accurate user portrait, a highly customized service experience can be provided. Whether replying personally in interactive dialogue, recommending personalized content according to user preference and real-time needs, or automatically adjusting the interface to the user's visual preferences, a personalized strategy tailored to the individual can be generated, giving each user a genuinely unique, "one face per person" service experience.
Drawings
Fig. 1 is a flowchart of a user personalized service policy based on a multi-mode big model provided by the invention.
Detailed Description
The invention will be described in further detail with reference to the accompanying drawings and specific examples.
Multimodal large models (such as GPT-4) perform excellently across tasks such as language and image processing, offering a new approach to personalized strategies. They possess understanding and computing power far beyond earlier models, can efficiently analyze and process large datasets, and can autonomously learn and adjust. Moreover, a multimodal large model can analyze inputs in many forms, including language, images, sound, and video; even when the environment is complex or the information is fuzzy, it can quickly identify important features and apply them effectively, recognizing the user's basic preferences and behaviors and further mining the fine-grained habits and subtle changes hidden behind the data. However complex a user's preferences and behavior patterns, the model can understand and learn them, providing highly personalized service for each user. The multimodal large model is also remarkably convenient: it supports operation through prompts, greatly lowering the barrier to use, so that operators can work with the model efficiently without deep technical knowledge, improving the implementation of personalized services.
The invention provides a user personalized service strategy based on a multimodal large model. First, the information-processing capability of the multimodal large model is used to analyze the user's behaviors and preferences across modalities such as voice, text, images, and video, obtaining personalized user information such as age, gender, region, occupation, and interests. Second, based on this information, the multimodal large model generates a more accurate, fine-grained user portrait. Finally, the generated user portrait is applied to the multimodal large model or other products and services to form a precise user personalized service strategy and improve user experience and satisfaction.
As shown in fig. 1, a user personalized service policy based on a multi-mode big model includes the following steps:
step 1: multi-modal data acquisition and model adaptive training
Step 11: multimodal data acquisition
And collecting multi-mode user data through interaction of the user and the multi-mode large model on the premise of respecting and protecting the privacy of the user. Specifically, the collected user data includes data of a plurality of modality types including text, voice, image, video, and the like.
Text modality: mainly text content entered by the user, including but not limited to search records and text chat records. The multimodal large model analyzes this text along dimensions such as vocabulary, sentence structure, and contextual topic, revealing the user's language patterns, preferences, needs, and emotional tendencies.
Voice modality: mainly the user's voice input, such as the audio signal captured by a microphone. Through speech recognition and voiceprint recognition, information such as the user's vocal characteristics, emotional state, and regional dialect is extracted.
Image modality: mainly static picture information provided by the user, such as uploaded personal photos and emoticons or stickers in chat records. Image-processing methods extract features from the images to supplement the user information.
Video modality: mainly video information such as video chats and user footage captured by a camera. Dynamic information in the video, such as facial expressions and body movements, can be analyzed through facial recognition to infer the user's gender, age, behavioral habits, and emotional expression.
Step 12: model adaptive training
Step 121: data preprocessing
Preprocessing the collected multi-modal data, including data cleaning, data normalization, and the like. Invalid, redundant and erroneous data is removed from the data set, guaranteeing the quality and integrity of the data.
Step 122: feature extraction
Extracting features of the preprocessed data, extracting keywords or phrases of the text data, extracting high-dimensional features of emotion analysis and the like; and extracting multi-level and multi-scale characteristic representation from the voice and visual data by using a deep neural network model. The extracted features are stored as input to model training.
Step 123: training model
The data after pretreatment and feature extraction is input into a multi-mode large model, and the algorithms such as supervised learning and the like are used for training. The algorithm will train on the training data and measure the difference between the predicted and actual values of the model by the loss function. In training, model parameters are continuously adjusted through optimization algorithms such as gradient descent and the like so as to minimize the value of the loss function, and therefore the optimal training effect is obtained.
Step 124: model testing
And (3) performing model test by using part of data, evaluating the accuracy of the model, and if the model does not meet the expected result, adjusting the model and returning to the step of training the model. If the performance of the model meets the expected result, then go to the next step.
Step 2: multimodal data fusion
Features are extracted from data of all modes by using a trained multi-mode large model, and are fused to establish relevance among all modes, including but not limited to the following relevance:
(1) Association of language patterns with emotion expressions: language patterns are the habits and ways in which the user uses language in communication, including choice of words; emotional expression can be classified into seven states: sadness, anger, happiness, surprise, fear, disgust, and neutral. Language patterns and emotional expression tend to move together, and the specific vocabulary or expressions a user frequently uses in dialogue can reflect a specific emotional state or psychological tendency. The specific associations are as follows:

Sadness: the user uses vocabulary depicting sadness or loss in communication, such as "upset" or "heartbroken", indicating that the user is experiencing a sad or frustrated emotion, or is showing a dejected emotional tendency toward the occurrence of something.

Anger: the user uses vocabulary expressing anger or annoyance, such as "angry" or "furious", indicating that the user is experiencing anger, or is showing an angry, irritated emotional tendency toward the occurrence of something.

Happiness: the user uses vocabulary expressing pleasure or satisfaction, such as "happy" or "delighted", indicating that the user is experiencing a happy, joyful emotion, or is showing a pleased emotional tendency toward the occurrence of something.

Surprise: the user uses vocabulary expressing surprise or astonishment, such as "wow" or "unbelievable", indicating that the user is experiencing surprise, or is showing a surprised emotional tendency toward the occurrence of something.

Fear: the user uses vocabulary expressing fear or tension, such as "scared" or "terrified", indicating that the user is experiencing fear, or is showing a fearful emotional tendency toward the occurrence of something.

Disgust: the user uses vocabulary expressing disgust or dislike, such as "disgusting" or "hate", indicating that the user is experiencing disgust, or is showing a disgusted emotional tendency toward the occurrence of something.

Neutral: the user uses neutral vocabulary, such as "okay" or "understood", indicating that the user's current emotional state is calm, or that the occurrence of something causes no significant emotional fluctuation.
This language understanding capability is realized through the multimodal large model: data of different modality types are input into the model, which automatically captures the correspondence between vocabulary and emotion and establishes the related associations, so that the user's emotional expression habits and preferences can be located more accurately.
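The word-to-emotion mapping for the seven states above can be sketched as a keyword lookup. The English cue words are illustrative stand-ins; a deployed system would use a trained classifier in the multimodal large model rather than exact matching:

```python
# Assumed cue words for the seven emotional states described above.
EMOTION_CUES = {
    "sad": ["upset", "heartbroken"],
    "angry": ["angry", "furious"],
    "happy": ["happy", "delighted"],
    "surprised": ["wow", "unbelievable"],
    "afraid": ["scared", "terrified"],
    "disgusted": ["disgusting", "hate"],
}

def classify_emotion(utterance: str) -> str:
    """Return the first emotional state whose cue words appear in the
    utterance; with no strong cue, fall back to the neutral state."""
    text = utterance.lower()
    for emotion, cues in EMOTION_CUES.items():
        if any(cue in text for cue in cues):
            return emotion
    return "neutral"
```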
(2) Association of language patterns with behavior patterns: a user's behavior patterns are often reflected in language expression; by analyzing characteristics such as speech rate, intonation, and pauses, behavior patterns such as thinking habits and decision tendencies can be inferred. The specific associations are as follows:
Speech rate: the speed at which the user speaks, i.e., the number of words spoken per minute, which directly reflects the user's mental tempo and decision speed. A user who speaks particularly quickly is likely an agile thinker who acts fast and is accustomed to high-pressure, fast-paced environments. Conversely, a user who speaks slowly is more likely calm and deliberate.
Intonation: the change in pitch when the user speaks, including pitch level and vocal intensity, which is often directly related to the user's emotional and psychological state. For example, a user with a higher intonation and an excited tone may be experiencing anger or excitement, while a user with a lower intonation and a calm voice may be in a calm, content, or passive emotional state.
Pauses: the length and frequency of pauses are an important reflection of the user's thinking style, revealing depth of thought and sense of rhythm. Frequent pauses indicate that the user spends considerable time thinking and judging; such users may focus more on the depth and comprehensiveness of matters and tend toward a steady behavior pattern. Users who pause rarely, speaking with almost no breaks, may think faster, with a simple and efficient decision process, and their behavior patterns tend to be more aggressive.
The multi-mode large model obtains the voice characteristics of the user through processing the input voice data of the user to understand the behavior mode of the user, and by means of the association, the thinking habit and the decision tendency of the user can be better known, so that services and suggestions which meet the requirements of the user are provided.
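The cues above might be quantified as in the sketch below; the thresholds (150 words per minute, 2 pauses per minute) are illustrative assumptions, not values from the patent:

```python
def speech_profile(word_count: int, duration_s: float, pauses: list) -> dict:
    """Turn raw speech measurements into the behavioral cues above:
    speech rate suggests mental tempo, pause frequency suggests
    decision style."""
    minutes = duration_s / 60.0
    wpm = word_count / minutes                 # words per minute
    pause_rate = len(pauses) / minutes         # pauses per minute
    return {
        "speech_rate_wpm": round(wpm, 1),
        "tempo": "fast-paced" if wpm > 150 else "deliberate",
        "pause_rate_per_min": round(pause_rate, 1),
        "decision_style": "reflective" if pause_rate > 2 else "decisive",
    }

# Two minutes of speech, 320 words, two noticeable pauses.
profile = speech_profile(word_count=320, duration_s=120.0, pauses=[1.2, 0.8])
```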
(3) Association of behavior patterns with visual information: when a user interacts, whether reading a web page, watching a video, or clicking a link, various behavior patterns are generated, and these are closely related to the visual information the user sees, as follows:
Browsing behavior of the user: a user may show strong interest in or preference for a particular thing or topic while viewing a picture or video, reflected in long gazes at or repeated clicks on the picture. For example, if a user watching a cooking video frequently clicks the detailed description of the ingredients or an enlarged view of a preparation step, this indicates a strong interest in cooking.
Visual elements selected by the user: users make many choices during interaction, and visual preference is reflected in their clicks; content matching that preference receives longer browsing time and deeper interaction. The multi-modal large model can treat these choices as part of the user's behavior pattern and analyze the visual elements involved (e.g., product pictures, activity posters, interface styles). For example, if the pictures a user gazes at or clicks for a long time share a similar color and style, it can be inferred that the user may prefer that color and style.
Through the association analysis of behavior patterns and visual information, the user's interest in or preference for a certain type of content can be obtained, so that content and interaction interfaces closer to the user's interests can be provided during interaction.
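One loose way to operationalize the behavior/visual association is to score topics by dwell time and clicks. The weights below (1 point per click, 1 point per 10 s of gaze) are assumptions made for this sketch, not values from the patent.

```python
# Illustrative interest scoring from browsing behavior; weights are invented.
from collections import defaultdict

def interest_scores(events):
    """events: list of dicts like {"topic": "cooking", "dwell_s": 42, "clicks": 3}."""
    scores = defaultdict(float)
    for e in events:
        scores[e["topic"]] += e.get("clicks", 0) + e.get("dwell_s", 0) / 10.0
    # highest score first, so the top entry would drive recommendations
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```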
(4) Association of other heterogeneous modality information: the same user may exhibit different behavior patterns or preferences across different modalities, including text, voice, image, and video. For example, a user may often discuss technology news in text chats yet prefer sports events in the videos they watch; these two seemingly contradictory pieces of information together reflect the user's multiple interests. Associating heterogeneous modality information can surface the user's complex, multi-dimensional preferences, which is essential for building a comprehensive, in-depth user portrait.
Through multi-modal data fusion, the data of each modality not only cover the user's basic information but also allow the user's behavior patterns, emotional states, and preference habits to be fully understood; the modalities are not isolated but complement one another, forming a comprehensive and in-depth user portrait. This greatly improves the accuracy and practicality of the scheme and provides a more precise and effective basis for the subsequent personalized service strategy.
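As a minimal, assumed illustration of "fusion" (the patent leaves the mechanism to the multi-modal large model), per-modality feature vectors can be concatenated into one fixed-layout user representation, zero-filling any missing modality:

```python
# Toy fusion by concatenation; real systems would fuse learned embeddings.

def fuse(modalities, dims):
    """modalities: dict name -> feature list; dims: dict name -> expected length.
    Missing modalities are zero-filled so the fused vector has a fixed layout."""
    fused = []
    for name in sorted(dims):                      # deterministic layout
        vec = modalities.get(name, [0.0] * dims[name])
        assert len(vec) == dims[name], f"bad length for {name}"
        fused.extend(vec)
    return fused
```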
Step 3: user portrayal construction
The multi-modal data fusion result, namely the user's basic information extracted from the modalities together with the four kinds of associations, is processed and output as text content by means of the understanding and language capabilities of the multi-modal large model. This text is then input into the multi-modal large model, which further generates a detailed user portrait from it, covering, but not limited to, the user's preferences, behavior patterns, emotional state, and language style. The following operations are then performed:
step 31: feature labeling
Each user characteristic obtained from the multi-modal data fusion result, such as age, gender, behavior habits, and consumption habits, is labeled.
Step 32: image filling
After the feature labeling of step 31, the labeled user information is mapped onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply, and in a human-centered way. For example, the user's language pattern in dialogue is converted into an expression of their preferences, or the user's audio features are converted into their probable emotional state.
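Steps 31 and 32 can be pictured as merging labeled (modality, field, value) triples into a single portrait structure; the field names used here are hypothetical, invented for the sketch:

```python
# Hedged sketch of feature labeling + portrait filling.

def fill_portrait(labels):
    """labels: list of (modality, field, value) triples from feature labeling."""
    portrait = {}
    for modality, field, value in labels:
        # group values by portrait field, keeping track of the source modality
        portrait.setdefault(field, {})[modality] = value
    return portrait
```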
Step 4: personalized policy generation
After the user portrait is built, it is input into the multi-modal large model as a prompt; through the model's self-learning, personalized services customized to the user portrait are provided after interaction, including but not limited to the following:
(1) Personalized reply: during dialogue with the user, reply content matching each user's style is generated automatically according to the language patterns and preferences in the user portrait.
For example, if a user uses many specialized terms in conversation or likes to reply with short phrases, this language style is learned and imitated so that the returned replies come closer to the user's habits.
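One plausible realization of feeding the portrait to the model "as a prompt" is string templating. The template wording below is an assumption, since the patent does not specify a prompt format:

```python
# Hypothetical prompt construction from a portrait dict; field names assumed.

def build_prompt(portrait, user_message):
    style = portrait.get("language_style", "neutral")
    prefs = ", ".join(portrait.get("preferences", [])) or "none recorded"
    return (f"User style: {style}. Known preferences: {prefs}.\n"
            f"Reply in the user's style.\n"
            f"User: {user_message}\n"
            f"Assistant:")
```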
(2) Personalized recommendation: and generating more accurate personalized recommendation strategies according to preference information in the user portrait.
For example, if the user portrait indicates that the user likes outdoor activities, the interactive system will, when providing weather information, actively recommend an outdoor activity suited to that day's weather; if the user's usual diet favors spicy food, spicy dishes will be recommended when providing nearby restaurant information. Beyond stable preferences, time-bound needs can also be captured: if the user has recently been following a particular trending event, information related to that event can be given priority.
(3) Personalized interface: a personalized user interface is customized for the user.
For example, hue, contrast, font size, and style can be adjusted automatically based on the visual preference information in the user portrait: if the user prefers dark-toned designs, a dark-background interface can be provided, delivering a highly personalized visual experience.
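Applying visual preferences from the portrait to the interface could look like the following sketch; the setting names and default values are invented for illustration:

```python
# Hypothetical mapping from portrait fields to UI settings; names assumed.

def ui_settings(portrait):
    dark = portrait.get("prefers_dark", False)
    return {
        "background": "#1e1e1e" if dark else "#ffffff",
        "foreground": "#e0e0e0" if dark else "#202020",
        "font_size": portrait.get("font_size", 14),
    }
```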
Based on these personalized strategies, the user receives more tailored services, greatly improving the user's satisfaction with, and loyalty to, the service.
Step 5: user portrayal optimization
The multi-modal large model learns automatically from user feedback and continuously optimizes the user portrait. Based on the user's real-time feedback and behavior, whenever new user data are input, the multi-modal large model re-runs the preceding steps, namely step 1 through step 4, to comprehensively analyze the user's latest preferences, personality, and other characteristics, continuously refining the user portrait so that more accurate personalized services can be provided.
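The continuous-optimization loop of step 5 can be approximated by blending newly derived preference scores into the stored portrait, for example with an exponential moving average so that recent behavior dominates; the 0.3 blend factor is an assumed value, not one from the patent.

```python
# Sketch of portrait updating from fresh feedback via an EMA; alpha assumed.

def update_portrait(old_scores, new_scores, alpha=0.3):
    merged = dict(old_scores)
    for topic, s in new_scores.items():
        # weight recent evidence by alpha, retained history by (1 - alpha)
        merged[topic] = (1 - alpha) * merged.get(topic, 0.0) + alpha * s
    return merged
```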
A user personalized service system based on a multi-modal large model comprises a multi-modal data acquisition and model self-adaptive training module, a multi-modal data fusion module, a user portrait construction module, a personalized strategy generation module, and a user portrait optimization module.
The multi-modal data acquisition and model self-adaptive training module collects multi-modal user data through interaction between the user and the multi-modal large model. Specifically, the collected user data span several modality types, including text, voice, image, and video, as follows:
Text modality: mainly the text content entered by the user, including but not limited to search records and text chat records. The multi-modal large model analyzes this content along several dimensions, such as vocabulary, sentence structure, and contextual topic, which can reveal the user's language patterns, preferences, needs, and emotional tendencies.
Voice modality: mainly the user's voice input, such as the audio signal captured by a microphone; speech recognition and voiceprint recognition extract the user's voice characteristics, emotional state, regional dialect, and other information.
Image modality: mainly static picture information provided by the user, such as uploaded personal photos and stickers in chat records; image-processing methods extract features from these images to supplement the user information.
Video modality: mainly video information such as video chats and user footage captured by the camera. Dynamic information in the video, such as facial expressions and body movements, can be analyzed via facial recognition to infer the user's gender and age, behavior habits, and emotional expressions.
The model self-adaptive training of the multi-mode data acquisition and model self-adaptive training module comprises the following steps:
Data preprocessing: the collected multi-modal data are preprocessed, including data cleaning and data normalization; invalid, redundant, and erroneous data are removed from the dataset to guarantee data quality and integrity.
Feature extraction: features are extracted from the preprocessed data. For text data, high-dimensional features such as keywords or phrases and sentiment are extracted; for voice and visual data, multi-level, multi-scale feature representations are extracted with a deep neural network model. The extracted features are stored as input for model training.
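The text branch of feature extraction ("keyword or phrase extraction") can be sketched with simple term-frequency counting after stop-word removal; a production system would use the multi-modal large model or learned embeddings, and the stop-word list here is a toy assumption:

```python
# Toy keyword extraction by term frequency; stop-word list is illustrative.
from collections import Counter

STOP = {"the", "a", "an", "and", "or", "is", "to", "of", "i"}

def keywords(text, top_n=3):
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    counts = Counter(t for t in tokens if t and t not in STOP)
    return [w for w, _ in counts.most_common(top_n)]
```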
Model training: the preprocessed, feature-extracted data are input into the multi-modal large model and trained with algorithms such as supervised learning. The algorithm trains on the training data and measures the difference between the model's predicted values and the true values with a loss function. During training, model parameters are continuously adjusted by optimization algorithms such as gradient descent to minimize the value of the loss function and so obtain the best training result.
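The training step, gradient descent minimizing a loss function, is shown here in miniature on a one-feature linear model with squared-error loss, standing in for the (unspecified) multi-modal large model; learning rate and epoch count are arbitrary choices for the sketch:

```python
# Minimal gradient descent: fit y = w*x + b by minimizing mean squared error.

def train(xs, ys, lr=0.05, epochs=500):
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # gradient of loss = mean((w*x + b - y)^2) w.r.t. w and b
        gw = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / len(xs)
        gb = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / len(xs)
        w -= lr * gw     # adjust parameters against the gradient
        b -= lr * gb
    return w, b
```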
Model testing: part of the data is used to test the model and evaluate its accuracy; if the model does not meet the expected result, it is adjusted and training is repeated until it does.
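The model-testing gate can be expressed as an accuracy check against a held-out set; the 0.9 threshold standing in for the "expected result" is an assumption:

```python
# Evaluation gate: does held-out accuracy meet the expected threshold?

def passes_test(model, test_set, threshold=0.9):
    """model: callable x -> prediction; test_set: list of (x, label) pairs."""
    correct = sum(1 for x, y in test_set if model(x) == y)
    accuracy = correct / len(test_set)
    return accuracy >= threshold, accuracy
```

If the gate fails, the caller would adjust the model and return to the training step, mirroring the loop described above.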
After the multi-modal data fusion module collects and processes data of the different modality types, the multi-modal large model extracts features from each modality's data and fuses these features to establish associations among the modalities, including but not limited to the following:
(1) The association of language patterns with emotion expressions;
(2) The association of language patterns with behavior patterns;
(3) Correlation of behavior patterns with visual information;
(4) Correlation of other heterogeneous modality information;
the method comprises the following steps:
(1) Association of language patterns with emotional expression: a language pattern is the habitual way a user uses language in communication, including the words they choose; emotional expression can be classified into seven states: sadness, anger, happiness, surprise, fear, disgust, and neutral. Language patterns and emotional expression tend to move together: the particular vocabulary or phrasing a user frequently uses in dialogue can reflect a particular emotional state or psychological tendency. The specific associations are as follows:
Sadness: the user uses vocabulary depicting sadness or loss in communication, such as "sad" or "heartbroken", indicating that the user is experiencing sadness or frustration, or shows a dejected emotional tendency toward the occurrence of something.
Anger: the user uses words expressing anger or annoyance in communication, such as "angry" or "furious", indicating that the user is experiencing anger, or shows an angry, irritated emotional tendency toward the occurrence of something.
Happiness: the user uses words expressing pleasure or satisfaction in communication, such as "happy", indicating that the user is experiencing happy, joyful emotions, or shows a pleased emotional tendency toward the occurrence of something.
Surprise: the user uses words expressing surprise or astonishment in communication, such as "wow" or "really?", indicating that the user is experiencing surprise, or shows a surprised emotional tendency toward the occurrence of something.
Fear: the user uses words expressing fear and tension in communication, such as "afraid" or "scared", indicating that the user is experiencing fear, or shows a fearful emotional tendency toward the occurrence of something.
Disgust: the user uses words expressing disgust or dislike in communication, such as "disgusting" or "hate", indicating that the user is experiencing disgust, or shows a disgusted emotional tendency toward the occurrence of something.
Neutral: the user uses neutral vocabulary in communication, such as "okay" or "understood", indicating that the user's current emotional state is calm, or that the occurrence of something causes no significant emotional fluctuation.
Through the language understanding capability of the multi-modal large model, data of the different modality types are input into the model, which automatically captures the correspondence between vocabulary and emotion and establishes the related associations, so that the user's emotional-expression habits and preferences can be located more accurately.
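The vocabulary-to-emotion association can be caricatured as a lexicon lookup over the seven states named above. The English keyword lists are loose stand-ins for the example vocabulary, and a real system would rely on the model's language understanding rather than fixed word lists:

```python
# Minimal lexicon-based emotion tagging; word lists are illustrative only.

LEXICON = {
    "sadness": {"sad", "heartbroken", "down"},
    "anger": {"angry", "furious", "annoyed"},
    "happiness": {"happy", "glad", "delighted"},
    "surprise": {"wow", "unbelievable"},
    "fear": {"afraid", "scared", "terrified"},
    "disgust": {"disgusting", "gross"},
}

def classify_emotion(text):
    tokens = {t.strip(".,!?").lower() for t in text.split()}
    for emotion, words in LEXICON.items():
        if tokens & words:        # any lexicon word present
            return emotion
    return "neutral"              # seventh state: no emotional signal found
```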
(2) Association of language patterns with behavior patterns: a user's behavior patterns are often reflected in language expression; by analyzing characteristics such as the user's speech rate, intonation, and pauses, behavior patterns such as thinking habits and decision tendencies can be inferred. The specific associations are as follows:
Speech rate: generally the speed at which the user speaks, i.e., the number of words spoken per minute, which intuitively reflects the user's mental activity and decision speed. A user who speaks particularly fast is likely quick-thinking and fast-acting, accustomed to a high-pressure, fast-paced environment; conversely, a user who speaks slowly is likely calm and deliberate.
Intonation: intonation refers to the change in pitch when the user speaks, including pitch level and speech intensity, and is often directly related to the user's emotional and psychological state. For example, a user speaking with a high pitch and an excited tone may be experiencing anger or excitement, while a user with a low pitch and a calm voice may be in a calm, subdued, or negative emotional state.
Pause: the length and frequency of pauses are an important reflection of the user's way of thinking, revealing the depth and rhythm of thought. Frequent pauses indicate that the user spends considerable time thinking and judging; such users may pay more attention to the depth and comprehensiveness of matters and tend toward a steady behavior pattern. Users who pause rarely, i.e., almost never stop while speaking, may think faster and decide simply and efficiently; their behavior patterns tend to be bolder and more decisive.
By processing the user's input voice data, the multi-modal large model obtains these voice characteristics and uses them to understand the user's behavior pattern; through this association, the user's thinking habits and decision tendencies can be better understood, so that services and suggestions matching the user's needs can be provided.
(3) Association of behavior patterns with visual information: when a user interacts, whether reading a web page, watching a video, or clicking a link, various behavior patterns are generated, and these are closely related to the visual information the user sees, as follows:
Browsing behavior of the user: a user may show strong interest in or preference for a particular thing or topic while viewing a picture or video, reflected in long gazes at or repeated clicks on the picture. For example, if a user watching a cooking video frequently clicks the detailed description of the ingredients or an enlarged view of a preparation step, this indicates a strong interest in cooking.
Visual elements selected by the user: users make many choices during interaction, and visual preference is reflected in their clicks; content matching that preference receives longer browsing time and deeper interaction. The multi-modal large model can treat these choices as part of the user's behavior pattern and analyze the visual elements involved (e.g., product pictures, activity posters, interface styles). For example, if the pictures a user gazes at or clicks for a long time share a similar color and style, it can be inferred that the user may prefer that color and style.
Through the association analysis of behavior patterns and visual information, the user's interest in or preference for a certain type of content can be obtained, so that content and interaction interfaces closer to the user's interests can be provided during interaction.
(4) Association of other heterogeneous modality information: the same user may exhibit different behavior patterns or preferences across different modalities, including text, voice, image, and video. For example, a user may often discuss technology news in text chats yet prefer sports events in the videos they watch; these two seemingly contradictory pieces of information together reflect the user's multiple interests. Associating heterogeneous modality information can surface the user's complex, multi-dimensional preferences, which is essential for building a comprehensive, in-depth user portrait.
The four associations above are obtained through the large model's autonomous learning from the user data. Through multi-modal data fusion processing, the data of each modality not only cover the user's basic information but also allow the user's behavior patterns, emotional states, and preference habits to be fully understood; the modalities are not isolated but complement one another, forming a comprehensive and in-depth user portrait. This greatly improves the accuracy and practicality of the scheme and provides a more precise and effective basis for the subsequent personalized service strategy.
The user portrait construction module inputs the multi-modal data fusion result into the multi-modal large model, which further generates a detailed user portrait from that result, covering but not limited to user preferences, behavior patterns, emotional state, and language style; the following operations are then performed:
Each user characteristic obtained from the multi-modal data fusion result, such as age, gender, behavior habits, and consumption habits, is labeled.
After feature labeling, the labeled user information is mapped onto the user portrait, so that user information from different modalities is expressed comprehensively, deeply, and in a human-centered way. For example, the user's language pattern in dialogue is converted into an expression of their preferences, or the user's audio features are converted into their probable emotional state.
After the user portrait is constructed, the personalized strategy generation module inputs it into the multi-modal large model as a prompt; through the model's self-learning, personalized services customized to the user portrait are provided after interaction, including but not limited to the following:
(1) Personalized reply;
(2) Personalized recommendation;
(3) A personalized interface;
the method comprises the following steps:
(1) Personalized reply: during dialogue with the user, reply content matching each user's style is generated automatically according to the language patterns and preferences in the user portrait.
For example, if a user uses many specialized terms in conversation or likes to reply with short phrases, this language style is learned and imitated so that the returned replies come closer to the user's habits.
(2) Personalized recommendation: and generating more accurate personalized recommendation strategies according to preference information in the user portrait.
For example, if the user portrait indicates that the user likes outdoor activities, the interactive system will, when providing weather information, actively recommend an outdoor activity suited to that day's weather; if the user's usual diet favors spicy food, spicy dishes will be recommended when providing nearby restaurant information. Beyond stable preferences, time-bound needs can also be captured: if the user has recently been following a particular trending event, information related to that event can be given priority.
(3) Personalized interface: a personalized user interface is customized for the user.
For example, hue, contrast, font size, and style can be adjusted automatically based on the visual preference information in the user portrait: if the user prefers dark-toned designs, a dark-background interface can be provided, delivering a highly personalized visual experience.
Based on these personalized strategies, the user receives more tailored services, greatly improving the user's satisfaction with, and loyalty to, the service.
The user portrait optimization module uses the multi-modal large model to learn automatically from user feedback and optimize the user portrait; based on the user's real-time feedback and behavior, whenever new user data are input, it comprehensively analyzes the user's latest preferences, personality, and other characteristics to continuously optimize the user portrait.
Claims (9)
1. A user personalized service policy based on a multi-modal large model, comprising the steps of:
step 1: multi-mode data acquisition and model self-adaptive training;
step 2: multi-mode data fusion;
step 3: constructing a user portrait;
step 4: generating a personalized strategy;
step 5: user portrayal optimization.
2. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: the step 1 includes the steps of,
step 11: multimodal data acquisition
Through interaction between a user and a multi-modal large model, multi-modal user data are collected, wherein the user data comprise data of multiple modality types including text, voice, image, and video;
step 12: and (5) model self-adaptive training.
3. The user-personalized service policy based on a multimodal big model according to claim 2, wherein said step 12 comprises:
step 121: data preprocessing
Preprocessing the collected multi-modal data, including data cleansing, data normalization, and removing invalid, redundant and erroneous data from the dataset;
step 122: feature extraction
Extracting features of the preprocessed data, extracting keywords or phrases from the text data, extracting multi-level and multi-scale feature representations from the voice and visual data by using a deep neural network model, and storing the extracted features as input of model training;
step 123: training model
Inputting the preprocessed and characteristic extracted data into a multi-mode large model, training by using algorithms such as supervised learning, wherein the algorithms can train according to training data, and the difference between a predicted value and a true value of the model is measured through a loss function;
Step 124: model testing
Using part of the data to perform model test, evaluating the accuracy of the model, and if the model does not meet the expected result, adjusting the model, and returning to step 123; if the performance of the model meets the expected result, then go to the next step.
4. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: step 2 is to collect and process data of different modes, then to use multi-mode big model to extract features from data of each mode, and to fuse these features to establish relevance among modes, including but not limited to the following relevance:
(1) The association of language patterns with emotion expressions;
(2) The association of language patterns with behavior patterns;
(3) Correlation of behavior patterns with visual information;
(4) Association of other heterogeneous modality information.
5. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: step 3 includes inputting the result of the multimodal data fusion into a multimodal big model, from which the multimodal big model further generates detailed user portraits including, but not limited to, user preferences, behavioral patterns, emotional states, and language styles, followed by the following operations:
Step 31: feature labeling
Labeling each user characteristic according to the characteristic result obtained by multi-mode data fusion;
step 32: image filling
The user information marked by the multi-mode large model is mapped to the user portrait, so that the user information from different modes can be expressed comprehensively, deeply and humanizedly.
6. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: step 4 includes inputting the user portrait as a prompt into the multi-modal large model after constructing the user portrait, and providing personalized services customized based on the user portrait after interaction through self-learning of the multi-modal large model, including but not limited to the following:
(1) Personalized reply;
(2) Personalized recommendation;
(3) And personalizing the interface.
7. A multi-modal large model based user personalized service policy as claimed in claim 1 wherein: the step 5 comprises that the multi-mode large model can automatically learn through feedback of the user to optimize the user portrait, and according to real-time feedback and behaviors of the user, when new user data is input, the multi-mode large model comprehensively analyzes the latest preference, individuality and other characteristics of the user according to the steps, and continuously optimizes the user portrait.
8. A user personalized service system based on a multi-mode large model is characterized in that: the system comprises a multi-mode data acquisition and model self-adaptive training module, a multi-mode data fusion module, a user portrait construction module, a personalized strategy generation module and a user portrait optimization module.
9. The multi-modal large model based user personalized service system of claim 8 wherein: the multi-modal data acquisition and model self-adaptive training module collects multi-modal user data through interaction between a user and a multi-modal large model, wherein the user data comprise data of multiple modality types including text, voice, image, and video;
preprocessing the collected multi-modal data, including data cleansing, data normalization, and removing invalid, redundant and erroneous data from the dataset;
extracting features of the preprocessed data, extracting keywords or phrases from the text data, extracting multi-level and multi-scale feature representations from the voice and visual data by using a deep neural network model, and storing the extracted features as input of model training;
inputting the preprocessed and feature extracted data into a multi-mode large model, training by using algorithms such as supervised learning, wherein the algorithms can train according to training data, and the difference between a predicted value and a true value of the model is measured through a loss function, and in training, model parameters can be continuously adjusted through optimization algorithms such as gradient descent, so that the value of the loss function is minimized;
Using part of data to perform model test, evaluating the accuracy of the model, if the model does not meet the expected result, adjusting the model, and returning to perform model training again until the model meets the expected result;
after the multi-mode data fusion module collects and processes the data of different mode types, the multi-mode large model is used for extracting features from the data of each mode, and the features are fused to establish the relevance among the modes, including but not limited to the following relevance:
(1) The association of language patterns with emotion expressions;
(2) The association of language patterns with behavior patterns;
(3) Correlation of behavior patterns with visual information;
(4) Correlation of other heterogeneous modality information;
the user portrait construction module inputs the multi-modal data fusion result into the multi-modal large model, and the multi-modal large model further generates a detailed user portrait from the fusion result, the portrait including but not limited to user preferences, behavior patterns, emotional state, and language style, after which the following operations are carried out:
feature labeling, namely labeling each user feature according to a feature result obtained through multi-mode data fusion;
filling the portraits, namely mapping the user information marked by the multi-mode large model to the user portraits, so that the user information from different modes can be comprehensively, deeply and humanizedly expressed;
The personalized policy generation module inputs the user portrait into the multi-modal large model as a prompt after constructing the user portrait, and provides personalized services customized based on the user portrait after interaction through self-learning of the multi-modal large model, including but not limited to the following contents:
(1) Personalized reply;
(2) Personalized recommendation;
(3) A personalized interface;
the user portrait optimization module utilizes the multi-mode large model to automatically learn through feedback of the user to optimize the user portrait, and comprehensively analyzes the latest preference, individuality and other characteristics of the user according to real-time feedback and behaviors of the user when new user data is input, so as to continuously optimize the user portrait.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311116653.1A CN117235354A (en) | 2023-09-01 | 2023-09-01 | User personalized service strategy and system based on multi-mode large model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117235354A true CN117235354A (en) | 2023-12-15 |
Family
ID=89085354
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117688621A (en) * | 2024-02-02 | 2024-03-12 | 新疆七色花信息科技有限公司 | Traceability adhesive tape, traceability system and traceability method |
History
- 2023-09-01: Application CN202311116653.1A filed in China (CN), published as CN117235354A; legal status: Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10977452B2 (en) | Multi-lingual virtual personal assistant | |
US20210081056A1 (en) | Vpa with integrated object recognition and facial expression recognition | |
US11226673B2 (en) | Affective interaction systems, devices, and methods based on affective computing user interface | |
CN111459290B (en) | Interactive intention determining method and device, computer equipment and storage medium | |
CN109416816B (en) | Artificial intelligence system supporting communication | |
CN105895087B (en) | Voice recognition method and device | |
CN109918650B (en) | Interview intelligent robot device capable of automatically generating interview draft and intelligent interview method | |
US9213558B2 (en) | Method and apparatus for tailoring the output of an intelligent automated assistant to a user | |
US20180352091A1 (en) | Recommendations based on feature usage in applications | |
CN110110169A (en) | Man-machine interaction method and human-computer interaction device | |
US20180129647A1 (en) | Systems and methods for dynamically collecting and evaluating potential imprecise characteristics for creating precise characteristics | |
CN113380271B (en) | Emotion recognition method, system, device and medium | |
Shen et al. | Kwickchat: A multi-turn dialogue system for aac using context-aware sentence generation by bag-of-keywords | |
US10770072B2 (en) | Cognitive triggering of human interaction strategies to facilitate collaboration, productivity, and learning | |
CN117235354A (en) | User personalized service strategy and system based on multi-mode large model | |
Yordanova et al. | Automatic detection of everyday social behaviours and environments from verbatim transcripts of daily conversations | |
Lee et al. | A temporal community contexts based funny joke generation | |
Karpouzis et al. | Induction, recording and recognition of natural emotions from facial expressions and speech prosody | |
Bianchi-Berthouze | Kansei-mining: Identifying visual impressions as patterns in images | |
Hernández et al. | User-centric Recommendation Model for AAC based on Multi-criteria Planning | |
KR20230099936A (en) | A dialogue friends porviding system based on ai dialogue model | |
Wattearachchi et al. | Emotional Keyboard: To Provide Adaptive Functionalities Based on the Current User Emotion and the Context. | |
Xueming et al. | Application of Emotional Voice user Interface in Securities Industry | |
CN117786079A (en) | Dialogue method and system based on self-adaptive learning and context awareness AI proxy | |
Stappen | Multimodal sentiment analysis in real-life videos |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||