CN117828065B - Digital person customer service method, system, device and storage medium - Google Patents

Digital person customer service method, system, device and storage medium

Info

Publication number
CN117828065B
CN117828065B (application CN202410256293.3A)
Authority
CN
China
Prior art keywords
user
input
information
query
image
Prior art date
Legal status
Active
Application number
CN202410256293.3A
Other languages
Chinese (zh)
Other versions
CN117828065A (en)
Inventor
高婷
Current Assignee
Shenzhen Rongcan Big Data Technology Co., Ltd.
Original Assignee
Shenzhen Rongcan Big Data Technology Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Shenzhen Rongcan Big Data Technology Co., Ltd.
Priority to CN202410256293.3A
Publication of CN117828065A
Application granted
Publication of CN117828065B

Landscapes

  • Information Retrieval; Database Structures and File System Structures Therefor (AREA)

Abstract

The invention discloses a digital person customer service method, system, device and storage medium. The method first receives user input, which may be speech, text or an image. For each input type, a dedicated analysis technique is applied to process and interpret the data. Based on the analysis results, the method selects and adjusts the answer strategy and content to generate a personalized answer that fits the user's context, including adjusting the level of detail, the mood and the type of information provided, so that the answer is both relevant and personalized. In addition, the method considers the user's setting preferences, historical interaction behavior or explicit instructions to determine the presentation form best suited to the user. The invention thus provides an efficient, intelligent and flexible digital person customer service method that delivers a highly customized service experience according to the user's specific needs and preferences.

Description

Digital person customer service method, system, device and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a digital person customer service method, system, device, and storage medium.
Background
In the traditional customer service model, enterprises struggle to handle large volumes of customer consultations efficiently while maintaining service quality and response speed. Such services typically depend on human agents, whose main drawbacks include high labor costs, unpredictable response times, and service-quality fluctuations caused by human factors.
With the development of technology, and in particular the advancement of artificial intelligence, digital person customer service systems have become one way to solve these problems. However, existing automated systems often perform poorly when it comes to understanding and processing complex user queries. These systems are typically based on preset scripts and rules, lacking flexibility and the ability to understand the user's needs in depth. For example, when processing natural language input, particularly spoken expressions, these systems have difficulty accurately understanding the intent and emotional state of the user, and thus fail to provide a satisfactory response.
Therefore, there is a strong need to develop a new digital person customer service method.
Disclosure of Invention
The application provides a digital person customer service method, a system, a device and a storage medium, which are used for improving the interactive experience of a user.
The application provides a digital human customer service method, which comprises the following steps:
receiving user input, the user input comprising at least one of the following input types: speech, text or images;
Adopting a corresponding analysis method according to the input type of the received user input to obtain an analysis result, wherein the analysis comprises: for voice input, analyzing the language expression, intonation features and historical interaction data of the user to identify the current emotional state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in the text content, and understanding the query intention of the user by combining historical interaction data; for image input, analyzing the image content and related historical interaction data to identify the query intent and environmental information of the user;
Selecting and adjusting answer strategies and contents according to the analysis results to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the mood or the provided information type of the answers so as to ensure the relevance and the individuality of the answers;
and determining a presentation form acceptable to the user according to the user's setting preferences, historical interaction behavior or explicit instructions, and displaying the personalized answer in that presentation form, wherein the presentation form comprises voice, text or images.
Further, for voice input, analyzing the language expression, intonation features and historical interaction data of the user to identify the current emotional state and the query context of the user comprises:
Preprocessing voice input to obtain preprocessed voice data;
Converting the voice input into a corresponding text of the voice input;
Inputting the preprocessed voice data into a trained voice processing hybrid model to obtain emotion scores and context labels; the voice processing hybrid model comprises a convolutional neural network, a long short-term memory network, an emotion full-connection layer and a context full-connection layer; the convolutional neural network is used for receiving and processing the preprocessed voice data to obtain intonation features in the voice data; the long short-term memory network is used for receiving and processing the intonation features in the voice data, the historical interaction data and the corresponding text of the voice input to obtain feature vectors; the emotion full-connection layer is used for receiving and processing the feature vectors to obtain emotion scores; the context full-connection layer is used for receiving and processing the feature vectors to obtain context labels.
Further, for text input, analyzing vocabulary usage, grammar structure and expression style in text content, and understanding query intention of user by combining with historical interaction data, including:
converting each vocabulary in the text input into a vector with a fixed length by using a pre-trained word embedding model to obtain a vector sequence;
Processing the vector sequence by using a syntactic dependency analysis tool to obtain syntactic structure information;
analyzing the syntactic structure information and the vector sequence by using semantic role labeling technology to obtain semantic role information of each component in the text;
Determining current query information according to the text input, and further determining historical query information according to the historical interaction data; performing text similarity calculation on the current query information and the historical query information, and determining a similarity analysis result;
Based on the vector sequence, the syntax structure information, the semantic role information and the similarity analysis result, an SVM classifier is used for identifying the query intention of the user.
Still further, the analyzing the image content and associated historical interaction data for image input to identify query intent and environmental information of a user includes:
carrying out standardization processing on an image input by a user to obtain standardized image data;
according to the standardized image data, performing feature extraction by utilizing a pre-trained convolutional neural network to obtain an image feature vector;
inputting the image feature vector into a scene recognition model to recognize major scenes and objects in the image;
Analyzing potential query intention of the user by using a recurrent neural network according to the image feature vector, the main scene and object in the image and the historical interaction data of the user;
based on the main scene and object in the image and the potential query intention of the user, the query intention category of the user and the related environmental information label are obtained.
Still further, the selecting and adjusting answer policies and content according to the analysis result to generate personalized answers matched with the user context includes:
For voice input, selecting an answer template according to emotion scores and context labels, and adjusting the detail degree of the mood and the information;
For text input, customizing answer content according to the identified query intention of the user, and selecting proper information types and the detailed degree of the information, wherein the information types comprise explanatory and instructional;
for image input, generating an answer to the environment information label according to the query intention category of the user and the related environment information label, and adjusting the language and the detail degree of the answer to match the query intention category of the user.
The application provides a digital personal customer service device, comprising:
A receiving unit for receiving user input, the user input comprising at least one of the following input types: speech, text or images;
The analysis unit is used for adopting a corresponding analysis method according to the input type of the received user input to obtain an analysis result, wherein the analysis comprises: for voice input, analyzing the language expression, intonation features and historical interaction data of the user to identify the current emotional state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in the text content, and understanding the query intention of the user by combining historical interaction data; for image input, analyzing the image content and related historical interaction data to identify the query intent and environmental information of the user;
A selection unit for selecting and adjusting answer strategies and contents according to the analysis result to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the language or the provided information type of the answers so as to ensure the relevance and the personalization of the answers;
And the determining unit is used for determining a presentation form acceptable to the user according to the user's setting preferences, historical interaction behavior or explicit instructions, and displaying the personalized answer in that presentation form, wherein the presentation form comprises voice, text or images.
The application provides a digital personal customer service system, comprising:
a data receiving and processing framework for collecting input data from a user terminal through an interface, wherein the input data comprises at least one of the following types: speech, text or images; the interface comprises a web interface, a mobile application and a social media platform;
a context manager for tracking and understanding the user's session context, including the user's history of queries, preferences, and their interaction with the system, in order to provide a more consistent and personalized service experience;
the comprehensive analysis engine integrates natural language processing, voice recognition and image recognition technologies and is used for analyzing content input by a user and understanding the emotion state, query intention and environmental information of the user;
the content generation and adjustment module is used for selecting and adjusting answer strategies and content according to the output of the comprehensive analysis engine; the module comprises an answer template library, and can dynamically adjust the detail degree, the mood and the information type of the answer according to the situation of the user so as to generate a customized answer;
The multi-modal answer display platform is used for presenting answers in a suitable form according to the user's device type and preference settings, and the answers comprise voice, text or images.
The present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
receiving user input, the user input comprising at least one of the following input types: speech, text or images;
Adopting a corresponding analysis method according to the input type of the received user input to obtain an analysis result, wherein the analysis comprises: for voice input, analyzing the language expression, intonation features and historical interaction data of the user to identify the current emotional state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in the text content, and understanding the query intention of the user by combining historical interaction data; for image input, analyzing the image content and related historical interaction data to identify the query intent and environmental information of the user;
Selecting and adjusting answer strategies and contents according to the analysis results to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the mood or the provided information type of the answers so as to ensure the relevance and the individuality of the answers;
and determining a presentation form acceptable to the user according to the user's setting preferences, historical interaction behavior or explicit instructions, and displaying the personalized answer in that presentation form, wherein the presentation form comprises voice, text or images.
The beneficial effects of the application include: (1) By analyzing the user's voice, text or image input and combining it with historical interaction data, the application can more accurately understand the user's emotional state, query context and intent. This deep understanding makes the answers not only more relevant but also more targeted, thereby greatly improving customer satisfaction. (2) The application adjusts the answer strategy and content according to the user's specific situation, including adjusting the detail degree and the mood of the answer, thereby providing a highly personalized customer service experience. This personalized answering approach can better meet the user's personal needs and preferences. (3) By considering the user's setting preferences and historical interaction behavior, the application can select the most suitable presentation form (such as voice, text or image) to display the answer, further improving the user experience. This flexible presentation mode makes information communication more intuitive and user-friendly. (4) The application processes diversified user inputs automatically and intelligently, reducing the dependence on human customer service representatives, thereby remarkably improving service efficiency and reducing operating costs.
Drawings
Fig. 1 is a flowchart of a digital personal customer service method according to a first embodiment of the present application.
Fig. 2 is a schematic diagram of a digital personal customer service device according to a second embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. The present application may be embodied in many other forms than those herein described, and those skilled in the art will readily appreciate that the present application may be similarly embodied without departing from the spirit or essential characteristics thereof, and therefore the present application is not limited to the specific embodiments disclosed below.
The first embodiment of the application provides a digital human customer service method. Referring to fig. 1, a schematic diagram of a first embodiment of the present application is shown. The following describes a digital personal customer service method in detail with reference to fig. 1.
Step S101: receiving user input, the user input comprising at least one of the following input types: speech, text or images.
In the digital personal customer service method provided in this embodiment, the core of step S101 is to receive and accurately understand the input of the user, which is the starting point of the whole customer service flow. This step focuses on processing information provided by the user in different forms, ensuring that the system is able to extract key information from these inputs and is ready for subsequent steps.
First, the receipt of user input involves a number of possible input types. These types include, but are not limited to:
1. voice input: the user interacts with the customer service system through voice. In this case, the system first needs to capture and record the user's voice signal. These speech signals are then converted to text for further processing and analysis. The processing of speech input is not just transcription of text, but also analysis of the user's intonation, which helps understand the user's emotional state and intent.
2. Text input: the user interacts with the customer service system by typing text. Text input may come from various channels such as instant messaging, email, or chat windows on websites. In processing text input, the system analyzes the text content, including vocabulary selection, grammar structure, and wording style. Such information helps reveal the needs and desires of the user.
3. Image input: the user may upload an image to express his or her consultation or problem. The processing of image input involves image recognition techniques that can recognize and understand the content in an image, such as objects, scenes, and text. Image analysis can provide additional contextual information about the user's consultation.
Historical interaction data is also taken into account when processing these inputs. The historical data comprises past queries and feedback of the user and response records of the customer service system. These data help build the user's behavioral patterns and preference profile so that the system can better understand the current input content and the intent behind it.
The goal of step S101 is to ensure that the various inputs are accurately captured and resolved, laying a solid foundation for subsequent analysis and response generation. Through a thorough understanding of these inputs, digital personal customer service is able to more accurately identify the needs and desires of the user, thereby providing more accurate and personalized services in the next steps.
Step S102: adopting a corresponding analysis method according to the input type of the received user input to obtain an analysis result, wherein the analysis result comprises the following steps: for voice input, analyzing language expression, intonation features and historical interaction data of a user to identify the current emotion state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in text content, and understanding query intention of a user by combining historical interaction data; for image input, image content and related historical interaction data are analyzed to identify query intent and environmental information of a user.
In step S102, the user's input data is processed by applying the appropriate analysis method, which is a key element in providing an efficient and personalized customer service experience. Each input type requires a specific analysis method to ensure that the most valuable and relevant information is extracted from the user's input.
First, for speech input, the system first converts the user's speech into text, which is the basic role of speech recognition technology. Further analysis is then applied to the transcribed text and the characteristics of the speech itself. The system will analyze the expression of the language, such as the vocabulary and sentence structure used, and features in the speech, such as intonation and intensity. This not only helps to understand what the user has said, but also helps to reveal the emotional state and urgency of the user. For example, a quick, high-pitched voice may indicate a user's urgency or anxiety. At the same time, historical interaction data is also used to provide context to help understand the user's possible intent and needs.
In processing speech input, "emotional state" and "query context" are two key concepts that understand user needs and provide effective customer service response:
Emotional state refers to the emotion expressed by the user in the voice interaction. The emotional state of the user can be identified by various characteristics of the voice, such as the level of intonation, the speed of speaking, the intensity of sound, etc. For example, tremors in the sound of a user speaking may indicate tension or anxiety, while fast and high-pitch speech may indicate urgency or excitement.
Identifying the emotional state of a user is critical to providing a more emotional and personalized customer service experience. For example, if the system detects that the user is frustrated or angry, it may take a more soothing mood, or forward the conversation to a human customer service to better address complex or sensitive situations.
The query context relates to the specific context and context in which the user posed the problem. This includes the user's goals, the particular circumstances of the query, and any relevant context information. For example, a user asking for return policies may be in a context of dissatisfaction after purchase, while a user asking for product functionality may be in a stage of consideration prior to purchase.
Understanding the query context helps the system to more accurately grasp the needs and intent of the user. This is not only based on the current query content of the user, but also includes consideration of the user's past interactions, as well as the particular circumstances they may be in.
In summary, by analyzing the emotional state and the query context in the voice input, the digital personal attendant system can more fully understand the user's needs and provide more accurate and careful services. The method not only improves the quality of customer service experience, but also is beneficial to building the trust and satisfaction of users.
Next, for text entry, the system analyzes the text content submitted by the user. This includes evaluating the selection of vocabulary, grammar structure and style of expression. Text analysis may reveal the needs of users and specific aspects of their queries. For example, formal and detailed text may indicate that the user requires a professional answer. Also, historical interaction data is important to understanding the needs and preferences of the user.
With respect to the query intent in text input, it specifically refers to the goal that a user wants to achieve or the information they attempt to obtain when interacting with the digital personal attendant system using text input. It is the essence of the core demand or problem that the user expresses through text input. Understanding the query intent is a key component of providing an effective and accurate customer service response, especially when processing text input. In this scenario, the query intent may include:
1. Solving a specific problem: the user may ask an explicit question such as "When will my order be shipped?" The intent here is to obtain specific information about the order status.
2. Information seeking: the user may seek general information about a product, service, or topic. For example, the user asks "What features does the latest smartphone have?", indicating an interest in new technology.
3. Seeking advice or recommendations: the user may request a suggestion or recommendation, such as "Which laptop should I choose for office work?" This indicates that they are seeking purchase advice.
4. Feedback or complaints: the user may express satisfaction with a product or service through text, or raise a complaint. For example, "The headphones I purchased have poor sound quality" is a query intent expressing dissatisfaction.
To accurately understand query intent, the digital personal attendant system analyzes the lexical usage, grammatical structure, and expression style of the user's text input. In addition, the ability of understanding the current demands of the user can be further improved by combining the historical interaction data of the user, such as the previous inquiry records and feedback. This comprehensive analysis enables the system to provide a more accurate and personalized response, meeting the specific needs of the user.
Finally, for image input, the system uses image recognition techniques to analyze the submitted image. This may involve identifying objects, scenes and any contained text in the image. Image analysis can provide an intuitive context for user queries, which is critical to understanding the intent of the user. For example, the product pictures uploaded by the user may direct the customer service system to provide relevant product information or support.
The query intent involved in image input refers specifically to the purpose that the user desires to achieve when providing image input or the problem they are attempting to solve. In other words, it is the nature of the core requirements or queries that the user attempts to convey through the image input. Understanding query intent is critical to providing accurate and efficient customer service response. In the context of image input, query intent may include, but is not limited to, the following:
1. Solution of specific problems: the user may upload images related to a particular problem. For example, uploading a photograph of a damaged device, the intent may be to seek repair advice or to learn about warranty policies.
2. Information acquisition: the user may seek specific information through the image. For example, uploading a cover of a book may be intended to know the rating, price, or availability of the book.
3. Guidance of actions: the image uploaded by the user may be for the purpose of acquiring an operating manual or step description. For example, uploading a photograph of assembled furniture may be seeking assembly guidance.
4. Providing opinion or feedback: in some cases, the user may upload the image in order to express satisfaction with the product or service, or to provide feedback. For example, uploading a picture of restaurant food may be to express an assessment of food quality.
Understanding query intent involves not only parsing the content of the image itself, but also combining relevant historical interaction data such as the user's past query habits, preferences, and interaction history. Such comprehensive analysis can help the digital personal customer service system to accurately understand the needs of the user, thereby providing more accurate and personalized answers and solutions.
In this embodiment, the environmental information refers to various information of the physical or contextual environment in which the user is currently located, which is revealed by the image. Such information helps the digital personal customer service system more fully understand the user's query intent, thereby providing more accurate and personalized services. The environmental information includes, but is not limited to, the following:
1. physical environmental characteristics: this may include the location shown in the image, objects, background, etc. For example, if the user uploaded an image of a computer hardware problem in an office environment, the system could identify that the user was in a work-related environment, and infer that the user may require specialized technical support.
2. Scene context: the scene in the image may provide contextual information of the scene in which the user is located. For example, a photograph within a restaurant may suggest that the user is querying for information related to the restaurant.
3. Time and place information: if the image contains a time stamp or geographic tag, this information can help the system understand the time sensitivity or geographic relevance of the user query. For example, an image of a travel location with a geographic tag may indicate that the user is looking for travel information associated with the location.
4. Contextual cues: other elements in the image, such as a person's expression, activity, or other objects, may also provide important cues to help understand the user's current context and needs. For example, a user's distressed expression or a cluttered surrounding environment may indicate an emergency or a particular emotional need.
The digital personal customer service system can read the query intention of the user more accurately by comprehensively analyzing the environmental information, and provides more personalized and context-related response for the user by combining the historical interaction data. The understanding of the environment information not only improves the accuracy of the answer, but also enables customer service experience to be more fit with the actual situation and the requirements of the user.
In all these cases, the analysis approach taken aims at extracting deep meaning and context from the user's input, thereby enabling the customer service system to better understand and respond to the specific needs and desires of the user. Through this in-depth analysis, the system is able to provide more accurate and personalized services to the user, which is critical to improving user satisfaction and efficiency.
Further, for voice input, analyzing the language expression, intonation features and historical interaction data of the user to identify the current emotional state and the query context of the user comprises:
Preprocessing voice input to obtain preprocessed voice data;
Converting the voice input into a corresponding text of the voice input;
Inputting the preprocessed voice data into a trained voice processing hybrid model to obtain emotion scores and context labels; the voice processing hybrid model comprises a convolutional neural network, a long short-term memory network, an emotion full-connection layer and a context full-connection layer; the convolutional neural network is used for receiving and processing the preprocessed voice data to obtain intonation features in the voice data; the long short-term memory network is used for receiving and processing the intonation features in the voice data, the historical interaction data and the corresponding text of the voice input to obtain feature vectors; the emotion full-connection layer is used for receiving and processing the feature vectors to obtain emotion scores; the context full-connection layer is used for receiving and processing the feature vectors to obtain context labels.
First, the user's voice input is received. The voice data needs to be preprocessed to improve data quality and to prepare for subsequent analysis. The preprocessing steps include denoising, volume normalization, and possibly sound segmentation, in order to ensure the clarity and consistency of the speech data. At this stage, these tasks may be performed using various audio processing libraries, such as Python's Librosa library.
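A minimal sketch of this preprocessing stage using Librosa follows; the file path, sampling rate and MFCC settings are illustrative assumptions rather than part of the claimed embodiment.
import librosa
import numpy as np
def preprocess_speech(wav_path, target_sr=16000):
    # Load the recording and resample to a fixed sampling rate
    y, sr = librosa.load(wav_path, sr=target_sr)
    # Trim leading/trailing silence (a simple form of sound segmentation)
    y, _ = librosa.effects.trim(y, top_db=25)
    # Volume normalization to the [-1, 1] range
    y = librosa.util.normalize(y)
    # Example feature extraction: MFCC frames serve as the "preprocessed voice data"
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40).T  # shape: (time steps, 40)
    return mfcc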
The preprocessed speech data is then converted into text. This step is accomplished by speech recognition technology; the user's speech input may be transcribed into text form for subsequent analysis using an existing speech recognition API, such as the Baidu speech recognition API.
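The embodiment relies on an existing speech recognition API. As a stand-in for illustration only (an assumption, not the API named above), the open-source SpeechRecognition package can transcribe a recording in a few lines; the file path and language code are placeholders.
import speech_recognition as sr
def speech_to_text(wav_path, language="zh-CN"):
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = recognizer.record(source)  # read the entire audio file
    # Any recognizer backend could be plugged in here; the Google Web Speech API is the library default
    return recognizer.recognize_google(audio, language=language)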
The pre-processed speech data is then input into a trained hybrid model that incorporates the CNN and LSTM networks, as well as the full connected layers of emotion and context.
1. Convolutional Neural Network (CNN): the CNN first processes the preprocessed voice data to extract intonation features of the voice, such as pitch, intensity, speed, etc. These features are critical to understanding the emotional state of the user. CNNs extract meaningful feature vectors from the original speech signal by applying multiple convolution and pooling layers.
2. Long and short term memory network (LSTM): the resulting speech feature vector is input to the LSTM network along with the speech converted text and the user's historical interaction data. LSTM is capable of processing sequence data, capturing long-term dependencies, which makes it well suited for analyzing complex patterns of speech features and text content. Through LSTM processing, the model is able to generate a comprehensive feature vector reflecting the user's emotional state and the query context. The historical interaction data of the user is another key dimension for understanding the intent of the user's query. This includes the user's past query records, feedback, and the system's response. In the LSTM model, the historical data is used as additional input, so that the model can be helped to better understand what historical background the current query of the user is presented on, and the recognition accuracy of the emotion state of the user and the query context is improved.
3. Emotion full-connection layer: next, the LSTM output feature vector is processed by an emotion full link layer to obtain emotion scores. This fully-connected layer maps feature vectors onto probability distributions of emotional states using an appropriate activation function (e.g., softmax), thereby producing a numerical score reflecting the emotional tendency of the user. For example, the model may be trained to recognize three emotional states, positive, negative, and neutral. The emotion type with the highest probability distribution represents the current emotion tendency of the user, and the corresponding probability value can be converted into emotion score so as to quantify the emotion intensity.
4. Context full connectivity layer: also, a context-full-join layer processes the output feature vector of the LSTM to obtain the context label of the query. The full connectivity layer also applies a softmax activation function, mapping feature vectors onto probability distributions of different query contexts, identifying the specific context and intent of the user query. This layer will output a series of probability distributions associated with possible query intents. Each query intent corresponds to a tag, such as "product query", "service feedback", "emergency help" and the like. The highest probability intent may be selected as the primary context for the user query and converted to a context label.
Through the steps, the emotion state and the query context can be accurately identified from the voice input of the user, and key information is provided for digital people customer service so as to generate more accurate and personalized response.
The following is the reference implementation code of the speech processing hybrid model:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Conv1D, MaxPooling1D, LSTM, Dense, Flatten, Embedding, Concatenate
# Assume that the speech features, text and historical interaction data have been pre-processed into a suitable input format:
# the speech features have shape (time steps, feature dimension), and the text and history data have been converted into index sequences.
# Example placeholder dimensions (illustrative assumptions):
audio_time_steps = 200      # number of frames per utterance (fixed length so that Flatten yields a known dimension)
audio_feature_dim = 40      # dimension of the speech features (e.g. MFCC coefficients per frame)
vocab_size = 10000          # vocabulary size of the text input
history_vocab_size = 10000  # vocabulary size of the historical interaction data
embedding_dim = 128         # embedding vector dimension
num_context_labels = 5      # number of possible query intents
# Speech input processing part
audio_input = Input(shape=(audio_time_steps, audio_feature_dim))
conv1 = Conv1D(filters=32, kernel_size=5, activation='relu')(audio_input)
pool1 = MaxPooling1D(pool_size=4)(conv1)
conv2 = Conv1D(filters=64, kernel_size=5, activation='relu')(pool1)
pool2 = MaxPooling1D(pool_size=4)(conv2)
audio_features = Flatten()(pool2)
# Text input processing part
text_input = Input(shape=(None,))  # None means the text sequence length is variable
embedding = Embedding(input_dim=vocab_size, output_dim=embedding_dim)(text_input)
text_features = LSTM(128)(embedding)
# Historical interaction data processing part
history_input = Input(shape=(None,))
history_embedding = Embedding(input_dim=history_vocab_size, output_dim=embedding_dim)(history_input)
history_features = LSTM(128)(history_embedding)
# Merge features
merged_features = Concatenate()([audio_features, text_features, history_features])
# Emotion fully connected layer
emotion_dense = Dense(64, activation='relu')(merged_features)
emotion_output = Dense(3, activation='softmax', name='emotion_output')(emotion_dense)  # assume 3 emotional states
# Context fully connected layer
context_dense = Dense(64, activation='relu')(merged_features)
context_output = Dense(num_context_labels, activation='softmax', name='context_output')(context_dense)
# Build the model
model = Model(inputs=[audio_input, text_input, history_input], outputs=[emotion_output, context_output])
# Compile the model
model.compile(optimizer='adam',
              loss={'emotion_output': 'categorical_crossentropy', 'context_output': 'categorical_crossentropy'},
              metrics=['accuracy'])
# Print the model structure
model.summary()
This code demonstrates how TensorFlow can be used to construct a hybrid model that receives speech features, text and historical interaction data as input, extracts features through the CNN and LSTM networks, and finally outputs emotion scores and context labels through the full connection layer. Training of the model requires a large amount of annotated data, including emotion tags and query intent tags for speech input, and corresponding optimization and evaluation steps to ensure model performance.
It should be noted that the speech processing hybrid model herein may be trained using a prepared labeled data set; this follows the standard neural network training process in the prior art, and the training process will not be described here.
Further, for text input, analyzing vocabulary usage, grammar structure and expression style in text content, and understanding query intention of user by combining with historical interaction data, including:
converting each vocabulary in the text input into a vector with a fixed length by using a pre-trained word embedding model to obtain a vector sequence;
Processing the vector sequence by using a syntactic dependency analysis tool to obtain syntactic structure information;
analyzing the syntactic structure information and the vector sequence by using semantic role labeling technology to obtain semantic role information of each component in the text;
Determining current query information according to the text input, and further determining historical query information according to the historical interaction data; performing text similarity calculation on the current query information and the historical query information, and determining a similarity analysis result;
Based on the vector sequence, the syntax structure information, the semantic role information and the similarity analysis result, an SVM classifier is used for identifying the query intention of the user.
First, the user's text input is preprocessed, which includes converting each word in the text into a fixed-length vector through a pre-trained word embedding model (e.g., Word2Vec, BERT, etc.). The purpose of this step is to convert natural language into a numerical form that the machine can understand and process, while preserving the semantic information of the vocabulary. A pre-trained word embedding model can capture complex relationships between words, such as synonymy and antonymy, as well as the meaning changes of words in different contexts.
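A minimal sketch of this step using Gensim, assuming a pre-trained Word2Vec file is available at the placeholder path below (both the path and the 300-dimension figure are assumptions):
import numpy as np
from gensim.models import KeyedVectors
# Placeholder path to a pre-trained Word2Vec model (assumption for illustration)
word_vectors = KeyedVectors.load_word2vec_format("word2vec.bin", binary=True)
def text_to_vector_sequence(tokens, dim=300):
    # Out-of-vocabulary words are mapped to zero vectors
    return np.array([word_vectors[w] if w in word_vectors else np.zeros(dim) for w in tokens])
vector_sequence = text_to_vector_sequence(["order", "shipping", "status"])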
The above vector sequence is then processed by a syntactic dependency analysis tool (e.g., spaCy) to obtain the syntactic structure information of the text. Syntactic dependency analysis reveals the dependencies between words in the text, such as subject-predicate structures and modifier relations, which are critical to understanding the structure and intent of a sentence. Through the analysis of the syntactic structure, the function of each sentence component can be accurately identified, and the overall meaning of the sentence can be further understood.
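A minimal spaCy sketch of dependency analysis follows; the model name is an assumption (a Chinese pipeline such as zh_core_web_sm would be chosen for Chinese text), and the printed relations are only indicative.
import spacy
nlp = spacy.load("en_core_web_sm")  # placeholder model; choose one matching the input language
doc = nlp("When will my order be shipped?")
# Each token's dependency relation and its head reveal the syntactic structure
syntactic_info = [(token.text, token.dep_, token.head.text) for token in doc]
print(syntactic_info)  # e.g. [('When', 'advmod', 'shipped'), ('will', 'aux', 'shipped'), ...]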
The syntactic structure information and the vector sequence are then further analyzed using semantic role labeling techniques (e.g., AllenNLP) to obtain the semantic role information of each component in the text. Semantic role labeling focuses on identifying the roles of entities in a sentence and their relationships to the actions, such as who is the initiator of an action (agent), who is the recipient of the action (patient), and the time and place at which the action occurred. This step is critical to understanding the specific intent of the user query in depth.
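A sketch of this step using AllenNLP's predictor interface; the model archive path is a placeholder (an assumption) and must point to a published SRL model archive.
from allennlp.predictors.predictor import Predictor
# Placeholder path/URL to a pre-trained semantic role labeling model archive (assumption)
srl_predictor = Predictor.from_path("structured-prediction-srl-model.tar.gz")
result = srl_predictor.predict(sentence="The customer returned the damaged headphones yesterday.")
# result["verbs"] lists, for each predicate, its tagged arguments (agent, patient, time, etc.)
for verb in result["verbs"]:
    print(verb["verb"], verb["description"])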
In addition, the system may determine historical query information based on the current query information and the user's historical interaction data. By calculating the text similarity between the current query information and the historical query information, a similarity analysis result can be determined. The text similarity may be calculated using a variety of methods, such as cosine similarity, jaccard similarity, etc., in order to find the historical interaction scenario most relevant to the current query, to assist in understanding the user's intent.
Finally, a Support Vector Machine (SVM) classifier is used to identify the query intent of the user based on the vector sequence, the syntactic structure information, the semantic role information, and the similarity analysis result. The SVM is a powerful supervised learning model, and is particularly suitable for text classification tasks. In training an SVM model, training data labeled with query intent needs to be prepared so that the model learns how to distinguish different query intent based on input features.
The following is an example code framework based on Python, illustrating how the above procedure is implemented:
# Assume that the word embedding, syntactic dependency analysis and semantic role labeling steps have already been implemented.
# The following example shows how to integrate these analysis results and apply an SVM for query intent recognition.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Let vectorized_text, syntactic_info and semantic_roles be the outputs of the previous steps,
# and history_queries be the historical query data.
# The implementation here focuses mainly on how to integrate this information and perform query intent recognition.
# Example: integrating features
def integrate_features(vectorized_text, syntactic_info, semantic_roles, history_queries):
    # Logic to implement feature integration.
    # This may involve concatenation of feature vectors, feature selection and conversion, etc.
    # (assumes the three feature arrays have compatible shapes)
    integrated_features = np.concatenate([vectorized_text, syntactic_info, semantic_roles], axis=-1)
    return integrated_features
# Example: text similarity calculation
def calculate_similarity(current_query, history_queries):
    # Vectorize the current query and the historical queries using TF-IDF
    vectorizer = TfidfVectorizer()
    all_queries = [current_query] + history_queries
    tfidf_matrix = vectorizer.fit_transform(all_queries)
    # Calculate the similarity between the current query and the historical queries
    similarity_scores = cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:])
    return similarity_scores
# Example: identifying the query intent using an SVM classifier
def recognize_query_intent(features):
    svm_classifier = SVC(kernel='linear')  # SVM classifier using a linear kernel
    # Assumption: the classifier has already been trained (svm_classifier.fit(X_train, y_train));
    # calling predict() on an unfitted SVC would raise an error in practice.
    query_intent = svm_classifier.predict(features)
    return query_intent
# Integrate features, calculate similarity, and identify the query intent
integrated_features = integrate_features(vectorized_text, syntactic_info, semantic_roles, history_queries)
similarity_scores = calculate_similarity(current_query, history_queries)
# The similarity scores may be appended to the integrated feature set as additional features
query_intent = recognize_query_intent(integrated_features)
# Output the intent of the user query
print("User query intent:", query_intent)
Note that the above codes are only conceptual examples, and implementation of each step needs to be adjusted and optimized according to specific tasks and data sets in practical applications.
To implement an end-to-end system for analyzing user text input and understanding user query intent in combination with historical interaction data, several key steps are required, including data preparation, feature engineering, model training, evaluation, and tuning. The following are specific training steps of this process:
1. Data preparation
Collecting data: a sufficient number of user text input samples are collected that should cover different query intents. At the same time, historical interaction data related to the queries is collected.
Data labeling: the collected user queries are manually annotated, and the actual intention category of each query is determined, such as "product query", "service feedback", "emergency help" and the like.
Data cleaning: the text is cleaned to remove useless characters, punctuation, stop words and the like, and word forms are unified (e.g., through stemming or lemmatization).
2. Feature engineering
Word embedding conversion: each Word in the text input is converted to a fixed length vector using a pre-trained Word embedding model (e.g., word2Vec, or BERT).
Syntactic dependency analysis: and processing the text by using a syntactic dependency analysis tool to acquire syntactic structure information.
Semantic role labeling: based on the syntactic structure information and the vector sequence, semantic role information of each component in the text is obtained by using a semantic role labeling technology.
Historical interaction features: text similarity (e.g., TF-IDF with cosine similarity) is calculated between the current query information and the historical interaction data to obtain similarity features with respect to historical queries.
Feature fusion: all the features (vector sequence, syntax structure, semantic roles and historical similarity) are fused into a comprehensive feature vector for training a model.
3. Model training
Selecting a model: an appropriate machine learning model, such as a Support Vector Machine (SVM), random forest, neural network, or the like, is selected based on the task characteristics.
Training set and test set: the data set is divided into a training set and a test set, ensuring that the two have consistent distributions.
Training the model: the selected model is trained on the training set data, with the fused feature vectors as input and the query intent categories as output.
Parameter adjustment and optimization: the model parameters, learning rate and the like are adjusted according to the model's performance on the training set, and the optimal parameters are searched for through methods such as cross-validation.
4. Evaluation and tuning
Model evaluation: and (3) evaluating the performance of the model by using an independent test set, and focusing on indexes such as accuracy, recall, F1 score and the like.
Error analysis: the case of the model prediction error is analyzed, and the reasons for the error are possibly insufficient characteristics, over-fitting of the model or error data labeling.
And (3) model tuning: based on the evaluation results and the error analysis, the model is optimized, which may include increasing the data amount, adjusting the feature engineering, replacing the model structure, and the like.
5. Deployment and feedback
Model deployment: and deploying the trained model into a production environment, and starting to process the actual user query.
Through the above steps, an NLP system capable of understanding the user's query intention can be constructed and trained; it can process direct query content and, by combining the user's historical interaction data, provide more accurate service.
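As a compact sketch of the feature engineering, training and evaluation steps above (the feature matrix X and label vector y are assumptions standing in for the fused feature vectors and annotated intent categories), scikit-learn can split the data, train the SVM and report the evaluation metrics:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report
# X: fused feature vectors (embeddings + syntax + semantic roles + history similarity), y: labeled intent categories
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)
clf = SVC(kernel='linear', C=1.0)
clf.fit(X_train, y_train)
# Evaluation: accuracy, recall and F1 score per intent category
print(classification_report(y_test, clf.predict(X_test)))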
Still further, the analyzing the image content and associated historical interaction data for image input to identify query intent and environmental information of a user includes:
carrying out standardization processing on an image input by a user to obtain standardized image data;
according to the standardized image data, performing feature extraction by utilizing a pre-trained convolutional neural network to obtain an image feature vector;
inputting the image feature vector into a scene recognition model to recognize major scenes and objects in the image;
analyzing the historical interaction data of the user to identify a behavior pattern and preferences of the user;
Analyzing potential query intention of the user by using a recurrent neural network according to the image feature vector, main scenes and objects in the image and the behavior mode and preference of the user;
based on the main scene and object in the image and the potential query intention of the user, the query intention category of the user and the related environmental information label are obtained.
The following details how the pre-trained convolutional neural network model ResNet is used for scene recognition, taking the Python language and TensorFlow framework as an example.
Step 1: selecting an appropriate pre-training model
Let ResNet be chosen as the base model, since it performs well in image recognition tasks. TensorFlow and Keras provide interfaces that facilitate loading of pre-trained models.
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np
# Load the pre-trained ResNet50 model without its top classification layer,
# so that its feature maps can feed the new scene-recognition head added below
model = ResNet50(weights='imagenet', include_top=False)
Step 2: data set preparation
For scene recognition tasks, a large publicly available image dataset, such as Places or ImageNet, may be used. It is assumed that there is already an image dataset containing multiple scene categories and that each image has been annotated with its scene category. The preparation of the data set includes dividing it into a training set and a test set, image preprocessing, and so on.
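One possible way to prepare such a data set is Keras's ImageDataGenerator, which yields batches of preprocessed, labeled images for training and testing; the directory layout, image size and split ratio below are assumptions for illustration.
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.applications.resnet50 import preprocess_input
datagen = ImageDataGenerator(preprocessing_function=preprocess_input, validation_split=0.2)
# Assumed layout: scene_dataset/<scene_category>/<image files>
train_generator = datagen.flow_from_directory("scene_dataset", target_size=(224, 224),
                                               batch_size=32, class_mode="categorical", subset="training")
test_generator = datagen.flow_from_directory("scene_dataset", target_size=(224, 224),
                                              batch_size=32, class_mode="categorical", subset="validation")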
Step 3: model training and fine tuning
Even though ResNet models have been pre-trained on ImageNet, fine tuning of the model is still required for a particular scene recognition task. This typically involves replacing the top layer of the model to accommodate the new category number and continuing training on the new data set.
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, GlobalAveragePooling2D
# Replace the top layer of the ResNet model
x = model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(num_classes, activation='softmax')(x)  # num_classes is the number of scene categories
# Build the new model
model = Model(inputs=model.input, outputs=predictions)
# Fine-tune the model: freeze all layers except the newly added top layers
for layer in model.layers[:-3]:
    layer.trainable = False
model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
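The actual fine-tuning pass on the scene data set is not shown above. A minimal sketch, assuming the train_generator and test_generator prepared in step 2 (the epoch count is an arbitrary assumption):
# Fine-tune only the newly added top layers on the scene data set
model.fit(train_generator, epochs=5, validation_data=test_generator)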
Step 4: model evaluation
The performance of the fine-tuned model is evaluated using the test set, focusing on indicators such as accuracy and recall.
# Assume test_generator is a data generator for test set
model.evaluate(test_generator)
Step 5: integrating scene recognition results
The results of scene recognition (i.e., the primary scenes and objects in the image) need to be integrated with other parts of the system, such as query intent analysis in conjunction with the user's historical interaction data.
Step 6: deployment model
The trained model is deployed as a service for real-time use by the digital personal customer service system.
By following the steps, not only can the scene recognition model be implemented, but also the scene recognition model can be effectively integrated into a digital personal customer service system so as to improve the accuracy and individuation level of the service.
In combination with the image feature vectors, scene recognition results, and historical interaction data of the user, a Recurrent Neural Network (RNN) is used to analyze the user's potential query intent. The RNN is capable of processing the sequence data, adapting to analyze user behavior data having time-series characteristics, and performing a comprehensive analysis in combination with visual features. Through the RNN model, the system can infer from the comprehensive information the query intent that the user is attempting to express through image input.
To ensure that those skilled in the art can implement the above procedure, i.e. analyze the potential query intent of the user using a Recurrent Neural Network (RNN) in combination with image feature vectors, scene recognition results, and user history interaction data, the following is a more detailed description and implementation steps:
the preparation stage:
1. Extracting image features: feature vectors are extracted from images submitted by users using a pre-trained convolutional neural network (e.g., resNet, VGG, etc.). This step typically involves feeding the image forward through the CNN model, obtaining the output of a certain layer as a feature vector.
2. Scene recognition: the primary scene and objects in the image are identified using a scene recognition model (possibly another pre-trained CNN model), and the recognition result is encoded into a vector form. For example, class labels for scenes and objects may be converted to vectors by one-hot encoding.
3. Historical interaction data processing: the historical interaction data of the user, including past query records, clicking actions and the like, are arranged, converted into a sequence form and encoded into vectors. This may involve processing of time stamps, encoding of behavior types, etc.
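As an illustration of items 2 and 3 above (the label set, behavior encoding and sequence length are illustrative assumptions), the scene/object labels can be one-hot encoded and the interaction history padded into fixed-length index sequences:
import numpy as np
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
scene_classes = ["office", "restaurant", "outdoor", "home"]          # assumed scene label set
scene_vector = to_categorical(scene_classes.index("office"), num_classes=len(scene_classes))
behavior_types = {"query": 1, "click": 2, "feedback": 3}             # assumed behavior-type encoding
history = [["query", "click"], ["query", "feedback", "click"]]       # per-user interaction sequences
history_ids = [[behavior_types[b] for b in seq] for seq in history]
history_vectors = pad_sequences(history_ids, maxlen=20, padding="post")  # fixed-length sequences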
Implementing RNN model:
1. RNN model design: an RNN model is designed to process the sequence data. The input of the model includes the image feature vectors, the scene recognition vectors, and the historical interaction data vectors. RNN variants such as LSTM (long short-term memory) or GRU (gated recurrent unit) can be used here, because they handle long sequence data better and avoid the vanishing gradient problem.
2. Data set preparation: a data set is constructed, each sample containing a set of image feature vectors, scene recognition result vectors and corresponding historical interaction data vectors, and the intent tag of the user query (as a training target). The intention labels are predefined and marked according to actual conditions.
3. Model training: the RNN model is trained using the prepared dataset. Parameters (e.g., learning rate, batch size, etc.) need to be adjusted during training, cross entropy is used as a loss function, and an appropriate optimizer (e.g., adam) is employed.
The following is an example of an implementation using TensorFlow and Keras, assuming that the preparation of image feature vectors, scene recognition result vectors, and historical interaction data vectors has been completed.
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
# Assume that the dimensions of the image feature vector, scene recognition vector, and historical interaction vector are known
input_dim = image_feature_dim + scene_vector_dim + history_vector_dim
# Construct the RNN model
model = Sequential()
model.add(LSTM(128, input_shape=(None, input_dim)))
model.add(Dense(num_intent_classes, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# Train the model
# Assume that X_train and y_train are training data and labels, respectively
model.fit(X_train, y_train, epochs=10, batch_size=32)
# Model prediction
# Assume X_test is the data to be predicted
predictions = model.predict(X_test)
Through the above steps, one skilled in the art can implement an RNN model to analyze the user's potential query intent in combination with image content, scene recognition results, and historical interaction data.
Finally, the system will generate an output containing query intent categories and relevant environmental information tags based on the primary scenes and objects in the image, and the analyzed potential query intent of the user. This output contains both a direct response to the user's query purpose and provides environmental context information related to the query, making the customer service response more accurate and personalized.
How to transition from the analysis stage to the output generation stage is described in detail below. This includes converting the analyzed data into specific query intent categories and environmental information tags, and integrating this information into customer service responses. The detailed implementation steps are as follows:
Organizing the analysis results:
1. Integrating analysis data: By the end of the analysis stage, the image feature vectors, the scene recognition results, the analysis results of the user's historical interaction data, and the user's potential query intent (obtained through RNN model analysis of this information) are already available. First, these analysis results need to be integrated into a structured format, for example by mapping the scene recognition results and the query intent analysis results to specific tags and categories.
2. Defining intent categories and environment tags: based on the service requirement and the actual application scene, a group of query intention categories and environment information labels are predefined. For example, query intent categories may include "product information query", "technical support request", "order status query", etc., and environmental information tags may include "indoor", "outdoor", "city", "nature", etc.
Generating an output tag:
1. Mapping the analysis results to tags: Map the user's potential query intent, obtained from the RNN model analysis, to a predefined query intent category. At the same time, map the result of the scene recognition model to an environmental information tag. This step may involve looking up a list of predefined tags and finding the tag that best matches the analysis result.
2. Output formatting: Format the output content according to the requirements of the customer service system. For example, a JSON object may be generated containing the query intent category and the environmental information tag, and possibly other useful information (e.g., a confidence score); an example and a minimal mapping sketch follow:
output = {
    "query_intent": "product information query",
    "environment_tag": "Indoor",
    "confidence": 0.95
}
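To make the mapping and formatting steps concrete, the following is a minimal sketch; the label lists, the helper name build_output and the example probabilities are assumptions for illustration and would be replaced by the real model outputs and business-defined categories.
import numpy as np

# Hypothetical predefined label lists (assumptions for illustration)
INTENT_LABELS = ["product information query", "technical support request", "order status query"]
ENV_LABELS = ["Indoor", "Outdoor", "City", "Nature"]

def build_output(intent_probs, env_probs):
    # Map model probabilities to predefined labels and format the structured output
    intent_idx = int(np.argmax(intent_probs))
    env_idx = int(np.argmax(env_probs))
    return {
        "query_intent": INTENT_LABELS[intent_idx],
        "environment_tag": ENV_LABELS[env_idx],
        "confidence": float(np.max(intent_probs)),
    }

# Example usage with hypothetical model outputs
output = build_output([0.95, 0.03, 0.02], [0.80, 0.10, 0.05, 0.05])
print(output)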
Integrated into customer service response:
1. Response generation: Based on the generated query intent category and environmental information tag, the customer service system can generate a targeted response. This may involve querying a database for relevant information or generating specific answer templates.
2. Personalized adjustment: Take the user's historical interaction data and preferences into account and personalize the response. For example, if a user frequently queried information about a certain product category in the past, information about that category may be provided preferentially in the response (see the sketch after this list).
3. Feedback loop: After providing the response, collect feedback from the user. This feedback can be used to further optimize the analysis model and the response generation strategy, enabling continuous improvement of the customer service system.
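As a minimal sketch of the personalized adjustment step, assuming a simple history format in which each record carries a product category field (the data structures and names are illustrative):
from collections import Counter

def personalize(candidates, history):
    # Order candidate answer snippets so that categories the user queried most often come first
    freq = Counter(record["category"] for record in history)
    ordered = sorted(candidates, key=lambda category: freq.get(category, 0), reverse=True)
    return [candidates[category] for category in ordered]

# Hypothetical history and candidate snippets
history = [{"category": "smart watch"}, {"category": "smart watch"}, {"category": "earphones"}]
candidates = {"smart watch": "Your watch order ships tomorrow.",
              "earphones": "Earphone firmware v2.1 is available."}
print(personalize(candidates, history))  # watch-related information is offered first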
Through the detailed description and the implementation mode, the analyzed user query intention and environment information can be effectively converted into specific output labels, and the information is integrated into a digital personal customer service system so as to provide more accurate and personalized service response.
Step S103: and selecting and adjusting answer strategies and contents according to the analysis results to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the mood or the provided information type of the answers so as to ensure the relevance and the personalization of the answers.
In step S103, the goal is to carefully select and adjust the answer policy and content to generate a personalized answer that highly matches the user context, based on the analysis results obtained in the previous steps. This step is a vital link in the overall customer service process, as it directly affects the quality and effectiveness of the user experience.
The method for selecting and adjusting the answer strategy and content refers to a mode that the system needs to flexibly change the answer according to the specific requirements and situations of users. This includes, but is not limited to:
1. varying the level of detail of the answer: depending on the user's query intent and understanding capabilities, the system may provide a brief answer or a more detailed explanation. For example, for a technically familiar user, the system may provide more technical details; and for non-professional users, a more concise and easily understood answer may be provided.
2. Adjusting the mood: Based on the analysis of the user's emotional state, the system can adjust its mood to better connect with the user. For example, if the user appears frustrated or dissatisfied, the system may adopt a more empathetic and soothing mood; if the user appears excited or curious, the system may use a more upbeat and enthusiastic mood.
3. Selecting information types: the system will decide which type of information to provide based on the user's query intent and historical interaction data. For example, if a user is seeking a specific instruction, the system may provide a step description; if the user requires general consultation, the system may provide overview or background information.
The core of this step is to understand the user's needs and respond to them in the most appropriate way. The intelligence of the system lies in its ability not only to understand the surface meaning of the text, but also to grasp the user's implicit needs and emotional state, and to adjust the manner and content of the answer accordingly.
Through such personalized adjustment, the digital customer service can provide more relevant and attentive service, improving the accuracy of answers and enhancing user satisfaction and loyalty. This makes the customer service experience more human and better able to meet users' diverse needs.
Still further, the selecting and adjusting answer policies and content according to the analysis result to generate personalized answers matched with the user context includes:
For voice input, selecting an answer template according to the emotion score and context label, and adjusting the mood and the level of detail of the information;
For text input, customizing answer content according to the identified query intention of the user, and selecting proper information types and the detailed degree of the information, wherein the information types comprise explanatory and instructional;
for image input, generating an answer to the environment information label according to the query intention category of the user and the related environment information label, and adjusting the language and the detail degree of the answer to match the query intention category of the user.
For speech input, the processing steps are as follows:
1. Emotion score and context label extraction: first, key information is extracted from emotion scores and context labels derived from a speech processing model. The emotion score helps understand the emotional state of the user, while the context label reveals the specific needs of the user.
2. Selecting an answer template: Select an appropriate template from a preset answer template library according to the emotion score and context label. For example, for users whose emotional state is negative, a more comforting and supportive template is selected.
3. Adjusting the mood and the level of detail: Based on the emotion score, adjust the mood of the answer (comforting, encouraging, positive). Adjust the level of detail of the provided information according to the specific needs indicated by the context label, ensuring the relevance and personalization of the answer.
In order to enable one skilled in the art to more fully understand how to process speech input, including extraction of emotion scores and context labels, selection of answer templates, and adjustment of mood and detail, the following are some specific examples.
Emotion score and context label extraction:
Suppose a user asks by voice input: "Why has my order not arrived yet? I have been waiting for a long time." The speech processing model first converts the speech to text and then performs emotion analysis and context understanding:
Emotion analysis: Analyzing the converted text, "I have been waiting for a long time" expresses the user's anxiety and dissatisfaction. The emotion analysis model may give a negative emotion score, such as 0.2 (assuming the emotion score ranges from 0 (very negative) to 1 (very positive)).
Context label: at the same time, the context understanding module recognizes that the user has asked a query about "order status", and the context label is therefore labeled "order query".
Based on the emotion score and context label, the system needs to select an appropriate template from a preset answer template library:
Comforting and supportive template: Since the emotion score indicates a negative state, the system selects a template intended to comfort the user and provide practical help, such as: "We understand that waiting may be inconvenient for you; let me check the order status for you immediately."
Adjusting the mood and the detail degree:
Adjusting the mood: Based on the negative emotion score, the answer uses a milder and more comforting mood, while expressing understanding of and sympathy for the user's inconvenience.
Adjusting the level of detail: Considering that the context label is "order query", the system decides to provide specific order status information and an estimated resolution time. For example, the answer may be further refined to: "We understand that waiting may feel inconvenient. I have checked your order status, and your package is expected to be delivered tomorrow. Thank you for your patience, and we apologize for any inconvenience caused by the delay." A minimal code sketch of this template selection follows.
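The following is a minimal sketch of this template selection, assuming the emotion score lies in [0, 1] and that the context labels, threshold and template texts are illustrative entries in a preset answer template library:
# Hypothetical answer template library keyed by (emotion polarity, context label)
ANSWER_TEMPLATES = {
    ("negative", "order query"): ("We understand that waiting may be inconvenient for you. "
                                  "Let me check the order status for you immediately."),
    ("positive", "order query"): "Happy to help! Let me pull up your order status.",
}

def select_voice_answer(emotion_score, context_label, detail=None):
    # Pick a template by emotion polarity and context label, then append concrete detail
    polarity = "negative" if emotion_score < 0.5 else "positive"
    answer = ANSWER_TEMPLATES.get((polarity, context_label),
                                  "Let me look into that for you.")
    if detail:  # e.g., order status fetched from a backend system
        answer += " " + detail
    return answer

print(select_voice_answer(0.2, "order query",
                          "Your package is expected to be delivered tomorrow."))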
The processing steps for text input are as follows:
1. Query intent recognition: The analyzed user query intent is at the core of processing text input. This includes what type of information the user wants to obtain (e.g., product information, service support, etc.).
2. Customizing answer content: and customizing the answer content according to the query intention of the user. The information type (illustrative or instructive) is selected and the level of detail of the information is adjusted according to the user's needs. For example, for technical support queries, detailed step descriptions are provided.
3. Information type and detail level adjustment: the answer content is ensured to meet the information requirement of the user and meet the information receiving mode of the preference of the user. This may involve decomposing the answer into more digestible parts or providing additional resource links.
For processing text input, the following is a more specific example to understand how to recognize query intent, customize answer content, and adjust information type and detail.
Assume that a user asks a question via text input: "Does your smartwatch support a waterproof function?" For such input, the query intent recognition step includes:
1. Keyword extraction: first, the system recognizes the keywords "smart watch" and "waterproof function". These keywords help the system locate the specific products and functions that the user is querying.
2. Intent classification: next, the system determines that this is an intent regarding "product function query" through a pre-trained classification model.
Based on the identified query intent "product function query," the system customizes the answer content:
Illustrative answer: The system selects an illustrative answer template that provides explicit information: "Yes, our smart watch is designed with a waterproof function and can be used normally for up to 30 minutes at a depth of 1 meter underwater."
Adjusting the level of detail: Based on the user's query, the system determines that more details may be needed to meet the user's information needs, so additional details are added to the answer: "Furthermore, the waterproof rating is IP68, which is suitable for wearing while swimming, but we do not recommend using it during diving activities."
Providing additional resource links: To offer more comprehensive assistance, the system also provides a link to the product description page: "If you need more detailed information about the smart watch, please visit our product description page." A minimal sketch of the intent classification step follows.
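As a minimal, hedged sketch of the intent classification step for text input, a TF-IDF plus linear SVM pipeline is shown below as a stand-in for the word-embedding and SVM approach described earlier; the training queries and intent labels are assumptions for illustration:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training data: short queries with predefined intent labels
queries = [
    "Does your smartwatch support a waterproof function?",
    "How do I reset my smartwatch?",
    "Where is my order?",
]
intents = ["product function query", "technical support request", "order status query"]

# Simple stand-in pipeline: TF-IDF features + linear SVM classifier
intent_clf = make_pipeline(TfidfVectorizer(), LinearSVC())
intent_clf.fit(queries, intents)

print(intent_clf.predict(["Is the watch waterproof for swimming?"]))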
The processing steps for image input are as follows:
1. Interpreting query intent categories and environmental information tags: Query intent categories and environmental information labels derived from image analysis are critical to generating targeted answers. This information helps the system understand the purpose and context of the user's query through the image.
2. Generating an environment-related answer: for the identified environmental information tags, a relevant answer is generated. For example, if the image is identified as an outdoor scene, the answer may contain information related to the outdoor activity.
3. The tone and detail level of the answer are adjusted: according to the query intention and the environment information of the user, the tone (friendly, professional and instructive) and the detail degree of the answer are adjusted. This requires the system to be able to flexibly adjust the answer strategy according to the actual context of the user.
The processing of image input can be further understood through a detailed example, covering interpretation of the query intent category and environmental information label, generation of an environment-related answer, and adjustment of the mood and level of detail of the answer, as follows.
Assume the user uploads an image through the customer service system showing a photograph of a watch against a bright outdoor background, possibly an outdoor sports scene. The user may want to know about the watch's outdoor performance.
The processing steps comprise:
1. Interpreting query intent categories and environmental information tags:
Image content analysis: the system first analyzes the uploaded image using a pre-trained convolutional neural network model. The model recognizes that the image subject is a "watch" and the background environment is an "outdoor sport".
Query intent category and environmental information tag extraction: based on the image analysis results, the system classifies the query intent category as "product performance query" and the environmental information label as "outdoor sport".
2. Generating an environment-related answer:
Answer content construction: According to the identified query intent category and environmental information label, the system constructs an answer about the watch's performance in an outdoor sports environment. For example: "The watch you selected is particularly suitable for outdoor sports: it meets the IP68 waterproof and dustproof standard, works normally in a variety of outdoor environments, and can even run continuously for up to 30 minutes underwater."
3. The tone and detail level of the answer are adjusted:
Adjusting the mood: Considering that the user may be an outdoor sports enthusiast, the system answers in a friendly, slightly specialized and instructive mood so that the user feels understood and supported with the necessary information.
Adjusting the level of detail: To ensure that the user can obtain enough information to make a purchase decision, the system further introduces other outdoor features of the watch, such as GPS tracking and a barometric altimeter: "In addition, the watch has a built-in GPS tracking function that can record your movement track, and the barometric altimeter helps you monitor elevation changes in real time, making it well suited for hiking and climbing activities." A minimal sketch of this answer assembly follows.
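The following is a minimal sketch of assembling such an environment-related answer from the query intent category and environmental information label; the snippet tables and wording are assumptions for illustration:
# Hypothetical answer fragments keyed by (intent category, environment tag)
BASE_ANSWERS = {
    ("product performance query", "outdoor sport"):
        ("The watch you selected is particularly suitable for outdoor sports and "
         "meets the IP68 waterproof and dustproof standard."),
}
EXTRA_DETAILS = {
    "outdoor sport": "It also has built-in GPS tracking and a barometric altimeter.",
}

def build_image_answer(intent_category, env_tag, detailed=True):
    # Combine a base answer with environment-specific detail when a detailed answer is wanted
    answer = BASE_ANSWERS.get((intent_category, env_tag),
                              "Here is some information about the product in your image.")
    if detailed and env_tag in EXTRA_DETAILS:
        answer += " " + EXTRA_DETAILS[env_tag]
    return answer

print(build_image_answer("product performance query", "outdoor sport"))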
Step S104: and determining a presentation form acceptable to the user according to the setting preference, the historical interaction behavior or the explicit user instruction of the user, and displaying the personalized answer in the presentation form, wherein the presentation form comprises voice, characters or images.
Step S104 is a key element in the digital personal customer service method, which focuses on how to present personalized answers in a manner that is most appropriate for the user. The purpose of this step is to ensure that the user not only obtains the required information, but that this information is provided in the most acceptable and understood form.
In this step, first consideration is given to the preference setting of the user. This may include preferences preset by the user in the system, such as the particular format (voice, text, or image) they choose to receive the information. For example, some users may prefer to obtain information by words, as this allows them to process the information at their own pace; while other users may prefer a voice answer because it is more convenient to obtain information quickly.
The system then considers the user's historical interaction behavior. This includes how the user has interacted with the system in the past and which presentation types produced the best understanding and satisfaction. For example, if a user has in the past responded positively to explanations provided as images, the system may tend to reuse the image form in a similar context.
In addition, the system may also consider explicit indications that the user may provide during the interaction. For example, in a session, a user may specifically want to receive information in some way (e.g., request to send a document or provide a chart).
Based on this information, the system will decide the presentation form that best suits the current context and user needs. For example, if the user is querying a complex operating guide, the system may choose to provide the steps in the form of images or videos that are easier to understand and follow. If the user asks a simple fact question, the system may choose to answer in a concise text form, quickly providing the desired information directly.
Finally, by presenting the personalized answer in a form that is most acceptable to the user, this step not only improves the efficiency of information exchange, but also enhances the satisfaction and engagement of the user. This flexibility and consideration of user preferences are important components in providing a quality customer service experience.
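A minimal sketch of this decision logic is given below, assuming a simple priority of explicit user indication over preset preference over historical behavior; the field names and satisfaction scores are illustrative:
def choose_presentation(explicit_request=None, preference=None, history=None):
    # Pick 'voice', 'text' or 'image' for presenting the answer.
    # Assumed priority: explicit indication in the current session, then the
    # user's preset preference, then the form that worked best historically.
    if explicit_request in ("voice", "text", "image"):
        return explicit_request
    if preference in ("voice", "text", "image"):
        return preference
    if history:
        # e.g., {"voice": 0.6, "text": 0.9, "image": 0.8} - past satisfaction scores
        return max(history, key=history.get)
    return "text"  # simple default

print(choose_presentation(preference="voice"))                   # -> voice
print(choose_presentation(history={"text": 0.7, "image": 0.9}))  # -> image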
In the above embodiment, a digital person customer service method is provided; correspondingly, the application also provides a digital person customer service device. Please refer to Fig. 2, which is a schematic diagram of an embodiment of a digital person customer service device according to the present application. Since this embodiment, i.e. the second embodiment, is substantially similar to the method embodiment, the description is relatively brief, and reference should be made to the description of the method embodiment for the relevant points. The device embodiment described below is merely illustrative.
The digital personal customer service device provided by the second embodiment of the application comprises:
A receiving unit 201, configured to receive a user input, where the user input includes at least one of the following input types: speech, text or images;
The analysis unit 202 is configured to obtain an analysis result by adopting a corresponding analysis method according to the input type of the received user input, where the analysis result includes: for voice input, analyzing language expression, intonation features and historical interaction data of a user to identify the current emotion state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in text content, and understanding query intention of a user by combining historical interaction data; for image input, analyzing image content and related historical interaction data to identify query intent and environmental information of a user;
A selection unit 203 for selecting and adjusting answer strategies and contents according to the analysis result to generate personalized answers matching with the user context, wherein the adjustment includes changing the detail level, the language or the provided information type of the answers to ensure the relevance and the personalization of the answers;
A determining unit 204, configured to determine a presentation form acceptable to the user according to the setting preference, the historical interaction behavior or the explicit user indication of the user, and display the personalized answer in the presentation form, where the presentation form includes voice, text or image.
The third embodiment of the application provides an embodiment of a digital personal customer service system. Since this embodiment, i.e., the third embodiment, is substantially similar to the method embodiment, the description is relatively simple, and reference will be made to the description of the method embodiment for relevant points. The system embodiments described below are merely illustrative.
The digital personal customer service system comprises:
a data receiving and processing framework for collecting input data from a user terminal through an interface, wherein the input data comprises at least one of the following types: speech, text or images; the interface comprises a web interface, a mobile application and a social media platform;
a context manager for tracking and understanding the user's session context, including the user's history of queries, preferences, and their interaction with the system, in order to provide a more consistent and personalized service experience;
the comprehensive analysis engine integrates natural language processing, voice recognition and image recognition technologies and is used for analyzing content input by a user and understanding the emotion state, query intention and environmental information of the user;
the content generation and adjustment module is used for selecting and adjusting answer strategies and content according to the output of the comprehensive analysis engine; the module comprises an answer template library, and can dynamically adjust the detail degree, the mood and the information type of the answer according to the situation of the user so as to generate a customized answer;
The multi-mode answer display platform is used for presenting answers in a proper form according to the type of equipment and preference settings of a user, and the answers comprise voice, characters or images.
The aim of the present embodiment is to create an efficient system that can understand and meet the needs of the user while providing a personalized and consistent service experience. The system implements a full-flow service from receiving user input to providing customized answers.
The system first gathers user input through a comprehensive data receiving and processing framework. Whether the user interacts through a web interface, mobile application, or social media platform, the framework can receive the user's voice, text, or image data. This flexibility ensures that users can interact seamlessly with the customer service system on different platforms and devices. To achieve this, APIs and SDKs need to be developed or integrated, which can work across platforms, collect and normalize data in different formats, ready for subsequent analysis.
The role of the context manager is to understand the session context of the user. This includes not only the content about the current session, but also the user's history of queries, preference settings, and interaction history with the system. In this way, the system can provide a more personalized and consistent service experience. For example, if a user has queried for specific product information in the past, the system may provide relevant recommendations or warnings in future queries. To implement this function, it is necessary to design a database architecture that is capable of storing, retrieving and analyzing user data, and to ensure that these operations comply with data protection regulations.
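A minimal sketch of such a context manager is shown below, with an in-memory store standing in for the database layer described above; the field names and methods are assumptions for illustration:
from collections import defaultdict

class ContextManager:
    # In-memory stand-in for the user context store (a real system would use a database)

    def __init__(self):
        self._store = defaultdict(lambda: {"queries": [], "preferences": {}})

    def record_query(self, user_id, query, intent):
        self._store[user_id]["queries"].append({"query": query, "intent": intent})

    def set_preference(self, user_id, key, value):
        self._store[user_id]["preferences"][key] = value

    def get_context(self, user_id, last_n=5):
        ctx = self._store[user_id]
        return {"recent_queries": ctx["queries"][-last_n:], "preferences": ctx["preferences"]}

cm = ContextManager()
cm.record_query("u1", "Is the watch waterproof?", "product function query")
cm.set_preference("u1", "presentation", "text")
print(cm.get_context("u1"))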
The comprehensive analysis engine is the core of the system and is responsible for analyzing the content input by the user and understanding the emotion state, the query intention and the environmental information of the user. This engine integrates natural language processing, speech recognition, and image recognition technologies, and is capable of processing multiple types of input data. Implementing this engine requires the use of machine learning and deep learning frameworks such as TensorFlow or PyTorch, and pre-trained models such as BERT (for NLP) or ResNet (for image recognition).
The content generation and adjustment module selects and adjusts the appropriate answer strategy and content based on the output of the comprehensive analysis engine. This includes selecting an appropriate template from a library of answer templates and dynamically adjusting the level of detail, mood and information type of the answer based on the user context. When developing this module, it is necessary to create a rich template library and develop algorithms that select and adjust templates according to the analysis results.
Finally, the multimodal answer presentation platform is responsible for presenting the answers in the form of user preferences. This means that the system needs to be able to decide to present the answer in the form of speech, text or images, depending on the type of device and settings used by the user. Implementing this platform requires the development of a front-end system that can identify the user's device type and preferences and select the best presentation mode accordingly.
Through the integration of the modules and services, the digital personal customer service system can provide an efficient, flexible and user-friendly service platform, so that not only can the requirements of users be understood, but also highly personalized service experience can be provided.
A fourth embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
receiving user input, the user input comprising at least one of the following input types: speech, text or images;
Adopting a corresponding analysis method according to the input type of the received user input to obtain an analysis result, wherein the analysis result comprises the following steps: for voice input, analyzing language expression, intonation features and historical interaction data of a user to identify the current emotion state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in text content, and understanding query intention of a user by combining historical interaction data; for image input, analyzing image content and related historical interaction data to identify query intent and environmental information of a user;
Selecting and adjusting answer strategies and contents according to the analysis results to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the mood or the provided information type of the answers so as to ensure the relevance and the individuality of the answers;
and determining a presentation form acceptable to the user according to the setting preference, the historical interaction behavior or the explicit user instruction of the user, and displaying the personalized answer in the presentation form, wherein the presentation form comprises voice, characters or images.
While the application has been described in terms of preferred embodiments, it is not intended to be limiting, but rather, it will be apparent to those skilled in the art that various changes and modifications can be made herein without departing from the spirit and scope of the application as defined by the appended claims.

Claims (5)

1. A digital human customer service method, comprising:
receiving user input, the user input comprising at least one of the following input types: speech, text or images;
Adopting a corresponding analysis method according to the input type of the received user input to obtain an analysis result, wherein the analysis result comprises the following steps: for voice input, analyzing language expression, intonation features and historical interaction data of a user to identify the current emotion state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in text content, and understanding query intention of a user by combining historical interaction data; for image input, analyzing image content and related historical interaction data to identify query intent and environmental information of a user;
Selecting and adjusting answer strategies and contents according to the analysis results to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the mood or the provided information type of the answers so as to ensure the relevance and the individuality of the answers;
determining a presentation form acceptable to a user according to the setting preference, the historical interaction behavior or the explicit user indication of the user, and displaying the personalized answer in the presentation form, wherein the presentation form comprises voice, characters or images;
For voice input, the method analyzes language expression, intonation characteristics and historical interaction data of the user to identify the current emotion state and query context of the user, and comprises the following steps:
Preprocessing voice input to obtain preprocessed voice data;
Converting the voice input into a corresponding text of the voice input;
Inputting the preprocessed voice data into a trained voice processing hybrid model to obtain an emotion score and a context label; the voice processing hybrid model comprises a convolutional neural network, a long short-term memory network, an emotion fully-connected layer and a context fully-connected layer; the convolutional neural network is used for receiving and processing the preprocessed voice data to obtain intonation features in the voice data; the long short-term memory network is used for receiving and processing the intonation features in the voice data, the historical interaction data and the text corresponding to the voice input to obtain a feature vector; the emotion fully-connected layer is used for receiving and processing the feature vector to obtain the emotion score; the context fully-connected layer is used for receiving and processing the feature vector to obtain the context label;
For text input, analyzing vocabulary usage, grammar structure and expression style in text content, and understanding query intention of user by combining historical interaction data, comprising:
converting each vocabulary in the text input into a vector with a fixed length by using a pre-trained word embedding model to obtain a vector sequence;
Processing the vector sequence by using a syntactic dependency analysis tool to obtain syntactic structure information;
analyzing the syntactic structure information and the vector sequence by using semantic role labeling technology to obtain semantic role information of each component in the text;
Determining current query information according to the text input, and further determining historical query information according to the historical interaction data; performing text similarity calculation on the current query information and the historical query information, and determining a similarity analysis result;
Identifying a query intention of a user using an SVM classifier based on the vector sequence, the syntax structure information, the semantic role information, and the similarity analysis result;
the analyzing of image content and associated historical interaction data for image input to identify query intent and environmental information of a user includes:
carrying out standardization processing on an image input by a user to obtain standardized image data;
according to the standardized image data, performing feature extraction by utilizing a pre-trained convolutional neural network to obtain an image feature vector;
inputting the image feature vector into a scene recognition model to recognize major scenes and objects in the image;
Analyzing potential query intention of the user by using a recurrent neural network according to the image feature vector, the main scene and object in the image and the historical interaction data of the user;
based on the main scene and object in the image and the potential query intention of the user, the query intention category of the user and the related environmental information label are obtained.
2. The digital personal customer service method according to claim 1, wherein the selecting and adjusting answer policies and content to generate personalized answers matching with user context based on the analysis results comprises:
For voice input, selecting an answer template according to the emotion score and context label, and adjusting the mood and the level of detail of the information;
For text input, customizing answer content according to the identified query intention of the user, and selecting proper information types and the detailed degree of the information, wherein the information types comprise explanatory and instructional;
for image input, generating an answer to the environment information label according to the query intention category of the user and the related environment information label, and adjusting the language and the detail degree of the answer to match the query intention category of the user.
3. A digital personal attendant apparatus, comprising:
A receiving unit for receiving user input, the user input comprising at least one of the following input types: speech, text or images;
The analysis unit is used for obtaining an analysis result by adopting a corresponding analysis method according to the input type of the received user input, and comprises the following steps: for voice input, analyzing language expression, intonation features and historical interaction data of a user to identify the current emotion state and query context of the user; for text input, analyzing vocabulary use, grammar structure and expression style in text content, and understanding query intention of a user by combining historical interaction data; for image input, analyzing image content and related historical interaction data to identify query intent and environmental information of a user;
A selection unit for selecting and adjusting answer strategies and contents according to the analysis result to generate personalized answers matched with the user situation, wherein the adjustment comprises changing the detail degree, the language or the provided information type of the answers so as to ensure the relevance and the personalization of the answers;
The determining unit is used for determining a presentation form acceptable to the user according to the setting preference, the historical interaction behavior or the explicit user indication of the user and displaying the personalized answer in the presentation form, wherein the presentation form comprises voice, characters or images;
Wherein, the analysis unit is specifically used for:
Preprocessing voice input to obtain preprocessed voice data;
Converting the voice input into a corresponding text of the voice input;
Inputting the preprocessed voice data into a trained voice processing hybrid model to obtain an emotion score and a context label; the voice processing hybrid model comprises a convolutional neural network, a long short-term memory network, an emotion fully-connected layer and a context fully-connected layer; the convolutional neural network is used for receiving and processing the preprocessed voice data to obtain intonation features in the voice data; the long short-term memory network is used for receiving and processing the intonation features in the voice data, the historical interaction data and the text corresponding to the voice input to obtain a feature vector; the emotion fully-connected layer is used for receiving and processing the feature vector to obtain the emotion score; the context fully-connected layer is used for receiving and processing the feature vector to obtain the context label;
The analysis unit is further configured to:
converting each vocabulary in the text input into a vector with a fixed length by using a pre-trained word embedding model to obtain a vector sequence;
Processing the vector sequence by using a syntactic dependency analysis tool to obtain syntactic structure information;
analyzing the syntactic structure information and the vector sequence by using semantic role labeling technology to obtain semantic role information of each component in the text;
Determining current query information according to the text input, and further determining historical query information according to the historical interaction data; performing text similarity calculation on the current query information and the historical query information, and determining a similarity analysis result;
Identifying a query intention of a user using an SVM classifier based on the vector sequence, the syntax structure information, the semantic role information, and the similarity analysis result;
The analysis unit is further configured to:
carrying out standardization processing on an image input by a user to obtain standardized image data;
according to the standardized image data, performing feature extraction by utilizing a pre-trained convolutional neural network to obtain an image feature vector;
inputting the image feature vector into a scene recognition model to recognize major scenes and objects in the image;
Analyzing potential query intention of the user by using a recurrent neural network according to the image feature vector, the main scene and object in the image and the historical interaction data of the user;
based on the main scene and object in the image and the potential query intention of the user, the query intention category of the user and the related environmental information label are obtained.
4. A digital personal customer service system, comprising:
A data receiving and processing framework for collecting input data from a user terminal through an interface, wherein the input data comprises at least one of the following types: speech, text or images; the interface comprises a web interface, a mobile application or a social media platform;
a context manager for tracking and understanding the user's session context, including the user's history of queries, preferences, and their interaction with the system, in order to provide a more consistent and personalized service experience;
the comprehensive analysis engine integrates natural language processing, voice recognition and image recognition technologies and is used for analyzing content input by a user and understanding the emotion state, query intention and environmental information of the user;
the content generation and adjustment module is used for selecting and adjusting answer strategies and content according to the output of the comprehensive analysis engine; the module comprises an answer template library, and can dynamically adjust the detail degree, the mood and the information type of the answer according to the situation of the user so as to generate a customized answer;
the multi-mode answer display platform is used for presenting answers in a proper form according to the equipment type and preference setting of a user, wherein the answers comprise voice, characters or images;
Wherein, the comprehensive analysis engine is specifically configured to:
Preprocessing voice input to obtain preprocessed voice data;
Converting the voice input into a corresponding text of the voice input;
Inputting the preprocessed voice data into a trained voice processing hybrid model to obtain an emotion score and a context label; the voice processing hybrid model comprises a convolutional neural network, a long short-term memory network, an emotion fully-connected layer and a context fully-connected layer; the convolutional neural network is used for receiving and processing the preprocessed voice data to obtain intonation features in the voice data; the long short-term memory network is used for receiving and processing the intonation features in the voice data, the historical interaction data and the text corresponding to the voice input to obtain a feature vector; the emotion fully-connected layer is used for receiving and processing the feature vector to obtain the emotion score; the context fully-connected layer is used for receiving and processing the feature vector to obtain the context label;
The comprehensive analysis engine is further configured to:
converting each vocabulary in the text input into a vector with a fixed length by using a pre-trained word embedding model to obtain a vector sequence;
Processing the vector sequence by using a syntactic dependency analysis tool to obtain syntactic structure information;
analyzing the syntactic structure information and the vector sequence by using semantic role labeling technology to obtain semantic role information of each component in the text;
Determining current query information according to the text input, and further determining historical query information according to the historical interaction data; performing text similarity calculation on the current query information and the historical query information, and determining a similarity analysis result;
Identifying a query intention of a user using an SVM classifier based on the vector sequence, the syntax structure information, the semantic role information, and the similarity analysis result;
The comprehensive analysis engine is further configured to:
carrying out standardization processing on an image input by a user to obtain standardized image data;
according to the standardized image data, performing feature extraction by utilizing a pre-trained convolutional neural network to obtain an image feature vector;
inputting the image feature vector into a scene recognition model to recognize major scenes and objects in the image;
Analyzing potential query intention of the user by using a recurrent neural network according to the image feature vector, the main scene and object in the image and the historical interaction data of the user;
based on the main scene and object in the image and the potential query intention of the user, the query intention category of the user and the related environmental information label are obtained.
5. A computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of any one of claims 1-2.
CN202410256293.3A 2024-03-06 2024-03-06 Digital person customer service method, system, device and storage medium Active CN117828065B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410256293.3A CN117828065B (en) 2024-03-06 2024-03-06 Digital person customer service method, system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410256293.3A CN117828065B (en) 2024-03-06 2024-03-06 Digital person customer service method, system, device and storage medium

Publications (2)

Publication Number Publication Date
CN117828065A CN117828065A (en) 2024-04-05
CN117828065B true CN117828065B (en) 2024-05-03

Family

ID=90524480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410256293.3A Active CN117828065B (en) 2024-03-06 2024-03-06 Digital person customer service method, system, device and storage medium

Country Status (1)

Country Link
CN (1) CN117828065B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829757A (en) * 2018-05-28 2018-11-16 广州麦优网络科技有限公司 A kind of intelligent Service method, server and the storage medium of chat robots
CN114464180A (en) * 2022-02-21 2022-05-10 海信电子科技(武汉)有限公司 Intelligent device and intelligent voice interaction method
CN117312499A (en) * 2023-10-25 2023-12-29 中国烟草总公司天津市公司 Big data analysis system and method based on semantics


Also Published As

Publication number Publication date
CN117828065A (en) 2024-04-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant