WO2020098249A1

WO2020098249A1 - Electronic device, response conversation technique recommendation method and computer readable storage medium

Info

Publication number: WO2020098249A1
Application number: PCT/CN2019/088635
Authority: WO
Inventors: 李茂辉; 张发友
Original assignee: 平安科技（深圳）有限公司
Priority date: 2018-11-12
Filing date: 2019-05-27
Publication date: 2020-05-22
Also published as: CN109587360A; CN109587360B

Abstract

The present application relates to semantic recognition technology, and disclosed thereby is a response conversation technique recommendation method. The method comprises: after receiving the incoming line of a customer, acquiring a voice stream of the incoming line customer and an operator in real time; extracting a first audio feature vector of a first voice segment in the voice stream, wherein the first voice segment corresponds to a voice segment inputted by the customer in the voice stream; inputting the first audio feature vector into a preset voice emotion analysis model for analysis, and recognizing first voice content and a first emotion classification corresponding to the first audio feature vector; inputting the recognized first voice content and the first emotion classification into a pre-trained response conversation technique recommendation model for analysis to obtain a recommended response conversation technique, and sending the recommended response conversation technique to a terminal of the operator in real time. Also disclosed by the present application are an electronic device and a computer-readable storage medium. The technical solution of the present application effectively ameliorates the condition of operators not knowing how to respond to customers, which reduces customer complaints and customer attrition.

Description

Electronic device, recommended method and computer readable storage medium

Priority declaration

This application is based on the declaration of the Paris Convention and enjoys the priority of the Chinese patent application with the application number CN201811340705.2 and the name "Electronic device, recommended method and computer readable storage medium" submitted on November 12, 2018. The entire content of is incorporated into this application by reference.

Technical field

The present application relates to the field of call center customer service systems, and in particular, to an electronic device, a method for responding to conversations, and a computer-readable storage medium.

Background technique

At present, in the call center customer service system, after assigning agents to serve incoming customers, some agents may conflict with agents due to their own emotional problems in the process of serving customers. In this case, many inexperienced agents do not know How to effectively deal with customers, resulting in customer complaints or customer churn and other consequences.

Summary of the invention

The main purpose of the present application is to provide an electronic device, a method for recommending communication technology, and a computer-readable storage medium, designed to help agents deal with customers and improve customer complaints and customer churn.

In order to achieve the above object, the electronic device proposed in the first aspect of the present application, the electronic device includes a memory and a processor, and the memory stores a recommended program that can be run on the processor. The following procedure is implemented when the technical recommendation program is executed by the processor:

A1. After receiving the incoming customer line, obtain the voice streams of the incoming customer and agent in real time;

B1. Perform feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to a voice segment input by a customer in the voice stream ;

C1. Input the first audio feature vector into a preset voice emotion analysis model for analysis, and identify the first voice content and the first emotion classification corresponding to the first audio feature vector;

D1, input the recognized first speech content and the first sentiment classification into a pre-trained dialogue-recommended recommendation model for analysis, to obtain a recommended dialogue-reporting technique, and put the recommended dialogue-reported technique in real time Sent to the terminal of the agent.

The second aspect of the present application also proposes a method for recommending conversational techniques, including the following steps:

A2. After receiving the incoming customer line, obtain the voice streams of the incoming customer and agent in real time;

B2. Perform feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to a voice segment input by a customer in the voice stream ;

C2. The first audio feature vector is input into a preset voice emotion analysis model for analysis, and the first speech content corresponding to the first audio feature vector and the first emotion classification are identified;

D2. The recognized first speech content and the first emotion classification are input into a pre-trained conversational dialogue recommendation model for analysis to obtain a recommended conversational dialogue, and the recommended conversational dialogue is real-time Sent to the terminal of the agent.

A third aspect of the present application also provides a computer-readable storage medium that stores a computer-speaking recommendation program that can be executed by at least one processor so that the at least one A processor performs the following steps:

After receiving the incoming customer line, obtain the voice streams of the incoming customer and agent in real time;

Performing feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to a voice segment input by a customer in the voice stream;

Input the first audio feature vector into a preset voice emotion analysis model for analysis, and identify the first voice content and the first emotion classification corresponding to the first audio feature vector;

Input the recognized first speech content and the first sentiment classification into a pre-trained conversational recommendation model for analysis to obtain a recommended conversational conversation, and send the recommended conversational conversation in real time to The terminal of the agent.

In the technical solution of the present application, the voice stream of the incoming client and the agent is obtained in real time when the customer talks to the agent after entering the line, and the first audio feature vector corresponding to the voice segment input by the customer is extracted from the obtained voice stream. The first audio feature vector is input into a preset speech emotion analysis model for analysis and recognition, the first speech content and the first emotion classification corresponding to the first audio feature vector are identified, and finally the pre-trained response model is used for recognition The first voice content and the first sentiment classification are analyzed, and the recommended response should be sent to the agent ’s terminal in real time for reference by the agent to help the agent deal with the customer; thus, the customer ’s own emotions are effectively improved. When the problem conflicts with the agent, the agent does not know how to deal with the customer's situation, reducing customer complaints and customer churn.

BRIEF DESCRIPTION

In order to more clearly explain the embodiments of the present application or the technical solutions in the prior art, the following will briefly introduce the drawings required in the embodiments or the description of the prior art. Obviously, the drawings in the following description are only These are some embodiments of the present application. For those of ordinary skill in the art, without paying any creative work, other drawings can be obtained according to the structures shown in these drawings.

FIG. 1 is a schematic flowchart of an embodiment of a method for recommending a conversational application;

FIG. 2 is a schematic flowchart of the second embodiment of the application-recommended method for application of dialogue;

FIG. 3 is a schematic flow chart of three embodiments of the recommended method of application of dialogue technology;

FIG. 4 is a schematic diagram of the operating environment of the preferred embodiment of the application recommended procedure of the application;

FIG. 5 is a program module diagram of an embodiment of the application-recommended procedure of the application;

FIG. 6 is a program module diagram of the second embodiment of the recommended procedure for application of dialogue technology;

FIG. 7 is a program module diagram of three embodiments of the application-recommended application procedure.

The implementation, functional characteristics and advantages of the present application will be further described in conjunction with the embodiments and with reference to the drawings.

detailed description

The following describes the principles and features of the present application with reference to the drawings. The examples given are only used to explain the present application, not to limit the scope of the present application.

This application proposes a recommended method of dialogue technology, which is mainly used in the call center customer service system.

As shown in FIG. 1, FIG. 1 is a schematic flowchart of an embodiment of a method for recommending a conversational application in this application.

In this embodiment, the method for recommending conversational techniques includes:

Step S10: After receiving the incoming customer line, obtain the voice streams of the incoming customer and the agent in real time.

When a customer enters the line and connects with the agent, the call center customer service system obtains in real time the voice stream of the incoming customer and agent (that is, the interactive audio stream between the customer and the agent) through the telephone voice platform.

Step S20: Perform feature extraction on the voice stream to extract the first audio feature vector of the first voice segment in the voice stream, where the first voice segment corresponds to the voice input by the customer in the voice stream segment.

The call center customer service system performs feature extraction on the currently acquired voice stream, and extracts the first audio feature vector corresponding to the voice segment input by the customer in the voice stream. The first audio feature vector may be one or more of the following audio features: energy feature, pronunciation frame feature, pitch frequency feature, formant feature, harmonic noise ratio feature, and Mel cepstrum coefficient feature.

In step S30, the first audio feature vector is input into a preset voice emotion analysis model for analysis, and the first voice content and the first emotion classification corresponding to the first audio feature vector are identified.

The call center customer service system has a preset voice emotion analysis model. After extracting the first audio feature vector from the currently acquired voice stream, the call center customer service system inputs the first audio feature vector into the preset voice emotion analysis Perform analysis and recognition in the model to identify the first speech content corresponding to the first audio feature vector (that is, the text content corresponding to the speech segment input by the customer in the acquired speech stream) and the first emotion classification (that is, the customer's Emotion classification). In this embodiment, the preset voice sentiment analysis model preferably adopts a sentiment analysis model including a DNN-HMM acoustic model, an n-gram language model and a wfst weighted graph, and the first audio feature vector is preferably a melody cepstrum coefficient feature vector Of course, the preset voice sentiment analysis model may also be other sentiment analysis models. The first emotion classification includes: satisfaction, calm, irritability, and anger, etc.

Step S40, the recognized first speech content and the first emotion classification are input into a pre-trained conversational dialogue recommendation model for analysis, to obtain a recommended conversational dialogue, and the recommended conversational dialogue Send to the agent's terminal in real time.

The customer service system in the call center also has a pre-trained recommendation model for conversational skills. When the first speech feature and the first emotion classification corresponding to the first audio feature vector are identified through a preset speech emotion analysis model, the The first speech content and the first sentiment classification are input into the recommended model for analysis, and the recommended response technique is obtained, and the recommended response technique is sent to the terminal of the agent for display in real time for Agent reference, help agents deal with customers.

In the technical solution of this embodiment, by acquiring the voice streams of the incoming client and the agent in real time after the customer enters the line, and extracting the first audio feature vector corresponding to the voice segment input by the customer from the acquired voice stream, the The first audio feature vector is input into a preset speech emotion analysis model for analysis and recognition, and the first speech content and the first emotion classification corresponding to the first audio feature vector are identified, and finally a pre-trained response model is used The recognized first speech content and the first sentiment classification are analyzed, and the recommended response should be sent to the agent's terminal in real time for reference by the agent to help the agent deal with the customer; in this way, the customer's own When emotional issues conflict with the agent, the agent does not know how to deal with the customer's situation, reducing customer complaints and customer churn.

As shown in FIG. 2, FIG. 2 is a schematic flowchart of a second embodiment of a method for recommending conversational application in this application.

In this embodiment, after the step S10, the method for recommending conversational skills further includes:

Step S50: Perform feature extraction on the voice stream to extract a second audio feature vector of a second voice segment in the voice stream, where the second voice segment corresponds to the voice input by the agent in the voice stream segment.

The call center customer service system performs feature extraction on the currently acquired voice stream to extract the second audio feature vector corresponding to the voice segment input by the agent in the voice stream. The second audio feature vector may include one or more of the following audio features: energy feature, pronunciation frame feature, pitch frequency feature, formant feature, harmonic-to-noise ratio feature, and Mel cepstral coefficient feature .

In step S60, the second audio feature vector is input into a preset voice emotion analysis model for analysis, and a second emotion classification corresponding to the second audio feature vector is identified.

After extracting the second audio feature vector from the currently acquired voice stream, the call center customer service system inputs the second audio feature vector into a preset voice sentiment analysis model for analysis and recognition to identify the first audio The second emotion classification corresponding to the feature vector (ie, the emotion classification of the agent). The second emotion classification includes: satisfaction, calm, irritability, and anger, etc.

Step S70, if the second emotion classification is a preset abnormal emotion classification, then preset preset reminder information is sent to the terminal of the agent.

When it is found that the emotion classification of the agent (ie, the second emotion classification) is an abnormal emotion classification (for example, irritability, anger, and other emotional classifications that are not positive emotions), the call center customer service system will send a preset to the terminal of the agent The first reminder message is to remind the agent to pay attention to emotions and adjust the service attitude in time. The first reminder message is, for example: "Your service attitude is detected to be negative, please pay attention to adjust the service attitude", and so on.

This embodiment recognizes the emotion classification of the agent in real time, detects the change of the agent's emotion, and prompts the agent to adjust the emotion and state in real time when the agent's emotion is abnormal (that is, deteriorates), so as to better ensure the service quality of the agent to the customer To increase customer satisfaction.

As shown in FIG. 3, FIG. 3 is a schematic flowchart of three embodiments of the application-recommended method of application.

In the method for recommending conversational skills in this embodiment, in step S60, the preset voice sentiment analysis model also recognizes the second voice content corresponding to the second audio feature vector (that is, the agent in the acquired voice stream The text content corresponding to the input voice segment); the recommended dialogue method recommendation method after the step S60, further includes:

Step S01: When the second emotion classification is a preset abnormal emotion classification, analyze whether the second speech content contains a preset sensitive word.

A sensitive thesaurus is set in the customer service system of the call center (the sensitive word library includes many sensitive words). When the identified second emotion classification (that is, the emotion classification of the agent) is a preset abnormal emotion classification, the call center customer service system is based on Sensitive word library, analyze whether the second speech content (that is, the text content corresponding to the speech segment input by the agent in the acquired speech stream) contains preset sensitive words (for example, impolite and uncivilized words).

Step S02: if the preset sensitive words are included, analyze whether the number of times the preset sensitive words appear in the second speech content is greater than the first threshold;

If it is determined that the second voice content contains preset sensitive words, it means that the current agent's words are used improperly. At this time, the agent's word use is further judged by analyzing the number of times the preset sensitive words appear in the second voice content Improper severity, according to the severity of the corresponding treatment. Specifically, the number of occurrences of the preset sensitive word is compared with a first threshold (for example, 3 times).

Step S03: If the number of times the preset sensitive words appear in the second voice content is less than or equal to the first threshold, send preset second reminder information to the terminal of the agent.

When it is determined that the number of occurrences of the preset sensitive words in the second voice content is less than or equal to the first threshold, the call center customer service system determines that the use of the agent's words is not particularly serious, and then sends the agent Terminal sends a preset second reminder message to remind the agent to pay attention to the words and not to use sensitive words. At the same time, the call center customer service system can also send the preset sensitive words appearing in the second voice content of the agent to the terminal of the agent for highlighting. The second reminder message is, for example, "Please pay attention to words, and prohibit the use of sensitive words", etc.

Step S04, if the number of times the preset sensitive words appear in the second voice content is greater than the first threshold, send preset third reminder information to the superior management terminal of the agent.

When it is determined that the number of occurrences of the preset sensitive words in the second voice content is greater than the first threshold, the call center customer service system determines that the agent's words are used improperly, which may be due to the agent arguing with the customer, etc. Abnormal situation, at this time, the call center customer service system sends the preset third reminder message to the superior management terminal of the agent (the terminal of the superior manager) to remind the superior leader or manager of the agent to pay special attention to the agent's call Happening. At the same time, the call center customer service system can also transfer the call voice of the agent and the customer to the superior management terminal of the agent in real time, so that the superior leader or manager can directly monitor the voice process of the call to quarrel with the customer at the agent Timely handling. The third reminder message is, for example: "There is a serious problem with the agent's speech, please deal with it in time", etc.

In this embodiment, the training process of the recommended model should include:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data.

The customer service system of the call center will record every call that a customer enters and save it in the call recording database. Each recorded data is usually marked with a service tag of the customer's satisfaction for the call service provided by the agent. First obtain a preset number (for example, 10,000) of recording data from the call recording database. These recording data are all recording data whose service tag is satisfactory; extract audio feature vectors for each recording data obtained to obtain each The first audio feature vector corresponding to the first speech segment (voice segment input by the customer) in the recorded data and the second audio feature vector corresponding to the second speech segment (voice segment input by the agent). The audio feature vector may include one or more of the following audio features: energy feature, pronunciation frame number feature, pitch frequency feature, formant feature, harmonic noise ratio feature, and Mel cepstrum coefficient feature.

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification, and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples.

Then, a preset voice sentiment analysis model is used to analyze and recognize the first audio feature vector and the second audio feature vector corresponding to each recording data, and the first speech content and the first speech feature corresponding to each first audio feature vector are identified. An emotion classification, and recognizing the second speech content corresponding to each second audio feature vector; the first speech content, the first emotion classification, and the second speech content corresponding to each recording data form a sample, so that A preset number of samples were obtained. The preset speech sentiment analysis model preferably uses a sentiment analysis model including a DNN-HMM acoustic model, an n-gram language model and a wfst weighted graph, and the first audio feature vector and the second audio feature vector are preferably Mel cepstrum coefficients Feature vector; of course, the preset voice sentiment analysis model may also be other sentiment analysis models. The first emotion classification includes: satisfaction, calm, irritability, and anger, etc.

S3. The first percentage of the preset number of samples is used as the training set, and the second percentage is used as the verification set. The sum of the first percentage and the second percentage is less than 100%.

From the obtained preset number of samples, a first percentage (for example, 70%) of the samples is used as the training set, and a second percentage (for example, 25% to 30%) of the samples is used as the verification set.

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

The samples of the training set are used to train the pre-trained recommendation model. After the training is completed, the samples in the verification set are used to verify the completed training model. In this embodiment, the following criteria are used to test the accuracy of the recommended dialogue model: The recommended dialogue model predicts a sample of the verification set after the corresponding dialogue, if the content of the corresponding dialogue exceeds N% ( For example, the content of 95%) is the same as the second speech content of the sample, and it is determined that the prediction is accurate.

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends.

A preset threshold (for example, 97%) of the prediction accuracy rate is preset in the system, which is used to check the training effect of the recommended model of the application. If the prediction accuracy rate is greater than the preset threshold, then the application The training of the dialogue recommended model has reached the preset standard, and then the model training ends.

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.

If the prediction accuracy rate is less than or equal to the preset threshold, it means that the training of the recommended model should not reach the preset standard. It may be that the number of samples in the training set is insufficient or the number of samples in the verification set is insufficient. In this case, increase the preset number (ie, increase the number of samples, for example, increase the fixed number each time or increase the random number each time), and then, on this basis, re-execute the above steps S1-S4 , And so on, until the requirement of step S5 is reached, then the model training is ended.

In addition, this application also proposes a recommended procedure for dialogue.

Please refer to FIG. 4, which is a schematic diagram of the operating environment of the preferred embodiment of the application recommended procedure 10 of the present application.

In this embodiment, the conversational recommendation program 10 should be installed and run in the electronic device 1. The electronic device 1 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a server. The electronic device 1 may include, but is not limited to, a memory 11, a processor 12, and a display 13. FIG. 4 only shows the electronic device 1 having the components 11-13, but it should be understood that it is not required to implement all the components shown, and more or fewer components may be implemented instead.

The memory 11 may be an internal storage unit of the electronic device 1 in some embodiments, such as a hard disk or a memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk equipped on the electronic device 1, a smart memory card (Smart, Media, Card, SMC), and a secure digital (SD) Cards, flash cards, etc. Further, the memory 11 may also include both an internal storage unit of the electronic device 1 and an external storage device. The memory 11 is used to store application software and various types of data installed in the electronic device 1, for example, program codes of the program 10 recommended for communication. The memory 11 can also be used to temporarily store data that has been or will be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), microprocessor, or other data processing chip, which is used to run the program code or process data stored in the memory 11, for example, to execute the interactive technology Recommended program 10 etc.

In some embodiments, the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light emitting diode) touch device, or the like. The display 13 is used to display information processed in the electronic device 1 and to display a visual user interface. The components 11-13 of the electronic device 1 communicate with each other through a program bus.

Please refer to FIG. 5, which is a program block diagram of an embodiment of the application recommended program 10 of this application. In this embodiment, the application recommendation program 10 can be divided into one or more modules, and the one or more modules are stored in the memory 11 and are processed by one or more processors (in this embodiment, the processor 12 ) To complete this application. For example, in FIG. 5, the conversational skill recommendation program 10 may be divided into a real-time acquisition module 101, a first extraction module 102, a first recognition module 103, and a recommendation module 104. The module referred to in this application refers to a series of computer program instruction segments capable of performing specific functions, and is more suitable than the program for describing the execution process of the recommended program 10 in the electronic device 1, wherein:

The real-time acquiring module 101 is configured to acquire the voice streams of the incoming customer and the agent in real time after receiving the incoming customer.

When a customer enters the line and connects with the agent, the call center customer service system obtains in real time the voice stream of the incoming customer and the agent (that is, the interactive audio stream between the customer and the agent) through the telephone voice platform.

The first extraction module 102 is configured to perform feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to the voice stream Speech segments entered by customers.

The first recognition module 103 is configured to input the first audio feature vector into a preset voice emotion analysis model for analysis, and recognize the first voice content and the first emotion classification corresponding to the first audio feature vector.

The recommendation module 104 is configured to input the recognized first speech content and the first emotion classification into a pre-trained dialogue-recommendation recommendation model for analysis, so as to obtain a recommended dialogue-reaction technique, and convert the recommended Dialogue is sent to the agent's terminal in real time.

In this embodiment, for the training method of the conversational recommendation model, refer to the description in the above embodiment of the conversational recommendation method, and details are not described herein.

Referring to FIG. 6, in this embodiment, the conversational skill recommendation program further includes a second extraction module 105, a second recognition module 106 and a reminder module 107.

The second extraction module 105 is configured to perform feature extraction on the voice stream to extract a second audio feature vector of a second voice segment in the voice stream, where the second voice segment corresponds to the voice stream Voice segment entered by the agent.

The second recognition module 106 is configured to input the second audio feature vector into a preset voice emotion analysis model for analysis, and recognize a second emotion classification corresponding to the second audio feature vector.

The reminder module 107 is configured to send preset first reminder information to the terminal of the agent when the second emotion category is a preset abnormal emotion category.

When it is found that the emotion classification of the agent (ie, the second emotion classification) is an abnormal emotion classification (for example, irritability, anger, and other emotional classifications with inactive emotions), the call center customer service system will send the preset The first reminder message is to remind the agent to pay attention to emotions and adjust the service attitude in time. The first reminder message is, for example: "Your service attitude is detected to be negative, please pay attention to adjust the service attitude", and so on.

Referring to FIG. 7, in this embodiment, the second recognition module 106 is further configured to input the second audio feature vector into a preset voice sentiment analysis model for analysis, and identify the second corresponding to the second audio feature vector Voice content (that is, the text content corresponding to the voice segment input by the agent in the acquired voice stream); the dialogue-speaking recommendation program further includes a first analysis module 108 and a second analysis module 109. among them,

The first analysis module 108 is configured to analyze whether the second speech content contains preset sensitive words when determining that the second emotion classification is a preset abnormal emotion classification.

The second analysis module 109 is configured to, when determining that the second voice content contains preset sensitive words, analyze whether the number of times the preset sensitive words appear in the second voice content is greater than a first threshold;

If it is determined that the second voice content contains preset sensitive words, it means that the current agent's words are used improperly. At this time, the agent's word use is further determined by analyzing the number of times the preset sensitive words appear in the second voice content Improper severity, according to the severity of the corresponding treatment. Specifically, the number of occurrences of the preset sensitive word is compared with a first threshold (for example, 3 times).

The reminder module 107 is further configured to send preset second reminder information to the terminal of the agent when it is determined that the number of preset sensitive words appearing in the second voice content is less than or equal to the first threshold.

The reminder module 107 is further configured to send preset third reminder information to the superior management terminal of the agent when it is determined that the number of times the preset sensitive words appear in the second voice content is greater than the first threshold.

When it is determined that the number of occurrences of the preset sensitive words in the second voice content is greater than the first threshold, the call center customer service system determines that the agent's words are used improperly, which may be due to the agent arguing with the customer. Abnormal situation, at this time, the call center customer service system sends the preset third reminder message to the superior management terminal of the agent (the terminal of the superior manager) to remind the superior leader or manager of the agent to pay special attention to the agent's call Happening. At the same time, the call center customer service system can also transfer the call voice of the agent and the customer to the superior management terminal of the agent in real time, so that the superior leader or manager can directly monitor the voice process of the call to quarrel with the customer at the agent Timely handling. The third reminder message is, for example: "There is a serious problem with the agent's speech, please deal with it in time", etc.

Further, the present application also proposes a computer-readable storage medium, the computer-readable storage medium storing a recommended program for conversational skills, the recommended program for conversational skills can be executed by at least one processor, so that the at least A processor executes the method for recommending interaction in any of the above embodiments.

The above are only the preferred embodiments of the present application, and therefore do not limit the patent scope of the present application. Any equivalent structural transformation or direct / indirect use of the content of the description and drawings of the present application under the inventive concept of the present application All other related technical fields are included in the patent protection scope of this application.

Claims

An electronic device, characterized in that the electronic device includes a memory and a processor, and the memory stores a recommended interactive program that can run on the processor, and the recommended recommended program is The processor implements the following steps during execution:

A1. After receiving the incoming customer line, obtain the voice streams of the incoming customer and agent in real time;

B1. Perform feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to a voice segment input by a customer in the voice stream ;

C1. Input the first audio feature vector into a preset voice emotion analysis model for analysis, and identify the first voice content and the first emotion classification corresponding to the first audio feature vector;

D1, input the recognized first speech content and the first sentiment classification into a pre-trained dialogue-recommended recommendation model for analysis, to obtain a recommended dialogue-reporting technique, and put the recommended dialogue-reported technique in real time Sent to the terminal of the agent.
The electronic device according to claim 1, characterized in that, after the step A1, when the recommended program for conversational techniques is executed by the processor, the following steps are further implemented:

E1: Perform feature extraction on the voice stream to extract a second audio feature vector of a second voice segment in the voice stream, where the second voice segment corresponds to a voice segment input by an agent in the voice stream ;

F1. Input the second audio feature vector into a preset voice emotion analysis model for analysis, and identify a second emotion classification corresponding to the second audio feature vector;

G1. If the second emotion classification is a preset abnormal emotion classification, send preset first reminder information to the terminal of the agent.
The electronic device according to claim 2, wherein the preset voice sentiment analysis model also recognizes the second voice content corresponding to the second audio feature vector. After the step F1, the application When the dialogue recommendation program is executed by the processor, the following steps are also implemented:

When the second emotion classification is a preset abnormal emotion classification, analyze whether the second speech content contains a preset sensitive word;

If the preset sensitive words are included, and the number of times the preset sensitive words appear in the second voice content is less than or equal to the first threshold, then send preset second reminder information to the terminal of the agent;

If the preset sensitive words are included, and the number of times the preset sensitive words appear in the second voice content is greater than the first threshold, the preset third reminder information is sent to the superior management terminal of the agent.
The electronic device of claim 1, wherein the preset voice sentiment analysis model includes a DNN-HMM acoustic model, an n-gram language model, and a wfst weighted graph sentiment analysis model.
The electronic device according to claim 1, characterized in that the training process of the recommended model should include:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
The electronic device according to claim 2, characterized in that the training process of the recommended model should include:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
The electronic device according to claim 3, characterized in that the training process of the recommended model should include:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
A method for recommending dialogue technique is characterized by the following steps:

A2. After receiving the incoming customer line, obtain the voice streams of the incoming customer and agent in real time;

B2. Perform feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to a voice segment input by a customer in the voice stream ;

C2. The first audio feature vector is input into a preset voice emotion analysis model for analysis, and the first voice content and the first emotion classification corresponding to the first audio feature vector are identified;

D2. The recognized first speech content and the first emotion classification are input into a pre-trained conversational dialogue recommendation model for analysis to obtain a recommended conversational dialogue, and the recommended conversational dialogue is real-time Sent to the terminal of the agent.
The method for recommending conversational skills according to claim 8, wherein after the step A2, the method for recommending conversational skills further comprises:

E2. Perform feature extraction on the voice stream to extract a second audio feature vector of a second voice segment in the voice stream, where the second voice segment corresponds to a voice segment input by an agent in the voice stream ;

F2. Input the second audio feature vector into a preset voice emotion analysis model for analysis, and identify a second emotion classification corresponding to the second audio feature vector;

G2. If the second emotion classification is a preset abnormal emotion classification, send preset first reminder information to the terminal of the agent.
The method according to claim 9, wherein the preset speech emotion analysis model also recognizes the second speech content corresponding to the second audio feature vector, after step F2, The method for recommending conversational skills also includes:

When the second emotion classification is a preset abnormal emotion classification, analyze whether the second speech content contains a preset sensitive word;

If the preset sensitive words are included, and the number of times the preset sensitive words appear in the second voice content is less than or equal to the first threshold, then send preset second reminder information to the terminal of the agent;

If the preset sensitive words are included, and the number of times the preset sensitive words appear in the second voice content is greater than the first threshold, the preset third reminder information is sent to the superior management terminal of the agent.
The method according to claim 8, wherein the preset speech sentiment analysis model includes a DNN-HMM acoustic model, an n-gram language model and a wfst weighted graph sentiment analysis model.
The method for recommending conversational skills according to claim 8, wherein the training process of the recommendation model for conversational skills includes:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
The method for recommending conversational skills according to claim 9, wherein the training process of the recommendation model for conversational skills includes:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
The method for recommending conversational skills according to claim 10, wherein the training process of the recommendation model for conversational skills includes:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
A computer-readable storage medium, characterized in that the computer-readable storage medium stores an application-recommendation program, and the application-recommendation program can be executed by at least one processor so that the at least one processor Perform the following steps:

After receiving the incoming customer line, obtain the voice streams of the incoming customer and agent in real time;

Performing feature extraction on the voice stream to extract a first audio feature vector of a first voice segment in the voice stream, where the first voice segment corresponds to a voice segment input by a customer in the voice stream;

Input the first audio feature vector into a preset voice emotion analysis model for analysis, and identify the first voice content and the first emotion classification corresponding to the first audio feature vector;

Input the recognized first speech content and the first sentiment classification into a pre-trained conversational recommendation model for analysis to obtain a recommended conversational conversation, and send the recommended conversational conversation in real time to The terminal of the agent.
The computer-readable storage medium according to claim 15, wherein after acquiring the voice streams of the incoming client and the agent in real time, when the conversational recommendation program is executed by the processor, it is also implemented as follows step:

Performing feature extraction on the voice stream to extract a second audio feature vector of a second voice segment in the voice stream, where the second voice segment corresponds to a voice segment input by an agent in the voice stream;

Input the second audio feature vector into a preset voice emotion analysis model for analysis, and identify a second emotion classification corresponding to the second audio feature vector;

If the second emotion classification is a preset abnormal emotion classification, preset preset reminder information is sent to the terminal of the agent.
The computer-readable storage medium of claim 16, wherein the preset voice sentiment analysis model also recognizes the second voice content corresponding to the second audio feature vector, and the Two audio feature vectors are input into a preset voice sentiment analysis model for analysis, and after the step of recognizing the second sentiment classification corresponding to the second audio feature vector is performed, when the recommended conversational program is executed by the processor The following steps are also achieved:

When the second emotion classification is a preset abnormal emotion classification, analyze whether the second speech content contains a preset sensitive word;

If the preset sensitive words are included, and the number of times the preset sensitive words appear in the second voice content is less than or equal to the first threshold, then send preset second reminder information to the terminal of the agent;

If the preset sensitive words are included, and the number of times the preset sensitive words appear in the second voice content is greater than the first threshold, the preset third reminder information is sent to the superior management terminal of the agent.
The computer-readable storage medium of claim 15, wherein the preset speech sentiment analysis model includes a DNN-HMM acoustic model, an n-gram language model, and a wfst weighted graph sentiment analysis model.
The computer-readable storage medium according to claim 15, wherein the training process of the recommended model should include:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the preset number of values, and repeat steps S1 to S4.
The computer-readable storage medium according to claim 16, wherein the training process of the recommended model should include:

S1. Obtain a preset number of service tags from the call recording database as satisfactory recording data, and extract the first audio feature vector of the first speech segment and the second audio feature vector of the second speech segment in each recording data, Wherein, the first voice segment corresponds to a voice segment input by a customer in the recording data, and the second voice segment corresponds to a voice segment input by an agent in the recording data;

S2. Identify the first speech content and the first sentiment category corresponding to the first audio feature vector and the second speech content corresponding to the second audio feature vector using a preset speech emotion analysis model. The first voice content, the first emotion classification and the second voice content corresponding to the recorded data are used as a sample to obtain a preset number of samples;

S3. The first percentage of the preset number of samples is used as the training set, the second percentage is used as the verification set, and the sum of the first percentage and the second percentage is less than 100%;

S4. Use the samples in the training set to train the preset recommended conversational model, and after the training, use the samples in the verification set to verify the recommended conversational model;

S5. If the prediction accuracy rate is greater than a preset threshold, the model training ends;

S6. If the preset accuracy rate is less than or equal to the preset threshold, increase the value of the preset number and repeat steps S1 to S4.