CN113837594A - Quality evaluation method, system, device and medium for customer service in multiple scenes - Google Patents


Info

Publication number
CN113837594A
Authority
CN
China
Prior art keywords
customer service
audio information
emotion
information
customer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111103459.0A
Other languages
Chinese (zh)
Inventor
姚一鸣
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai filed Critical OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202111103459.0A
Publication of CN113837594A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06395Quality analysis or management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00Commerce
    • G06Q30/01Customer relationship services

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Educational Administration (AREA)
  • General Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Game Theory and Decision Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the field of artificial intelligence and provides a method, system, device and medium for evaluating customer service quality in multiple scenarios. The method comprises: acquiring audio information and/or video information of a customer service agent; obtaining a score for each preconfigured assessment item from the audio information and/or the video information, wherein the assessment items comprise script matching, customer service emotion, facial expression, or behavior posture, and each score is produced by a corresponding convolutional neural network; and computing a weighted sum of the assessment item scores and taking the result as the agent's service quality score. The invention automatically preconfigures assessment items for each agent according to the agent's identity information and scores service quality along multiple dimensions (script matching, customer service emotion, facial expression, and behavior posture), thereby obtaining a comprehensive and accurate evaluation result.

Description

Quality evaluation method, system, device and medium for customer service in multiple scenes
Technical Field
The invention relates to the field of artificial intelligence, and in particular to a method, system, device and medium for evaluating customer service quality in multiple scenarios.
Background
Customer service provides assistance, guidance, and similar services based on customers' actual needs, so that what customers actually experience better matches what they expect. In service industries, an enterprise's customer service quality plays an important role in market competition, profitability, and growth, and is one of the enterprise's key means of understanding customer needs, marketing its products, and delivering services.
Today, customer service quality is generally evaluated either on the customer side or on the enterprise side. The customer side relies mainly on manual scoring by the customer when the service ends, plus suggestions or complaints submitted by phone, offline channels, and the like; the enterprise side relies mainly on manual spot checks by a quality inspection department. Both approaches have problems: customer ratings correlate poorly with the actual service and are highly subjective, making the scores inaccurate; and the enterprise evaluates its agents along a single dimension while relying heavily on manpower, which raises costs and cannot comprehensively evaluate service quality.
Disclosure of Invention
In view of the above shortcomings of the prior art, an object of the present invention is to provide a method, system, device, and medium for evaluating customer service quality in multiple scenarios, so as to solve the prior-art problems of a single evaluation dimension, the inability to comprehensively evaluate service quality, and inaccurate evaluation data.
The first aspect of the present invention provides a quality evaluation method for customer service in multiple scenarios, comprising: acquiring audio information and/or video information of a customer service agent; obtaining, according to preconfigured assessment items, a score for each corresponding assessment item from the audio information and/or the video information, wherein the assessment items comprise script matching, customer service emotion, facial expression, or behavior posture; the script matching score is obtained from a script matching model, the customer service emotion score from a speech emotion recognition model, the facial expression score from an expression recognition model, and the behavior posture score from a behavior recognition model; and computing a weighted sum of the assessment item scores and taking the result as the agent's service quality score.
In an embodiment of the present invention, the method further comprises: inputting the audio information, or the audio carried in the video information, into a voiceprint recognition model to extract voiceprint features; calculating the similarity between each voiceprint feature and a voiceprint sample of the agent; treating segments whose similarity exceeds a first preset threshold as the agent's speech and extracting them as customer service audio information; and treating segments whose similarity falls below the first preset threshold as customer audio information.
In an embodiment of the present invention, obtaining the score of the corresponding assessment item from the audio information and/or the video information comprises one or more of the following steps:
converting the customer service audio information and the customer audio information into text, calculating the similarity between the text and a pre-stored script template through a script matching model, and computing the agent's script matching score from the similarity according to a preset scoring rule;
inputting the customer service audio information into a speech emotion recognition model to extract features, calculating the similarity between each extracted feature and the labeled emotion samples, assigning features whose similarity exceeds a second preset threshold to the emotion category of the corresponding sample to obtain an emotion classification of the audio, and computing the agent's customer service emotion score from the classification results according to a preset rule;
inputting the video information into an expression recognition model to extract facial key-point features, calculating the distance between each facial key-point feature and each labeled facial expression sample, assigning features whose distance falls below a third preset threshold to the expression category of the corresponding sample to obtain a facial expression classification of the video, and computing the agent's facial expression score from the classification results according to a preset rule;
inputting the video information into a behavior recognition model to extract posture key-point features, calculating the distance between each posture key-point feature and each labeled posture sample, assigning features whose distance falls below a fourth preset threshold to the posture category of the corresponding sample to obtain a human posture classification of the video, and computing the agent's behavior posture score from the classification results according to a preset rule.
In an embodiment of the present invention, converting the customer service audio information and the customer audio information into text comprises:
inputting the customer service audio information and the customer audio information into a speech-to-text model to extract sound feature quantities, matching each sound feature quantity against the standard sound data in a sound library, and selecting the text corresponding to the best-matching standard sound data as the converted text.
In an embodiment of the present invention, before the customer service audio information is input into the speech emotion recognition model: the audio is filtered at a preset sampling rate to retain the effective range of the signal; the long speech data within that range is segmented in the time domain; each segment is converted into a spectrogram by short-time Fourier transform; and the spectrograms are input into the speech emotion recognition model.
In an embodiment of the present invention, calculating the similarity between the text and the pre-stored script template through the script matching model comprises:
converting every question in the script template into a vector using the word2vec algorithm, and likewise converting every customer question in the text into a vector; calculating the similarity between the two sets of vectors using the BERT algorithm, so that each customer question is located to a single question in the script template;
converting the answer associated with that single question in the script template into a vector using the word2vec algorithm, and likewise converting the agent's answer to each customer question into a vector; and calculating the similarity between the two sets of vectors using the BERT algorithm to obtain the similarity between the text and the script template.
In an embodiment of the invention, the assessment items are preconfigured according to the identity information of each customer service agent.
The second aspect of the present invention provides a quality evaluation system for customer service in multiple scenarios, comprising:
a receiving module, configured to acquire audio information and/or video information of a customer service agent;
an assessment item scoring module, configured to obtain, according to preconfigured assessment items, a score for each corresponding assessment item from the audio information and/or the video information, wherein the assessment items comprise script matching, customer service emotion, facial expression, or behavior posture; the script matching score is obtained from a script matching model, the customer service emotion score from a speech emotion recognition model, the facial expression score from an expression recognition model, and the behavior posture score from a behavior recognition model;
and a service quality evaluation module, configured to compute a weighted sum of the assessment item scores and take the result as the agent's service quality score.
The third aspect of the present invention provides a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the steps of any of the quality evaluation methods for customer service in multiple scenarios according to the first aspect of the present invention.
The fourth aspect of the present invention provides a storage medium storing a computer program which, when executed by a processor, implements the steps of any of the quality evaluation methods for customer service in multiple scenarios according to the first aspect of the present invention.
As described above, the method, system, device and medium for evaluating customer service quality in multiple scenarios according to the present invention have the following advantages:
based on the agent's audio and/or video information and the preconfigured assessment items, the method and system automatically score service quality along multiple dimensions, such as script matching, customer service emotion, facial expression, and behavior posture, thereby obtaining a comprehensive and accurate evaluation result.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a view showing an application scenario of the quality evaluation method according to the first embodiment of the present invention.
Fig. 2 is a schematic flow chart of a quality evaluation method according to a first embodiment of the present invention.
FIG. 3 is a flow chart illustrating matching assessment items according to a first embodiment of the present invention.
FIG. 4 is a block diagram showing the structure of a quality evaluation system according to a second embodiment of the present invention.
Fig. 5 is a schematic diagram of a computer device according to a third embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only for illustrating the basic idea of the present invention, and the drawings only show the components related to the present invention rather than the number, shape and size of the components in practical implementation, and the type, quantity and proportion of the components in practical implementation can be changed freely, and the layout of the components can be more complicated.
Embodiments of the present application may acquire and process relevant data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
The artificial intelligence infrastructure generally includes sensors, dedicated AI chips, cloud computing, distributed storage, big-data processing, operation/interaction systems, and mechatronics. AI software technology mainly comprises computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Referring to Fig. 1, a first embodiment of the present invention relates to a method for evaluating customer service quality in multiple scenarios, applicable to the application scenario shown in Fig. 1. The scenario comprises a customer service terminal 101 and a server 102. The customer service terminal 101 is connected to one or more monitoring devices 103; while an agent is providing service, each monitoring device 103 automatically transmits audio or video information to the customer service terminal 101. The customer service terminal 101 establishes a remote connection with the server 102 over a network and uploads the audio or video information collected by each monitoring device 103 to the server 102. The server 102 then scores the agent's service quality from the audio or video information along dimensions such as script matching, emotion, facial expression, or behavior posture, and sends the service quality score back to the customer service terminal 101. Here the server 102 includes, but is not limited to: a network host, a single network server, multiple sets of network servers, or a cluster of servers. The scoring process is described in detail below:
referring to fig. 2, a first embodiment of the present invention relates to a method for evaluating quality of customer service in multiple scenes, which specifically includes:
step 201, audio information and/or video information of the customer service is obtained.
Specifically, after the initial audio or video information uploaded by each monitoring device 103 is received, it is preprocessed. If video information is received, audio and video are separated and the audio track is transcribed to obtain the audio information. For audio to be analyzed, the preprocessing further includes: deleting audio so noisy that human speech cannot be clearly distinguished; formatting and energy normalization, converting everything to a uniform format; channel equalization and similar processing of the normalized audio to reduce noise interference and improve speech quality; and framing the audio.
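As a minimal sketch of the audio side of this preprocessing, assuming librosa for loading and framing (the patent names no concrete tools, and the sampling rate and frame sizes below are illustrative):

```python
import numpy as np
import librosa

def preprocess_audio(path, sr=16000, frame_len=400, hop=160):
    """Load (already-demuxed) audio, unify the format, normalize energy,
    and split into frames. 16 kHz with 25 ms / 10 ms framing is an
    illustrative choice, not a value from the patent."""
    y, _ = librosa.load(path, sr=sr, mono=True)   # uniform format: mono, fixed rate
    y = y / (np.max(np.abs(y)) + 1e-8)            # energy normalization
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    return frames.T                               # (num_frames, frame_len)
```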
It should be understood that the initial audio and/or video information records the conversation between the agent and the customer. To facilitate the subsequent quality analysis, the agent's identity should be confirmed first, and the agent's audio and the customer's audio should then be separated and stored individually.
The identity information generally includes the name and job number of the customer service, and if necessary, may also include information such as job post, working time or place. The audio information and/or video information uploaded by the customer service terminal 101 should carry corresponding identity information as a tag, so as to distinguish each customer service. In addition, the server 102 may also confirm the identity information of the customer service according to the received audio information and video information, for example, by means of login account identification, voiceprint identification, face identification, or the like.
Continuing, distinguishing the agent's audio from the customer's audio comprises:
inputting the preprocessed audio information into a voiceprint recognition model to extract a voiceprint feature sequence, and calculating the similarity between the extracted sequence and the agent's voiceprint sample; if the similarity exceeds the first preset threshold, the speaker is identified as the agent, and otherwise as a customer. Moreover, because every person's voiceprint is unique and remains relatively stable for the same speaker, the voiceprint recognition model can also distinguish different customers from one another when several speakers are present in a session.
Further, the voiceprint recognition model is trained with a GMM-SVM algorithm: a Gaussian mixture model (GMM) describes a speaker's feature-space distribution as a weighted mixture of several Gaussian distributions, and a support vector machine (SVM) performs the classification judgment of whether the speaker is the agent or a customer.
The voiceprint recognition model is trained as follows. The agent's audio information, combined with a publicly available speech recognition dataset, serves as the training and test sets; the training set comprises raw audio data and labeled audio data. The training set is fed into the initial voiceprint recognition model to obtain predictions; a loss function computes the loss between predictions and labels; the model parameters are corrected according to the loss; the correction stops once the loss meets a preset condition; and the parameters with the smallest loss are kept as the trained model's parameters. The test set is then fed into the trained model to verify recognition accuracy.
Wherein the initial GMM is
$$p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
where $\pi_k$ is the weight coefficient and $\mathcal{N}(x \mid \mu_k, \Sigma_k)$ is the k-th component of the mixture model.
The initial SVM decision function is
$$f(x) = w^{\top} \phi(x) + b$$
where $\phi(x)$ denotes the feature vector of x after kernel-function mapping, and w and b are the classification-hyperplane parameters to be solved.
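A hedged sketch of the GMM-SVM idea with scikit-learn, in which averaged GMM posteriors stand in for the usual supervector construction; the component count, kernel, and threshold are illustrative choices, not values from the patent:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

# Fit on pooled MFCC frames from many speakers before use (a small UBM).
ubm = GaussianMixture(n_components=64, covariance_type="diag")

def utterance_vector(mfcc_frames):
    """Average the GMM posteriors over an utterance's frames: a crude
    stand-in for the usual GMM supervector."""
    return ubm.predict_proba(mfcc_frames).mean(axis=0)

# Fit on labeled utterance vectors: y = 1 for the agent, 0 for customers.
clf = SVC(kernel="rbf", probability=True)

def is_agent(mfcc_frames, threshold=0.5):
    """Decide agent vs. customer; `threshold` plays the role of the
    first preset threshold."""
    v = utterance_vector(mfcc_frames).reshape(1, -1)
    return clf.predict_proba(v)[0, 1] > threshold
```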
Step 202: obtaining, according to the preconfigured assessment items, the score of each corresponding assessment item from the audio information and/or the video information.
Specifically, agents in different posts work in different scenarios, so assessment items can be configured for each agent in advance according to the agent's identity information.
Referring to Fig. 3, in a telephone scenario the agent and the customer can only communicate by voice, whereas in a video scenario or an offline scenario the agent's video information can be acquired in addition to the audio. The agent's post can therefore be looked up from the identity information, and a scoring scheme configured for that post. For an agent working in a telephone scenario, the audio can be analyzed for whether the agent fully resolved the customer's questions, whether the service flow met the process requirements, whether the agent's tone was positive while answering, and so on. For an agent working in a video scenario, where the video cannot capture complete body movements, the scoring scheme may cover script matching, emotion, and facial expression. For an agent working in an offline scenario, where clear facial expressions cannot be captured for reasons of customer privacy but body movements can, the scoring scheme may cover script matching, emotion, and behavior posture. In this embodiment, the assessment items comprise script matching, customer service emotion, facial expression, or behavior posture.
Further, obtaining the score of the corresponding assessment item according to the audio information and/or the video information comprises one or more of the following steps:
step one, converting customer service audio information and customer audio information into text information, calculating the similarity between the text information and a pre-stored dialect template, and calculating the dialect matching score of the customer service according to the similarity and a preset scoring rule.
Specifically, customer service audio information and customer audio information are continuous spliced frames, the customer service audio information and the customer audio information are input into a voice-to-text model, sound characteristic quantities are extracted, each extracted sound characteristic quantity is matched with standard sound data in a sound library, and a text corresponding to the standard sound data with the highest matching degree is selected as converted text information. The voice-to-character model is obtained based on DNN-HMM neural network training, the deep neural network DNN is used for modeling the observation probability of input audio information, and the hidden Markov model HMM is used for modeling the jump relation among different states so as to depict the time sequence change of a voice signal and obtain the best matching between an unknown voice form and one model in a model library.
Further, the method for training the speech-to-text model comprises: adopting audio information of customer service, and combining a voice data set disclosed by a network as a training set and a testing set, wherein the training set comprises original sound data and standard sound data carrying a label; inputting the training set into an initial voice-to-character model to obtain a predicted value, calculating the predicted value and a loss value of a label value by adopting a loss function, correcting parameters of the initial voice-to-character model according to the loss value, stopping the correction process when the loss value reaches a preset condition, and selecting the parameter with the smallest loss value as the parameter of the trained voice-to-character model; and inputting the test set into the trained voice-to-character model, and verifying the character conversion precision.
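A full DNN-HMM decoder is beyond a short example; the toy below substitutes averaged MFCCs and cosine similarity for it, purely to illustrate the "match each sound feature quantity against the sound library and keep the best-scoring text" step described above:

```python
import numpy as np
import librosa

def mfcc_signature(y, sr=16000):
    """Collapse an utterance into one fixed-size sound feature vector."""
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).mean(axis=1)

def best_text(y, sound_library, sr=16000):
    """sound_library: list of (signature_vector, text) pairs, i.e. the
    'standard sound data'. Returns the text with the highest match."""
    q = mfcc_signature(y, sr)
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return max(sound_library, key=lambda item: cos(q, item[0]))[1]
```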
For the converted agent text and customer text, the customer's questions in the customer text are stored as the first class of data and recorded as keys; the agent's answers to those questions in the agent text are stored as the second class of data and recorded as values; and the two classes are stored in one-to-one correspondence.
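The key/value layout amounts to a dictionary from each customer question to the agent's answer; a minimal illustration with invented entries:

```python
# First-class data (customer questions) as keys; second-class data
# (the agent's answers) as values. The entries here are invented examples.
qa_pairs = {
    "How do I reset my password?": "Tap 'Forgot password' on the login page ...",
    "When will my refund arrive?": "Refunds are normally processed within ...",
}
```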
Continuing, this embodiment uses a script matching model to calculate the similarity between the converted agent and customer text and the pre-stored script template. The script matching model is trained on the word2vec and BERT algorithms: word2vec converts text into vectors and predicts highly related words from context, while BERT computes the similarity between vectors. The pre-stored script template uses a dictionary storage structure, i.e. questions and answers stored in one-to-one correspondence.
Specifically, calculating the similarity between the converted text and the pre-stored script template proceeds in two steps:
first, every question in the script template is converted into a vector using word2vec, the stored first-class data is likewise converted into vectors, the similarity between the two sets of vectors is computed with BERT, and each customer question key is thereby located to a single question in the script template;
second, the answer associated with that single question in the script template is converted into a vector using word2vec, the agent's answer value corresponding to the customer's question key is likewise converted into a vector, and the similarity between the two vectors is computed with BERT, yielding the similarity between the text and the pre-stored script template.
It should be noted that, to prevent redundant conversation and excessive filler words from degrading matching precision, the converted agent and customer text may first be cleaned and word-segmented, screening out the keywords of each customer question key and of each agent answer value; correspondingly, the keywords of every question in the script template and of every answer are screened out as well. The two-step matching then proceeds on keywords, as in the sketch after this list:
first, the keywords of every question in the script template and the keywords of the first-class data are converted into vectors through the script matching model, the similarity between the two sets of vectors is computed, and the customer's question key is located to a single question in the script template;
second, the keywords of that question's answer and the keywords of the agent's answer value are converted into vectors, and the similarity between the two vectors is computed, yielding the similarity between the text and the pre-stored script template.
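A sketch of the two-stage similarity computation using a BERT encoder with mean pooling as the sentence vectorizer; the bert-base-chinese checkpoint is an assumption, and the word2vec stage is folded into the same encoder for brevity:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
enc = AutoModel.from_pretrained("bert-base-chinese")

def embed(texts):
    """Mean-pooled BERT sentence vectors."""
    batch = tok(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**batch).last_hidden_state           # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    return (hidden * mask).sum(1) / mask.sum(1)

def script_match_similarity(question, answer, template):
    """template: dict mapping script questions to script answers.
    Stage 1 locates the single template question; stage 2 scores the answer."""
    keys = list(template)
    sims = F.cosine_similarity(embed([question]), embed(keys))
    matched_q = keys[int(sims.argmax())]
    return float(F.cosine_similarity(embed([answer]),
                                     embed([template[matched_q]])))
```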
Further, the script matching model is trained as follows. Text converted from customer service sessions, combined with a publicly available customer service script dataset, serves as the training and test sets; the training set comprises raw text data and labeled standard text data. The training set is fed into the initial script matching model to obtain predictions; a loss function computes the loss between predictions and labels; the model parameters are corrected according to the loss; the correction stops once the loss meets a preset condition; and the parameters with the smallest loss are kept. The test set is then fed into the trained model to verify matching accuracy.
In one possible embodiment, the initial value of the script matching score VoiceMatch is set to 100 points. The customer service audio and customer audio are fed through the speech-to-text model and then the script matching model to obtain the similarity with the script template, and the score is assigned from the similarity according to a preset scoring rule; for example, with a similarity of 60%, the final script matching score VoiceMatch is 60 points.
Step two: obtaining the corresponding emotion classification from the customer service audio information, and computing the agent's customer service emotion score from the classification results according to a preset rule.
Specifically, different acoustic features of the human voice convey different emotions, so a trained speech emotion recognition model can perform emotion recognition on a speech signal by recognizing its acoustic features.
Before emotion recognition, the customer service audio information is preprocessed in two steps (a sketch follows below):
filtering the audio at a preset sampling rate to retain the effective range of the speech spectrum;
segmenting the long speech data within that range in the time domain, and converting each segment into a spectrogram by short-time Fourier transform.
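A sketch of these two steps with scipy and librosa; the band edges, segment length, and STFT sizes are illustrative values, not ones given by the patent:

```python
import numpy as np
import librosa
from scipy.signal import butter, sosfilt

def to_spectrograms(y, sr=16000, band=(80, 4000), segment_s=3.0, n_fft=512, hop=160):
    """Band-limit the audio to the effective speech range, cut the long
    speech data into fixed-length segments, and return one log-magnitude
    STFT spectrogram per segment."""
    sos = butter(6, band, btype="bandpass", fs=sr, output="sos")
    y = sosfilt(sos, y)                                   # keep the effective range
    seg_len = int(segment_s * sr)
    segments = [y[i:i + seg_len]
                for i in range(0, len(y) - seg_len + 1, seg_len)]
    return [librosa.amplitude_to_db(np.abs(librosa.stft(s, n_fft=n_fft, hop_length=hop)))
            for s in segments]
```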
Obtaining the emotion classification from the spectrograms of the customer service audio then proceeds as follows:
each spectrogram is input into the speech emotion recognition model for feature extraction; the similarity between every extracted feature and the labeled emotion samples is computed; and any feature whose similarity to an emotion sample exceeds the second preset threshold is assigned to that sample's emotion category. The percentage of total speech time occupied by each emotion is then computed, negative emotion counting as a deduction and positive emotion as an addition, and the weighted sum yields the agent's customer service emotion score.
Further, the speech emotion recognition model is based on an existing CRNN model: the CNN extracts features from the input spectrogram, the RNN learns from those features and outputs a predicted distribution, and the model is trained by comparing predictions against the true labels with a CTC loss and correcting accordingly. The training procedure is: customer service audio, combined with a publicly available Chinese speech emotion dataset, serves as the training and test sets; the training set comprises raw speech data and labeled standard speech data, manually annotated in this embodiment with the three categories "positive", "neutral", and "negative". The training set is fed into the initial model to obtain predictions; the CTC loss between predictions and labels is computed; the model parameters are corrected according to the loss; the correction stops once the loss meets a preset condition; and the parameters with the smallest loss are kept. The test set is then fed into the trained model to verify emotion recognition accuracy.
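For concreteness, a minimal PyTorch sketch of a CRNN of this shape; the layer sizes are illustrative, and where the patent trains with a CTC loss, this sketch uses a plain classification head over the three categories for brevity:

```python
import torch
import torch.nn as nn

class EmotionCRNN(nn.Module):
    """CNN feature extractor over the spectrogram, GRU over time,
    linear head over {positive, neutral, negative}."""
    def __init__(self, n_freq=257, hidden=128, n_classes=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.rnn = nn.GRU(32 * (n_freq // 4), hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, spec):                      # spec: (B, 1, n_freq, T)
        f = self.cnn(spec)                        # (B, 32, n_freq//4, T//4)
        f = f.permute(0, 3, 1, 2).flatten(2)      # (B, T//4, 32*(n_freq//4))
        out, _ = self.rnn(f)
        return self.head(out[:, -1])              # logits from the last step
```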
In one possible embodiment, the initial value of the customer service emotion score VoiceEmotion is set to 100 points; the preprocessed customer service audio is fed into the speech emotion recognition model to obtain the emotion judgments, and the deduction is set from the percentage of total speech time occupied by negative emotion. For example, if the speech lasts 1 minute of which 30 seconds, i.e. 50%, is negative, 50 points are deducted and the final customer service emotion score VoiceEmotion is 50 points.
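The deduction rule of this example, the negative-time percentage subtracted from the initial score, can be transcribed directly; the segment/label representation is an assumption:

```python
def voice_emotion_score(segments, init_score=100.0):
    """segments: list of (duration_seconds, label) pairs, label in
    {"positive", "neutral", "negative"}. Deducts the percentage of
    total speech time spent in negative emotion."""
    total = sum(d for d, _ in segments) or 1e-9
    negative = sum(d for d, lab in segments if lab == "negative")
    return init_score - 100.0 * negative / total

# 30 s negative out of 60 s -> 100 - 50 = 50 points, as in the example above
assert voice_emotion_score([(30, "negative"), (30, "neutral")]) == 50.0
```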
It will be appreciated that, in practice, when the speech data is manually labeled, the emotion categories may be refined as desired, e.g. happy, sad, angry, confused, and so on.
Step three: obtaining the agent's facial expression classification from the video information, and computing the agent's facial expression score from the classification results according to a preset rule.
Specifically, a facial image of the agent is extracted from the video information at a preset interval, for example once every five seconds, to obtain an image dataset; facial expressions in the dataset are then recognized and classified with the expression recognition model as follows:
face detection is performed on the image dataset with the open-source SeetaFace6 face recognition algorithm, which locates the face region in each image and extracts its facial key-point features; the key-point features are input into the expression recognition model, the distance between each facial key-point feature and each labeled facial expression sample is computed, and any key-point feature whose distance to a facial expression sample falls below the third preset threshold is assigned to that sample's expression category. Over the whole service session, an initial score can be set and a preset number of points deducted each time a negative expression is detected, finally yielding the agent's facial expression score.
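A sketch of the distance-based assignment to labeled expression samples; the Euclidean metric is an assumption, and the same pattern serves the posture classification in step four:

```python
import numpy as np

def classify_keypoints(feature, labeled_samples, threshold):
    """feature: a key-point feature vector; labeled_samples: list of
    (sample_vector, label) pairs. Returns the nearest sample's label if
    its distance beats the preset threshold, else None."""
    best_label, best_dist = None, float("inf")
    for sample, label in labeled_samples:
        d = float(np.linalg.norm(feature - sample))
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist < threshold else None
```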
Further, the expression recognition model is trained on the existing Vgg16 convolutional neural network as follows. Agent facial images extracted from the video information, combined with a publicly available facial expression dataset, serve as the training and test sets; the training set comprises raw facial images and labeled facial images, manually annotated in this embodiment with expressions such as calm, smiling, and angry. The training set is fed into the initial expression recognition model to obtain predictions; a cross-entropy loss between predictions and labels is computed; the model parameters are corrected according to the loss; the correction stops once the loss meets a preset condition; and the parameters with the smallest loss are kept. The test set is then fed into the trained model to verify expression recognition accuracy.
In one possible embodiment, the facial expression score may be calculated with the following formula:
$$\mathrm{ExpressionScore} = \mathrm{ExpInitScore} - \sum_{t=1}^{N_{Exp}} \mathrm{ExpDectScore}$$
where ExpressionScore is the facial expression score; ExpInitScore is the preset initial facial expression score; t indexes the negative expressions detected in a single service session; N_Exp is the total number of negative expressions in the session; and ExpDectScore is the preset deduction per negative expression.
For example, with the initial value of ExpressionScore set to 100 points and 20 points deducted each time a negative expression such as anger is detected, detecting 2 negative expressions gives ExpressionScore = 100 − 2 × 20 = 60 points.
Step four: obtaining the human posture classification from the video information, and computing the agent's behavior posture score from the classification results according to a preset rule.
Specifically, a body image of the agent is extracted from the video information at a preset interval, for example once every five seconds, to obtain an image dataset; behavior postures in the dataset are then recognized and classified with the behavior recognition model as follows:
human key points such as the torso, feet, and fingers are identified in real time with the open-source OpenPose human posture recognition algorithm, and posture key-point features are extracted; the key-point features are input into the behavior recognition model, the distance between each posture key-point feature and each labeled posture sample is computed, and any key-point feature whose distance to a posture sample falls below the fourth preset threshold is assigned to that sample's posture category. Over the whole service session, an initial score can be set and a preset number of points deducted each time a negative action is detected, finally yielding the agent's behavior posture score.
Further, the behavior recognition model is trained on the ResNet50 residual network as follows. Agent body images extracted from the video information, combined with a publicly available human behavior classification dataset, serve as the training and test sets; the training set comprises raw body images and labeled body images, manually annotated in this embodiment with actions such as slapping the table and resting the chin on a hand. The training set is fed into the initial behavior recognition model to obtain predictions; a cross-entropy loss between predictions and labels is computed; the model parameters are corrected according to the loss; the correction stops once the loss meets a preset condition; and the parameters with the smallest loss are kept. The test set is then fed into the trained model to verify behavior recognition accuracy.
In one possible embodiment, the behavior posture score may be calculated with the following formula:
$$\mathrm{ActScore} = \mathrm{ActInitScore} - \sum_{t=1}^{N_{Act}} \mathrm{ActDectScore}$$
where ActScore is the behavior posture score; ActInitScore is the preset initial behavior posture score; t indexes the negative actions detected in a single service session; N_Act is the total number of negative actions in the session; and ActDectScore is the preset deduction per negative action.
With the initial value of ActScore set to 100 points and 30 points deducted each time a negative action such as slapping the table or resting the chin on a hand is detected, detecting 2 negative actions gives ActScore = 100 − 2 × 30 = 40 points.
It should be understood that the classification of the negative expressions or negative actions may be further subdivided according to the scoring requirements, and this embodiment is not particularly limited thereto.
Step 203: computing a weighted sum of the assessment item scores and taking the result as the agent's service quality score.
Specifically, assessment items are configured according to the agent's post and working scenario, and a weight is set for each item; the weights are assigned from practical experience and can be adjusted as needed in application.
Once the assessment items are configured, their scores are calculated and combined in a weighted sum, whose result is the agent's service quality score. In practical application, the weighted sum may be written as:
$$\mathrm{ComScore} = \mathrm{VoiceMatch} \cdot Coef_{vm} + \mathrm{VoiceEmotion} \cdot Coef_{ve} + \mathrm{ExpressionScore} \cdot Coef_{Exp} + \mathrm{ActScore} \cdot Coef_{Act}$$
where ComScore is the service quality score; VoiceMatch the script matching score; VoiceEmotion the customer service emotion score; ExpressionScore the facial expression score; ActScore the behavior posture score; and Coef_vm, Coef_ve, Coef_Exp, and Coef_Act the weights for script matching, customer service emotion, facial expression, and behavior posture, respectively.
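A direct transcription of this weighted sum; the dictionary keys are hypothetical names, and items absent from a given scene can simply be omitted from both dicts:

```python
def service_quality_score(scores, weights):
    """Weighted sum over whatever assessment items were configured."""
    return sum(scores[k] * weights[k] for k in scores)

# The video-scene example below: 70*0.3 + 90*0.3 + 80*0.4 = 80 points.
com_score = service_quality_score(
    {"script_match": 70, "emotion": 90, "expression": 80},
    {"script_match": 0.3, "emotion": 0.3, "expression": 0.4},
)
```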
In practical application, the scores of some assessment items may first be combined in a weighted sum, and that result then weighted and summed with the remaining item scores. For example, once the script matching score VoiceMatch and the customer service emotion score VoiceEmotion are obtained, a voice evaluation score is computed as:
$$\mathrm{VoiceScore} = \mathrm{VoiceMatch} \times Coef_{vm} + \mathrm{VoiceEmotion} \times Coef_{ve}$$
where VoiceScore is the voice evaluation score; VoiceMatch the script matching score; VoiceEmotion the customer service emotion score; Coef_vm the script matching weight; and Coef_ve the customer service emotion weight.
The voice evaluation score is then weighted and summed with the facial expression score and/or the behavior posture score to obtain the agent's service quality score.
In one possible embodiment, if the working scenario is a video scenario, the complete human posture cannot be captured, so only audio information and facial expressions are available. Suppose the computed script matching score VoiceMatch is 70 points, the customer service emotion score VoiceEmotion is 90 points, and the facial expression score ExpressionScore is 80 points. Since no behavior posture is evaluated, the script matching weight Coef_vm may be set to 0.3, the customer service emotion weight Coef_ve to 0.3, and the facial expression weight Coef_Exp to 0.4; the agent's single-session service quality score ComScore is then:
$$\mathrm{ComScore} = 70 \times 0.3 + 90 \times 0.3 + 80 \times 0.4 = 80 \text{ points}$$
In this way, from the agent's audio and video information and the preconfigured assessment items, the service quality is automatically scored along multiple dimensions such as script matching, customer service emotion, facial expression, and behavior posture, yielding a comprehensive and accurate customer service evaluation result.
Referring to fig. 4, a second embodiment of the present invention relates to a quality evaluation system for customer service in multiple scenes, including:
the receiving module 401 is configured to obtain audio information and/or video information of the customer service.
Specifically, the receiving module 401 includes a preprocessing unit, an identity confirming unit, and a voiceprint recognition unit;
the preprocessing unit is configured to perform preprocessing on the initial audio information or the initial video information uploaded by each monitoring device 103 after receiving the initial audio information or the initial video information, where the preprocessing includes: if the received video information is the video information, carrying out audio-video separation on the initial video information, and transcribing the audio track data in the initial video information to obtain the audio information; for audio information that needs to be analyzed, the preprocessing further includes: audio information which is high in noise and cannot clearly distinguish human voice is deleted; formatting and energy normalization are carried out, and the formatting and energy normalization are converted into a uniform format; performing channel equalization and other processing on the regulated audio information to reduce the interference of noise on the voice signal and improve the quality of voice data; the audio information is framed.
And the identity confirmation unit is used for confirming the identity information of the customer service. The identity information generally includes the name and job number of the customer service, and if necessary, may also include information such as job post, working time or place. The audio information and the video information uploaded by the customer service terminal 101 should carry corresponding identity information as a tag, so as to distinguish each customer service. In addition, the server 102 may also confirm the identity information of the customer service according to the received audio information and video information, for example, by means of login account identification, voiceprint identification, face identification, or the like.
And the voiceprint recognition unit is used for distinguishing the audio information of the customer service and the audio information of the client. The distinguishing step comprises the following steps: inputting the preprocessed audio information into a voiceprint recognition model, extracting a voiceprint characteristic sequence, calculating the similarity between the extracted voiceprint characteristic sequence and a voiceprint sample of the customer service, and if the similarity is greater than a first preset threshold value, determining that the speaker is the customer service; otherwise, the speaker is identified as the client. In addition, because each person has unique voiceprint characteristics and the change of the same speaker is kept relatively stable, when a plurality of speakers exist in the customer service, the voiceprint recognition model can also recognize different clients except the customer service.
Further, the voiceprint recognition unit includes a first model training subunit, configured to train the voiceprint recognition model, and the training method includes: adopting audio information of customer service, and combining a voice recognition data set disclosed by a network as a training set and a testing set, wherein the training set comprises original audio data and audio data carrying a label; inputting the training set into an initial voiceprint recognition model to obtain a predicted value, calculating the predicted value and a loss value of a label value by adopting a loss function, correcting parameters of the initial voiceprint recognition model according to the loss value, stopping the correction process when the loss value reaches a preset condition, and selecting the parameter with the smallest loss value as the parameter of the trained voiceprint recognition model; and inputting the test set into the trained voiceprint recognition model, and verifying the recognition accuracy.
The assessment item scoring module 402 is configured to obtain, according to preconfigured assessment items, a score for each corresponding assessment item from the audio information and/or the video information.
Specifically, the assessment item scoring module 402 comprises an assessment item configuration unit, a script matching scoring unit, a customer service emotion scoring unit, a facial expression scoring unit, and a behavior posture scoring unit.
The assessment item configuration unit is configured to preconfigure assessment items for each agent according to the agent's identity information.
It should be understood that in a telephone scenario the agent and the customer can only communicate by voice, whereas in a video or offline scenario the agent's video information can be acquired in addition to the audio. The agent's post can therefore be looked up from the identity information and a scoring scheme configured for that post. For an agent in a telephone scenario, the audio can be analyzed for whether the agent fully resolved the customer's questions, whether the service flow met the process requirements, whether the tone was positive while answering, and so on. For an agent in a video scenario, where complete body movements cannot be captured, the scoring scheme may cover script matching, emotion, and facial expression. For an agent in an offline scenario, where clear facial expressions cannot be captured for reasons of customer privacy but body movements can, the scheme may cover script matching, emotion, and behavior posture. In this embodiment, the assessment items comprise script matching, customer service emotion, facial expression, or behavior posture.
The speech matching scoring unit is configured to convert the customer service audio information and the customer audio information into text information, calculate the similarity between the text information and a pre-stored speech template, and calculate the speech matching score of the customer service according to the similarity and a preset scoring rule.
Specifically, the customer service audio information and the customer audio information are continuous frame sequences. The speech matching scoring unit first inputs them into a speech-to-text model to extract sound feature quantities, matches each extracted feature quantity against the standard sound data in a sound library, and selects the text corresponding to the standard sound data with the highest matching degree as the converted text information.
For the converted customer service text information and customer text information, the customer questions in the customer text information are stored as the first type of data and recorded as keys; the answers to those questions in the customer service text information are stored as the second type of data and recorded as values; and the first type of data and the second type of data are stored in one-to-one correspondence.
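A minimal sketch of this key-value storage, assuming the customer questions and agent answers have already been aligned turn by turn (the alignment itself is outside the sketch):

    def build_qa_store(customer_questions, service_answers):
        """Store customer questions (first type of data) as keys and the
        corresponding customer service answers (second type of data) as values."""
        return dict(zip(customer_questions, service_answers))

    # Hypothetical example:
    qa = build_qa_store(
        ["How do I reset my password?"],
        ["You can reset it from the login page at any time."],
    )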
Further, the step in which the speech matching scoring unit calculates the similarity between the converted text information and the pre-stored speech template includes:
firstly, converting all questions in the speech template into vectors using the word2vec algorithm, converting the stored first type of data into vectors as well, calculating the similarity between the two kinds of vectors using the Bert algorithm, and locating each customer question key to a single question in the speech template;
secondly, converting the answer corresponding to that single question in the speech template into a vector using the word2vec algorithm, converting the customer service answer value corresponding to the customer question key into a vector, and calculating the similarity between the two vectors using the Bert algorithm to obtain the similarity between the text information and the pre-stored speech template.
It should be noted that, to prevent redundant dialogue and an excess of filler words from reducing the matching precision, the speech matching scoring unit may first clean and filter the converted customer service text information and customer text information, screening out the keywords corresponding to the customer question keys and the keywords corresponding to the customer service answer values; correspondingly, the keywords corresponding to each question in the speech template and the keywords corresponding to each answer are also screened out.
In that case, the keywords corresponding to the questions in the speech template and the keywords corresponding to the first type of data are first converted into vectors by the speech matching model, the similarity between the two kinds of vectors is calculated, and each customer question key is located to a single question in the speech template;
secondly, the keywords corresponding to the answer to that single question and the keywords corresponding to the customer service answer value are converted into vectors, and the similarity between the two vectors is calculated to obtain the similarity between the text information and the pre-stored speech template.
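For illustration, the two-step matching might look like the sketch below. As assumptions of the sketch, sentence vectors are built by averaging word2vec embeddings (e.g. a gensim KeyedVectors object) and cosine similarity stands in for the Bert similarity; each template entry is a dict holding keyword lists under "question" and "answer". These are simplifications, not the claimed method.

    import numpy as np

    def sentence_vector(words, w2v):
        """Average the word2vec vectors of the words (a simplified sentence embedding)."""
        vecs = [w2v[w] for w in words if w in w2v]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    def speech_matching_score(question_words, answer_words, template, w2v, full_score=100):
        # Step 1: locate the customer's question among the template questions.
        q_vec = sentence_vector(question_words, w2v)
        best = max(template, key=lambda t: cosine(q_vec, sentence_vector(t["question"], w2v)))
        # Step 2: compare the agent's answer with the template answer for that question.
        a_vec = sentence_vector(answer_words, w2v)
        similarity = cosine(a_vec, sentence_vector(best["answer"], w2v))
        return similarity * full_score   # preset scoring rule: scale similarity to a score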
Further, the speech matching scoring unit includes a second model training subunit and a third model training subunit.
The second model training subunit is configured to train the speech-to-text model. The training method includes: using audio information of the customer service, combined with a publicly available speech data set, as a training set and a testing set, wherein the training set includes original sound data and standard sound data carrying labels; inputting the training set into an initial speech-to-text model to obtain a predicted value, using a loss function to calculate the loss value between the predicted value and the label value, correcting the parameters of the initial speech-to-text model according to the loss value, stopping the correction process when the loss value reaches a preset condition, and selecting the parameters with the smallest loss value as the parameters of the trained speech-to-text model; and inputting the testing set into the trained speech-to-text model to verify the text conversion accuracy.
The third model training subunit is configured to train the speech matching model. The training method includes: using the text information converted from customer service audio, combined with a publicly available customer service script data set, as a training set and a testing set, wherein the training set includes original text data and standard text data carrying labels; inputting the training set into an initial speech matching model to obtain a predicted value, using a loss function to calculate the loss value between the predicted value and the label value, correcting the parameters of the initial speech matching model according to the loss value, stopping the correction process when the loss value reaches a preset condition, and selecting the parameters with the smallest loss value as the parameters of the trained speech matching model; and inputting the testing set into the trained speech matching model to verify the matching accuracy.
The customer service emotion scoring unit is configured to obtain the corresponding emotion classification from the customer service audio information and calculate the customer service emotion score according to the classification result and a preset rule.
Specifically, the customer service emotion scoring unit first filters the audio information of the customer service through a filter at a preset sampling rate to obtain the effective range of the speech spectrum; the long speech data within the effective range is then segmented in the time domain, and a short-time Fourier transform is applied to obtain a spectrogram.
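A minimal preprocessing sketch under stated assumptions: the band-pass cut-off frequencies, segment length, and STFT window below are illustrative placeholders, not values given by this embodiment.

    import numpy as np
    from scipy import signal

    def audio_to_spectrograms(samples, sample_rate=16000, segment_seconds=3.0):
        # Band-pass filter keeping an assumed effective range of the speech spectrum.
        sos = signal.butter(4, [80, 3800], btype="bandpass", fs=sample_rate, output="sos")
        filtered = signal.sosfilt(sos, samples)
        # Time-domain segmentation of the long speech data.
        seg_len = int(segment_seconds * sample_rate)
        segments = [filtered[i:i + seg_len] for i in range(0, len(filtered), seg_len)]
        # Short-time Fourier transform of each segment yields its spectrogram.
        return [np.abs(signal.stft(seg, fs=sample_rate, nperseg=512)[2]) for seg in segments]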
Secondly, the customer service emotion scoring unit inputs the spectrogram into the speech emotion recognition model for feature extraction and calculates the similarity between each extracted feature and the labeled emotion samples; if the similarity between a feature and an emotion sample exceeds a second preset threshold, the feature is assigned to the emotion category of that sample. The percentage of the total speech time occupied by each emotion is then calculated, with negative emotions as subtraction terms and positive emotions as addition terms, and the weighted sum yields the customer service emotion score.
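For illustration, the time-percentage weighting might be computed as follows; the per-emotion weight values are assumptions of the sketch, not values fixed by this embodiment.

    def emotion_score(seconds_per_emotion, weights=None):
        """Weighted sum of each emotion's share of the total speech time;
        negative emotion is a subtraction term, positive an addition term."""
        weights = weights or {"positive": 100.0, "neutral": 60.0, "negative": -50.0}
        total = sum(seconds_per_emotion.values())
        if total == 0:
            return 0.0
        return sum(weights[e] * (t / total) for e, t in seconds_per_emotion.items())

    # Example: 40 s positive, 50 s neutral, 10 s negative speech.
    print(emotion_score({"positive": 40, "neutral": 50, "negative": 10}))  # -> 65.0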
Further, the customer service emotion scoring unit includes a fourth model training subunit configured to train the speech emotion recognition model. The training method includes: using audio information of the customer service, combined with a publicly available Chinese speech emotion data set, as a training set and a testing set, wherein the training set includes original speech data and standard speech data carrying labels; in this embodiment, the standard speech data are manually labeled into three categories, 'positive', 'neutral', and 'negative'. The training set is input into an initial speech emotion recognition model to obtain a predicted value, a CTC loss function is used to calculate the loss value between the predicted value and the label value, the parameters of the initial speech emotion recognition model are corrected according to the loss value, the correction process stops when the loss value reaches a preset condition, and the parameters with the smallest loss value are selected as the parameters of the trained speech emotion recognition model; the testing set is then input into the trained speech emotion recognition model to verify the emotion recognition accuracy.
The facial expression scoring unit is configured to obtain the facial expression classification of the customer service from the video information and calculate the facial expression score according to the classification result and a preset rule.
Specifically, the facial expression scoring unit extracts facial images of the customer service from the video information at preset time intervals, for example once every five seconds, to obtain an image data set, and performs facial expression recognition and classification on the facial images using the expression recognition model. The recognition and classification steps include:
performing face detection on the image data set using the open-source SeetaFace6 face recognition algorithm, identifying the face region in each image and extracting its facial key point features; inputting the facial key point features into the expression recognition model and calculating the distance between each facial key point feature and each labeled facial expression sample; if the distance between a facial key point feature and a facial expression sample is smaller than a third preset threshold, the feature is assigned to the expression category of that sample. Over the whole service session, an initial score can be set, and a preset number of points is deducted each time a negative expression is detected, finally yielding the facial expression score of the customer service.
Further, the facial expression scoring unit includes a fifth model training subunit configured to train the expression recognition model. The training process includes: using facial images of the customer service extracted from the video information, combined with a publicly available facial expression data set, as a training set and a testing set, wherein the training set includes original facial images and facial images carrying labels; in this embodiment, the labeled facial images are manually annotated with expressions such as calm, smiling, and angry. The training set is input into an initial expression recognition model to obtain a predicted value, a cross-entropy loss function is used to calculate the loss value between the predicted value and the label value, the parameters of the initial expression recognition model are corrected according to the loss value, the correction process stops when the loss value reaches a preset condition, and the parameters with the smallest loss value are selected as the parameters of the trained expression recognition model; the testing set is then input into the trained expression recognition model to verify the expression recognition accuracy.
The behavior posture scoring unit is configured to obtain the human body posture classification from the video information and calculate the behavior posture score of the customer service according to the classification result and a preset rule.
Specifically, the behavior posture scoring unit extracts body images of the customer service from the video information at preset time intervals, for example once every five seconds (see the frame-sampling sketch below), to obtain an image data set, and performs behavior posture recognition and classification on the body images using the behavior recognition model. The recognition and classification steps include:
identifying human body key points, such as the trunk, feet, and fingers, in the image data set in real time using the open-source OpenPose human pose estimation algorithm, and extracting the posture key point features; inputting the posture key point features into the behavior recognition model and calculating the distance between each posture key point feature and each labeled posture sample; if the distance between a posture key point feature and a posture sample is smaller than a fourth preset threshold, the feature is assigned to the posture category of that sample. Over the whole service session, an initial score can be set, and a preset number of points is deducted for each negative action detected, finally yielding the behavior posture score of the customer service.
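A minimal sketch of the per-interval frame extraction used by both the facial expression and behavior posture scoring units, assuming OpenCV is available; the five-second interval matches the example above.

    import cv2

    def sample_frames(video_path, interval_seconds=5):
        """Extract one frame every interval_seconds to build the image data set
        fed to the recognition models."""
        cap = cv2.VideoCapture(video_path)
        fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if FPS metadata is missing
        step = max(1, int(fps * interval_seconds))
        frames, index = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if index % step == 0:
                frames.append(frame)
            index += 1
        cap.release()
        return frames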
Further, the behavior posture scoring unit includes a sixth model training subunit configured to train the behavior recognition model. The training process includes: using body images of the customer service extracted from the video information, combined with a publicly available human behavior classification data set, as a training set and a testing set, wherein the training set includes original body images and body images carrying labels; in this embodiment, the labeled body images are manually annotated with actions such as slapping the table and resting the chin on a hand. The training set is input into an initial behavior recognition model to obtain a predicted value, a cross-entropy loss function is used to calculate the loss value between the predicted value and the label value, the parameters of the initial behavior recognition model are corrected according to the loss value, the correction process stops when the loss value reaches a preset condition, and the parameters with the smallest loss value are selected as the parameters of the trained behavior recognition model; the testing set is then input into the trained behavior recognition model to verify the behavior recognition accuracy.
The service quality evaluation module 403 is configured to perform a weighted summation of the scores of the assessment items and take the result as the service quality score of the customer service.
Specifically, the service quality evaluation module 403 assigns assessment items according to the job post and working scene of the customer service agent and sets a corresponding weight for each assessment item; each weight is given based on practical experience and can be adjusted to the user's needs in application. After the assessment items are configured, their scores are weighted and summed, and the result of the weighted summation is taken as the service quality score of the customer service agent.
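For illustration, the weighted summation might be computed as follows; the item names and weight values are placeholders, since the embodiment leaves the weights to practical experience.

    def service_quality_score(item_scores, item_weights):
        """Weighted summation of assessment item scores; the weights are
        experience-based and adjustable."""
        return sum(score * item_weights[item] for item, score in item_scores.items())

    # Example for a video-scene agent (hypothetical weights summing to 1):
    scores = {"speech_matching": 85, "customer_service_emotion": 90, "facial_expression": 95}
    weights = {"speech_matching": 0.5, "customer_service_emotion": 0.3, "facial_expression": 0.2}
    print(service_quality_score(scores, weights))  # -> 88.5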
In this way, based on the audio information and video information of the customer service and the preconfigured assessment items, the service quality is automatically scored on multiple dimensions such as speech matching, customer service emotion, facial expression, and behavior posture, yielding a comprehensive and accurate customer service evaluation result.
Referring to fig. 5, a third embodiment of the present invention relates to a computer apparatus, which includes a memory 501, a processor 502 and a computer program stored in the memory 501 and executable on the processor 502, wherein the processor 502 executes the computer program to implement the following steps:
acquiring audio information and video information of customer service;
according to the preconfigured assessment items, obtaining scores of the corresponding assessment items from the audio information and/or the video information; the assessment items comprise speech matching, customer service emotion, facial expression, or behavior posture; the scores of speech matching and customer service emotion are obtained from the audio information or the audio carried in the video information, and the scores of facial expression and behavior posture are obtained from the video information;
and carrying out weighted summation on the scores of the assessment items, and taking the result of the weighted summation as the service quality score of the customer service.
The memory 501 and the processor 502 are coupled by a bus, which may include any number of interconnected buses and bridges linking the various circuits of the processor 502 and the memory 501. The bus may also connect various other circuits, such as the peripheral device 503, the voltage regulator 504, and power management circuits, which are well known in the art and therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be one element or a plurality of elements, such as multiple receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. Data processed by the processor 502 is transmitted over a wireless medium through an antenna; the antenna also receives incoming data and passes it to the processor 502.
The processor 502 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. While memory 501 may be used to store data used by processor 502 in performing operations.
Those skilled in the art will appreciate that the architecture shown in fig. 5 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
A fourth embodiment of the present invention relates to a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring audio information and video information of customer service;
according to the preconfigured assessment items, obtaining scores of the corresponding assessment items from the audio information and/or the video information; the assessment items comprise speech matching, customer service emotion, facial expression, or behavior posture; the scores of speech matching and customer service emotion are obtained from the audio information or the audio carried in the video information, and the scores of facial expression and behavior posture are obtained from the video information;
and carrying out weighted summation on the scores of the assessment items, and taking the result of the weighted summation as the service quality score of the customer service.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In summary, the method, system, device, and medium for evaluating the quality of customer service in multiple scenes score the service quality of the customer service from aspects such as speech matching, customer service emotion, facial expression, and behavior posture, according to the audio information and video information of the customer service and the preconfigured assessment items, thereby obtaining a comprehensive and accurate customer service evaluation result. The invention thus effectively overcomes various defects in the prior art and has high industrial utilization value.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.

Claims (10)

1. A quality evaluation method of customer service under multiple scenes is characterized by comprising the following steps:
acquiring audio information and/or video information of customer service;
according to the preconfigured assessment items, obtaining scores of the corresponding assessment items from the audio information and/or the video information; the assessment items comprise speech matching, customer service emotion, facial expression or behavior posture; the score of speech matching is obtained based on a speech matching model, the score of customer service emotion is obtained based on a speech emotion recognition model, the score of facial expression is obtained based on an expression recognition model, and the score of behavior posture is obtained based on a behavior recognition model;
and carrying out weighted summation on the scores of the assessment items, and taking the result of the weighted summation as the service quality score of the customer service.
2. The method for evaluating the quality of customer service in multiple scenes according to claim 1, further comprising:
inputting the audio information, or the audio information carried in the video information, into a voiceprint recognition model to extract voiceprint features, calculating the similarity between the voiceprint features and a voiceprint sample of the customer service, identifying audio whose voiceprint similarity is greater than a first preset threshold as customer service audio information, and identifying audio whose voiceprint similarity is smaller than the first preset threshold as customer audio information.
3. The method for evaluating the quality of customer service in multiple scenes according to claim 2, wherein the step of obtaining the score of the corresponding assessment item according to the audio information and/or the video information comprises one or more of the following steps:
converting the customer service audio information and the customer audio information into text information, calculating the similarity between the text information and a pre-stored speech template through a speech matching model, and calculating the speech matching score of the customer service according to the similarity and a preset scoring rule;
inputting the customer service audio information into a speech emotion recognition model to extract features, calculating the similarity between each extracted feature and the labeled emotion samples, assigning features whose similarity is greater than a second preset threshold to the emotion category of the corresponding emotion sample to obtain the emotion classification of the customer service audio information, and calculating the customer service emotion score according to the classification result and a preset rule;
inputting the video information into an expression recognition model to extract facial key point features, calculating the distance between each facial key point feature and each labeled facial expression sample, assigning facial key point features whose distance is smaller than a third preset threshold to the expression category of the corresponding facial expression sample to obtain the facial expression classification of the video information, and calculating the facial expression score of the customer service according to the classification result and a preset rule;
inputting the video information into a behavior recognition model to extract posture key point features, calculating the distance between each posture key point feature and each labeled posture sample, assigning posture key point features whose distance is smaller than a fourth preset threshold to the posture category of the corresponding posture sample to obtain the human body posture classification of the video information, and calculating the behavior posture score of the customer service according to the classification result and a preset rule.
4. The method of claim 3, wherein the step of converting the customer service audio information and the customer audio information into text information comprises:
and inputting the customer service audio information and the customer audio information into a speech-to-text model to extract sound feature quantities, matching each sound feature quantity against the standard sound data in a sound library, and selecting the text corresponding to the standard sound data with the highest matching degree as the converted text information.
5. The method for evaluating the quality of customer service in multiple scenes according to claim 3, wherein the step before the customer service audio information is input into the speech emotion recognition model comprises the following steps:
filtering the audio information of the customer service through a filter at a preset sampling rate to obtain the effective range of the speech spectrum;
and performing time domain segmentation on the long voice data in the effective range, calculating the segmented long voice data by using short-time Fourier transform to obtain a spectrogram, and inputting the spectrogram into a voice emotion recognition model.
6. The method for evaluating the quality of customer service in multiple scenes according to claim 3, wherein the step of calculating, by the speech matching model, the similarity between the text information and the pre-stored speech template comprises:
converting all questions in the speech template into vectors using the word2vec algorithm, and simultaneously converting each customer question in the text information into a vector; calculating the similarity between the two kinds of vectors using the Bert algorithm, and locating each customer question to a single question in the speech template;
converting the answer corresponding to that single question in the speech template into a vector using the word2vec algorithm, and simultaneously converting the customer service answer corresponding to each customer question in the text information into a vector; and calculating the similarity between the two vectors using the Bert algorithm to obtain the similarity between the text information and the speech template.
7. The method for evaluating the quality of customer services under multiple scenes according to claim 1, wherein the assessment items are preconfigured according to the identity information of each customer service.
8. A quality evaluation system of customer service under multiple scenes is characterized by comprising:
the receiving module is used for acquiring audio information and/or video information of the customer service;
the assessment item scoring module is used for obtaining scores of corresponding assessment items from the audio information and/or the video information according to the preconfigured assessment items; the assessment items comprise speech matching, customer service emotion, facial expression or behavior posture; the score of speech matching is obtained based on a speech matching model, the score of customer service emotion is obtained based on a speech emotion recognition model, the score of facial expression is obtained based on an expression recognition model, and the score of behavior posture is obtained based on a behavior recognition model;
and the service quality evaluation module is used for weighting and summing the scores of the assessment items, and taking the result of the weighted summation as the service quality score of the customer service.
9. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein: the processor, when executing the computer program, realizes the steps of the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.