CN112966568A - Video customer service quality analysis method and device - Google Patents

Video customer service quality analysis method and device

Info

Publication number
CN112966568A
CN112966568A (application CN202110174207.0A)
Authority
CN
China
Prior art keywords
video
customer service
audio
voice
recognition
Prior art date
Legal status
Pending
Application number
CN202110174207.0A
Other languages
Chinese (zh)
Inventor
陈艳婷
暨光耀
张浩
吴晓茵
Current Assignee
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110174207.0A
Publication of CN112966568A


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172Classification, e.g. identification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/02Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Business, Economics & Management (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Signal Processing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Primary Health Care (AREA)
  • Marketing (AREA)
  • Human Resources & Organizations (AREA)
  • Economics (AREA)
  • Game Theory and Decision Science (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a video customer service quality analysis method and device, relating to the field of artificial intelligence and also applicable to the field of finance. The method comprises the following steps: performing voice feature recognition on customer service voice using a pre-trained feedback neural network model; performing video feature recognition on the customer service video using a pre-trained face recognition model; and analyzing the video customer service quality according to the voice feature recognition result and the video feature recognition result. By combining audio features with video features through voice recognition and face recognition technology, the method and device can analyze the customer service process and evaluate the service quality of video customer service in real time, thereby improving industry service quality and customer satisfaction scientifically and efficiently.

Description

Video customer service quality analysis method and device
Technical Field
The application relates to the field of artificial intelligence, can also be used in the field of finance, and particularly relates to a video customer service quality analysis method and device.
Background
With the accelerated development of new-generation wireless mobile communication technology and the continued spread of mobile intelligent terminals, electronic service content is becoming increasingly multimedia, and banks and financial institutions are racing to conduct innovative research in order to transform into intelligent banks and provide customers with all-round intelligent services. At present, the transaction volume of electronic channels such as online banking and mobile banking far exceeds that of traditional physical bank outlets. As a further extension of the electronic channels, a remote video bank can integrate the individual service channels and extend the bank's external service hours. Video customer service is not limited by space, can display products visually, can satisfy the regulatory requirements for online transactions, closely simulates the scene of offline business handling, and can give customers a more comprehensive "immersive experience" while attending to their mood.
However, the video customer service process happens in real time, most of the service process lacks effective supervision, and violations occur frequently. Owing to this real-time nature and to the complexity of audio and video data, there is currently no effective method for monitoring violations in video customer service. Existing video customer service quality analysis methods score the service process through customer reply short messages or have the customer select a satisfaction level on an interactive interface; more intelligent evaluation means are scarce, and it is even more difficult to analyze service quality through the facial expressions and similar cues of the two video parties.
Disclosure of Invention
Aiming at the problems in the prior art, the application provides a video customer service quality analysis method and a video customer service quality analysis device, which can analyze the video customer service quality according to customer service voice characteristics and customer service video characteristics.
In order to solve the technical problem, the application provides the following technical scheme:
in a first aspect, the present application provides a method for analyzing quality of service of video customer service, including:
carrying out voice feature recognition on customer service voice by using a feedback neural network model obtained by pre-training;
carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
and analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
Further, the video customer service quality analysis method further includes:
dividing the obtained customer service voice data into a plurality of audio samples according to a set time length;
performing voiceprint recognition on the plurality of audio samples, and extracting a plurality of corresponding audio characteristic data;
and comparing the first audio characteristic data in the plurality of audio characteristic data with pre-stored customer service sound characteristic data to obtain an identity characteristic identification result.
Further, the performing speech feature recognition on the customer service speech by using the feedback neural network model obtained by pre-training includes:
inputting the audio characteristic data into a feedback neural network model obtained by pre-training to obtain an audio emotion classification matrix;
and comparing the audio emotion classification result in the audio emotion classification matrix with a preset threshold value by using a boundary detection algorithm to obtain the voice feature recognition result.
Further, the step of training the feedback neural network model includes:
performing framing and windowing processing on the acquired audio sample data to obtain a voice frame feature vector;
carrying out segmentation processing on the audio sample data to obtain a voice section feature vector;
extracting emotion cognitive features according to the voice frame feature vectors and the voice section feature vectors;
and inputting the emotion cognitive characteristics into an original feedback neural network for training to obtain the feedback neural network model.
Further, the video feature recognition of the customer service video by using the face recognition model obtained by pre-training comprises:
inputting the acquired video data into a face recognition model obtained by pre-training to obtain face recognition characteristics;
carrying out feature enhancement extraction on the face recognition features by using a feature enhancement classifier to obtain a face recognition feature map;
inputting the face recognition characteristic graph into a Faster R-CNN classification network to obtain a video expression classification matrix;
and obtaining the video feature identification result according to the video expression classification matrix.
Further, the analyzing the video customer service quality according to the voice feature recognition result and the video feature recognition result includes:
inputting the audio emotion classification matrix and the video expression classification matrix into a service quality analysis model which is constructed in advance to obtain a plurality of first evaluation scores;
determining a second evaluation score according to each first evaluation score and the corresponding evaluation weight;
and obtaining a video customer service quality analysis result according to the second evaluation score.
Further, the step of pre-building a quality of service analysis model comprises:
determining the weight of each classification in the audio emotion classification matrix and the video expression classification matrix;
and determining a service quality analysis model according to the final influence factor and the weight.
Further, the video customer service quality analysis method further includes:
performing character conversion on the audio sample to obtain character contents corresponding to the audio sample;
and comparing the text content with a preset violation vocabulary to obtain a violation result.
In a second aspect, the present application provides a video customer service quality analysis apparatus, including:
the voice recognition unit is used for carrying out voice feature recognition on the customer service voice by utilizing a feedback neural network model obtained by pre-training;
the video recognition unit is used for carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
and the quality analysis unit is used for analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
Further, the video customer service quality analysis device further comprises:
the audio dividing unit is used for dividing the acquired customer service voice data into a plurality of audio samples according to set duration;
the voiceprint recognition unit is used for carrying out voiceprint recognition on the plurality of audio samples and extracting a plurality of corresponding audio characteristic data;
and the identity recognition unit is used for comparing the first audio characteristic data in the plurality of audio characteristic data with pre-stored customer service sound characteristic data to obtain an identity characteristic recognition result.
Further, the speech recognition unit includes:
the audio emotion matrix generation module is used for inputting the audio characteristic data into a feedback neural network model obtained by pre-training to obtain an audio emotion classification matrix;
and the voice feature recognition module is used for comparing the audio emotion classification result in the audio emotion classification matrix with a preset threshold value by using a boundary detection algorithm to obtain the voice feature recognition result.
Further, the video customer service quality analysis device further comprises:
the frame vector generating unit is used for performing framing and windowing processing on the acquired audio sample data to obtain a voice frame feature vector;
the segment vector generating unit is used for carrying out segmentation processing on the audio sample data to obtain a voice segment feature vector;
the emotion cognition feature extraction unit is used for extracting emotion cognition features according to the voice frame feature vector and the voice section feature vector;
and the neural network model generating unit is used for inputting the emotion cognitive characteristics into an original feedback neural network for training to obtain the feedback neural network model.
Further, the video identification unit includes:
the face feature recognition module is used for inputting the acquired video data into a face recognition model obtained by pre-training to obtain face recognition features;
the face feature enhancement module is used for carrying out feature enhancement extraction on the face recognition features by using a feature enhancement classifier to obtain a face recognition feature map;
the video expression matrix generation module is used for inputting the face recognition characteristic graph into a Faster R-CNN classification network to obtain a video expression classification matrix;
and the video feature identification module is used for obtaining the video feature identification result according to the video expression classification matrix.
Further, the quality analysis unit includes:
the first evaluation score generation module is used for inputting the audio emotion classification matrix and the video expression classification matrix into a service quality analysis model which is constructed in advance to obtain a plurality of first evaluation scores;
the second evaluation score generation module is used for determining a second evaluation score according to each first evaluation score and the evaluation weight corresponding to the first evaluation score;
and the quality analysis result generation module is used for obtaining a video customer service quality analysis result according to the second evaluation score.
Further, the video customer service quality analysis device further comprises:
the classification weight determining unit is used for determining the weight of each classification in the audio emotion classification matrix and the video expression classification matrix;
and the service quality analysis model generation unit is used for determining a service quality analysis model according to the final influence factor and the weight.
Further, the video customer service quality analysis device further comprises:
the character conversion unit is used for performing character conversion on the audio sample to obtain character contents corresponding to the audio sample;
and the violation result generating unit is used for comparing the text content with a preset violation vocabulary to obtain a violation result.
In a third aspect, the present application provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the video customer service quality analysis method when executing the program.
In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when being executed by a processor, carries out the steps of the video customer service quality analysis method.
Aiming at the problems in the prior art, the application provides a video customer service quality analysis method and a video customer service quality analysis device, which can analyze the service process of customer service by combining audio features and video features by utilizing a voice recognition technology and a face recognition technology, thereby evaluating the video customer service quality in real time and further scientifically and efficiently improving the industry service quality and the customer satisfaction.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are only some embodiments of the present invention; those skilled in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a flow chart of a method for analyzing quality of service of video customer service in an embodiment of the present application;
FIG. 2 is a second flowchart of a method for analyzing quality of service of video customer service in an embodiment of the present application;
FIG. 3 is a flow chart of speech feature recognition performed in an embodiment of the present application;
FIG. 4 is a flow chart of training a feedback neural network model according to an embodiment of the present application;
FIG. 5 is a flow chart of video feature recognition performed in an embodiment of the present application;
FIG. 6 is a flowchart illustrating an example of analyzing video quality of service in an embodiment of the present application;
FIG. 7 is a flow chart of pre-constructing a quality of service analysis model in an embodiment of the present application;
FIG. 8 is a third flowchart of a method for analyzing quality of service of video customer service in an embodiment of the present application;
FIG. 9 is a block diagram of an embodiment of an apparatus for analyzing quality of video customer service;
FIG. 10 is a second block diagram of an apparatus for analyzing quality of video customer service in an embodiment of the present application;
FIG. 11 is a block diagram of a speech recognition unit in an embodiment of the present application;
FIG. 12 is a third block diagram of an apparatus for analyzing quality of video customer service in an embodiment of the present application;
FIG. 13 is a block diagram of a video recognition unit in an embodiment of the present application;
FIG. 14 is a block diagram of a quality analysis unit in the embodiment of the present application;
FIG. 15 is a fourth block diagram of an embodiment of an apparatus for analyzing quality of video customer service;
FIG. 16 is a fifth block diagram of an apparatus for analyzing quality of video customer service in the embodiment of the present application;
fig. 17 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the video customer service quality analysis method and apparatus described in the present application may be used in the financial field, and may also be used in any field other than the financial field.
Referring to fig. 1, in order to analyze video customer service quality according to customer service voice characteristics and customer service video characteristics, the present application provides a video customer service quality analysis method, including:
s101: carrying out voice feature recognition on customer service voice by using a feedback neural network model obtained by pre-training;
s102: carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
s103: and analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
It can be understood that the video customer service quality analysis method provided by the application can use the pre-trained feedback neural network model and face recognition model to analyze, respectively, the voice data and the video data generated during the video customer service process, so as to monitor the service quality of the video customer service in real time. When the video customer service agent provides service, the relevant devices on the customer service workbench are powered on; these devices include, but are not limited to, a voice recording device (usually a microphone) and a video recording device (usually a camera). The voice recording device and the video recording device record the voice data and video data of the video customer service in real time and send them to a computer, which analyzes them to evaluate the service quality of the video customer service.
A feedback neural network model is used for voice feature recognition. After being trained in advance, it can extract the voice features of the customer service speech, so that whether the speech matches the customer service sound feature data prestored in the computer can be judged, completing the voice feature recognition. A face recognition model is used for video feature recognition in the embodiment of the application; after being trained in advance, it can extract the facial features from the customer service video, completing the video feature recognition.
From the above description, the video customer service quality analysis method provided by the application can analyze the service process of the customer service by combining the audio features and the video features by using the voice recognition technology and the face recognition technology, so as to evaluate the video customer service quality in real time, and further scientifically and efficiently improve the industry service quality and the customer satisfaction.
Referring to fig. 2, the video customer service quality analysis method provided by the present application further includes:
s201: dividing the obtained customer service voice data into a plurality of audio samples according to a set time length;
It will be appreciated that, in actual practice, customer service agents may provide services to customers using a customer service workbench. The customer service workbench may include a voice acquisition device that acquires the customer service voice data in the call audio. The customer service voice data is divided into a plurality of audio samples according to a set time length; the set time length may be, for example, one segment per minute, although the application is not limited to this. Define the customer service voice data U = {U1, U2, …, Uq}, where U is the set of audio samples and Uq denotes the qth audio sample, with sample number q = 1, 2, …, n (n a natural number).
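As an illustration only, a minimal sketch of this division step, assuming the raw audio is held in a NumPy array; the sample rate and function names are illustrative, not part of the patent:

```python
import numpy as np

def split_voice_data(voice_data: np.ndarray, sample_rate: int,
                     segment_seconds: int = 60) -> list:
    """Divide the customer service voice data U into audio samples
    U1..Uq of a set time length (one segment per minute here)."""
    segment_len = sample_rate * segment_seconds           # samples per segment
    q = int(np.ceil(len(voice_data) / segment_len))       # number of segments
    return [voice_data[i * segment_len:(i + 1) * segment_len] for i in range(q)]

# Example: 5 minutes of 16 kHz audio yields q = 5 audio samples
U = split_voice_data(np.zeros(16000 * 300), sample_rate=16000)
assert len(U) == 5
```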
S202: performing voiceprint recognition on the plurality of audio samples, and extracting a plurality of corresponding audio characteristic data;
It can be understood that, when the plurality of audio samples are input into the voiceprint recognition module for voiceprint recognition, the corresponding audio feature data can be extracted. The voiceprint recognition module is located in the computer that executes the method in the embodiment of the application, and extracts the sound-wave spectrum carrying speech information from each received audio sample to obtain the audio feature data.
S203: and comparing the first audio characteristic data in the plurality of audio characteristic data with pre-stored customer service sound characteristic data to obtain an identity characteristic identification result.
It can be understood that the identity of the customer service agent can be recognized from the first audio feature data alone among the plurality of audio feature data. The first audio feature data is compared with the pre-stored customer service sound feature data to obtain the identity feature recognition result, which may include, but is not limited to, the comparison result and the customer service number. If the identity feature recognition result is abnormal, that is, the verification of the customer service identity fails, a prompt message can be returned, the customer service session terminated, and a violation mark registered.
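A sketch of this comparison, assuming the voiceprint module yields fixed-length feature vectors; cosine similarity and the 0.8 threshold are illustrative assumptions, not taken from the text:

```python
import numpy as np

def verify_identity(first_audio_features: np.ndarray,
                    stored_voiceprint: np.ndarray,
                    threshold: float = 0.8) -> dict:
    """Compare the first audio feature data against the pre-stored
    customer service sound feature data."""
    sim = float(np.dot(first_audio_features, stored_voiceprint)
                / (np.linalg.norm(first_audio_features)
                   * np.linalg.norm(stored_voiceprint)))
    result = {"match": sim >= threshold, "similarity": sim}
    if not result["match"]:
        # Verification failed: return a prompt, terminate the customer
        # service session, and register a violation mark
        result["action"] = "terminate_and_register_violation"
    return result
```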
From the above description, the video customer service quality analysis method provided by the application can identify the identity characteristics of the customer service.
Referring to fig. 3, the speech feature recognition of the customer service speech by using the feedback neural network model obtained by pre-training includes:
s301: inputting the audio characteristic data into a feedback neural network model obtained by pre-training to obtain an audio emotion classification matrix;
it is understood that, in order to classify the emotion expressed in the audio feature data, a feedback neural network model needs to be trained in advance, and the specific training steps are described in S401 to S404. And inputting the audio characteristic data into a feedback neural network model obtained by pre-training to obtain an audio emotion classification matrix.
Specifically, according to the four-dimensional theory of emotion in psychology, the output layer of the feedback neural network model in the embodiment of the present application has four nodes, corresponding to the four dimensions of human emotion: pleasure, tension, excitement, and certainty. The audio feature data is input into the pre-trained feedback neural network model, which outputs the audio emotion classification matrix (S_hap, S_ner, S_exc, S_con).
It should be noted that the feedback neural network model obtained by pre-training in the embodiment of the present application is a feedback neural network based on a cognitive mechanism (CIRNN), comprising an input layer, hidden layers, a memory layer, and an output layer. The input layer receives the speech frame features and the speech segment features respectively; the memory layer is a set of neurons fed back from the hidden layer, used to record the content of the hidden layer at the previous moment. The neuron activation function is the Sigmoid function.
Let t be the current time of the network, f(t) the speech frame features, g(t) the speech segment features in time period t, e(t) the cognitive window features, and x(t) and z(t) the outputs of the two hidden layers. W1 is the weight matrix connecting the input layers f(t) and g(t) to the hidden layer x(t); W2 connects the hidden layer x(t) to the hidden layer z(t); W3 connects the hidden layer z(t) to the output layer y(t); W4 connects the memory layer xc(t) to the hidden layer x(t); and W5 connects the cognitive window feature input layer e(t) to the hidden layer z(t). The hidden layer is then x(t) = f(W1(f(t) + g(t)) + W4·xc(t)), where f is the Sigmoid function:
f(x) = 1 / (1 + e^(−x))
Further, the memory layer is xc(t) = x(t − 1); the hidden layer z(t) is z(t) = f(W2·x(t) + W5·e(t)); and the output layer y(t) is y(t) = f(W3·z(t)).
The output of the output layer is the audio emotion classification matrix at time t, representing the audio emotion classification result at time t.
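A minimal NumPy sketch of one CIRNN time step under the recurrence above; the layer sizes, random initialization, and the assumption that f(t) and g(t) share one dimension are illustrative:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class CIRNNStep:
    """One forward step of the cognition-based feedback network:
    x(t) = f(W1(f(t) + g(t)) + W4*xc(t)),  xc(t) = x(t-1),
    z(t) = f(W2*x(t) + W5*e(t)),           y(t) = f(W3*z(t))."""
    def __init__(self, d_in, d_x, d_z, d_e, d_out=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(d_x, d_in))
        self.W2 = rng.normal(size=(d_z, d_x))
        self.W3 = rng.normal(size=(d_out, d_z))
        self.W4 = rng.normal(size=(d_x, d_x))
        self.W5 = rng.normal(size=(d_z, d_e))
        self.xc = np.zeros(d_x)              # memory layer: x at time t-1

    def forward(self, f_t, g_t, e_t):
        x = sigmoid(self.W1 @ (f_t + g_t) + self.W4 @ self.xc)
        z = sigmoid(self.W2 @ x + self.W5 @ e_t)
        y = sigmoid(self.W3 @ z)             # 4 nodes: (S_hap, S_ner, S_exc, S_con)
        self.xc = x                          # fed back into the next step
        return y
```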
S302: and comparing the audio emotion classification result in the audio emotion classification matrix with a preset threshold value by using a boundary detection algorithm to obtain a voice feature recognition result.
It can be understood that the audio emotion classification matrix is the audio emotion classification result for the audio feature data acquired up to the current time node. The audio emotion classification result is compared with preset thresholds using a boundary detection algorithm to determine whether an abnormality occurs. If the output value for tension or excitement exceeds its preset threshold while pleasure or certainty falls below its preset threshold, a violation can be marked, the violation time and violation type recorded, and the violation count accumulated in a database; when the violation count exceeds a preset violation-count threshold, a monitoring alarm can be triggered to send an alert to the supervising role.
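A sketch of this boundary check; the concrete threshold values, the violation limit, and the alarm hook are assumptions for illustration:

```python
import time

def boundary_check(s_hap, s_ner, s_exc, s_con, violation_log,
                   upper=0.7, lower=0.3, violation_limit=3):
    """Compare the audio emotion classification result with preset
    thresholds; mark a violation when tension or excitement is high
    while pleasure or certainty is low."""
    if (s_ner > upper or s_exc > upper) and (s_hap < lower or s_con < lower):
        violation_log.append({"time": time.time(), "type": "emotion_anomaly"})
        if len(violation_log) > violation_limit:
            print("ALERT: notify the supervising role")  # monitoring alarm hook
    return violation_log
```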
From the above description, the video customer service quality analysis method provided by the present application can perform speech feature recognition on customer service speech by using a feedback neural network model obtained through pre-training.
Referring to fig. 4, the step of training the feedback neural network model includes:
s401: performing framing and windowing processing on the acquired audio sample data to obtain a voice frame feature vector;
It can be understood that the feedback neural network model needs to be trained for the scenario of the present embodiment. Audio sample data is first acquired for training. Because the physical movement of the articulatory organs strongly shapes speech, and speech characteristics are relatively stable over short intervals, the audio sample data is analyzed over short time windows, including framing and windowing, to obtain a time-discrete, amplitude-discrete speech sequence. For example, in one embodiment, a 25 ms Hamming window may be used to frame the audio sample data with a frame shift of 10 ms; then, the short-time features and the corresponding first-order differences of each frame signal are extracted frame by frame. The short-time features can comprise prosodic features, nonlinear features, timbre features, spectral features and the like; these features may be normalized after acquisition.
Let the speech frame feature vector be f(t), where each component fi(t) is an audio feature parameter. Assuming there are m audio feature parameters in sample f(t), each acquired at time t with t = 1, 2, …, n, then f(t) = (f1(t), f2(t), …, fm(t)).
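A sketch of the framing and windowing step with the 25 ms Hamming window and 10 ms frame shift from the example; the subsequent feature extraction is left as a comment:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, sample_rate: int,
                     frame_ms: float = 25.0, shift_ms: float = 10.0) -> np.ndarray:
    """Frame the audio with a Hamming window, yielding a time-discrete,
    amplitude-discrete speech sequence (assumes len(signal) >= one frame)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // shift
    # Each row is one windowed frame; the short-time features f_i(t)
    # (prosodic, timbre, spectral, ...) and their first-order differences
    # would then be extracted frame by frame.
    return np.stack([signal[i * shift:i * shift + frame_len] * window
                     for i in range(n_frames)])
```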
S402: carrying out segmentation processing on the audio sample data to obtain a voice section feature vector;
It is understood that the audio sample data may be segmented with a segment length of 100 frames per segment, after which speech segment features are extracted from the audio sample data. Extracting features per segment means that the statistical features over the segment length remove the dependence on the spoken text without weakening the expression of the important speech prosodic features. By extracting the pitch frequency and the contours of the first three formants of the audio sample data, the speech segment features can be fused with the short-time speech frame features, improving the emotion recognition rate.
Let the speech segment feature vector be g(t), where each component gi(t) is an audio feature parameter. Assuming there are m audio feature parameters in sample g(t), each acquired at time t with t = 1, 2, …, n, then g(t) = (g1(t), g2(t), …, gm(t)).
S403: extracting emotion cognitive features according to the voice frame feature vectors and the voice section feature vectors;
It can be understood that, to reflect more accurately how the dynamic process of human emotional expression affects emotion recognition, the embodiment of the application fits a Gaussian function to simulate the course of human emotional expression and extracts an emotion cognitive window on the basis of the speech segment features: a Gaussian function is applied over the sequence of speech segment features, strengthening the emotion located in the middle segments of the audio sample data and weakening the emotion in the first and last segments, thereby obtaining the emotion cognitive features.
Let the speech segment features be XN, and convolve them with a Gaussian function G(x) to obtain the emotion cognitive features: E = G(wi) × XN, where N is the total number of segments of the audio sample data and wi is the position of the Gaussian function corresponding to the ith speech segment.
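A sketch of the cognitive-window weighting E = G(wi) × XN, assuming the Gaussian is centered on the middle segment; the center and width choices are illustrative:

```python
import numpy as np

def cognitive_window_features(segment_features: np.ndarray,
                              sigma_ratio: float = 0.25) -> np.ndarray:
    """Weight the N speech segment feature vectors (rows) with a Gaussian
    so that middle segments are strengthened and the first and last
    segments are weakened."""
    N = segment_features.shape[0]                 # total number of segments
    w = np.arange(N)                              # segment positions w_i
    center, sigma = (N - 1) / 2.0, max(N * sigma_ratio, 1e-6)
    G = np.exp(-((w - center) ** 2) / (2 * sigma ** 2))   # Gaussian G(w_i)
    return segment_features * G[:, None]          # E = G(w_i) x X_N
```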
S404: and inputting the emotion cognitive characteristics into the original feedback neural network for training to obtain a feedback neural network model.
It can be understood that, following the human brain's cognitive rules for emotion, the brain not only analyzes speech features but also compares them against prior experience models and the probabilistic system of the frontal lobe, improving the orderliness and accuracy of information processing. Therefore, in the embodiment of the present application, model training may be based on an original feedback neural network such as a recurrent neural network (RNN), so that voice features located in different time units of the audio sample data all participate in the model training, yielding a feedback neural network based on a cognitive mechanism (CIRNN) that fuses multi-granularity features as the feedback neural network model of this embodiment. In this way the temporal order of emotion can be highlighted and the influence of speech context on emotion emphasized, while the effect of global features on emotion recognition is retained.
In addition, the classic error back-propagation (BP) algorithm can be used to train the feedback neural network based on the cognitive mechanism. The error between the network output and the training target is used to adjust the weights of the layer preceding the output layer, and the error estimates are then used to update the connection weight matrices of each earlier layer, so that the error correction iterates layer by layer from the output back to the input.
From the above description, the video customer service quality analysis method provided by the present application can complete the training of the feedback neural network model.
Referring to fig. 5, the video feature recognition of the customer service video by using the face recognition model obtained by pre-training includes:
s501: inputting the acquired video data into a face recognition model obtained by pre-training to obtain face recognition characteristics;
It will be appreciated that, in actual practice, customer service agents may provide services using a customer service workbench that includes a video acquisition device, such as a surveillance camera. The surveillance camera can start automatically, enter the recording state, and record continuous frontal video of the customer service agent, one recording per customer service session. The recorded video may be composed into a video stream set V = {V1, V2, …, Vq}. The video data is divided with the same preset time length as the customer service voice data, for example one segment per minute, and is output at a rate of 25 frames per second. Vq is an l × w matrix, where l is the number of rows and w the number of columns of the video data matrix; Vq represents the qth video sample, with sample number q = 1, 2, ….
Each preprocessed frame of the video data is continuously input into the pre-trained face recognition model, which performs real-time face detection on the facial image information. If the number of detected faces N < 1 or N > 1, an abnormality instruction is issued and an alarm operation is executed; if the number of detected faces equals 1, the next step is entered. The face recognition model may be pre-trained using existing techniques, and the application is not limited in this regard.
Preferably, the surveillance camera can be installed on the customer service workbench facing the face of the video customer service agent. If the number of detected faces N < 1 or N > 1, the agent can be reminded by voice or a pop-up window to adjust so that their face is aligned with the camera, after which detection is repeated. If no single clear face is detected after three prompts, the service switch automatically jumps back to the closed state and the agent is reminded to adjust, since exactly one clear face must be in front of the camera.
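A sketch of this face-count gate with the three-prompt retry; the detector and frame source are passed in as stand-ins for any pre-trained face recognition model and camera feed:

```python
def face_count_gate(detect_faces, get_frame, max_prompts: int = 3) -> bool:
    """Require exactly one clear face in front of the camera; after three
    failed prompts, jump the service switch back to the closed state."""
    for _ in range(max_prompts):
        n = len(detect_faces(get_frame()))        # N = number of detected faces
        if n == 1:
            return True                           # proceed to the next step
        print("Please adjust so one clear face is in view")  # voice/pop-up prompt
    return False                                  # service switch closed
```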
S502: carrying out feature enhancement extraction on the face recognition features by using a feature enhancement classifier to obtain a face recognition feature map;
It can be understood that a feature-enhanced classifier based on Gabor features can locate 13 feature points of the customer service agent's facial organs in each frame of video data, including: the inner corners of the left and right eyes, the outer corners of the left and right eyes, the highest points of the left and right eyes, the lowest points of the left and right eyes, the tip of the nose, the leftmost and rightmost points of the mouth corners, and the uppermost and lowermost points where the center line of the lips intersects the lip contour. From these features, segmented eye, eyebrow, and mouth images can be obtained; these images may be called the face recognition feature map.
S503: inputting the face recognition characteristic graph into a Faster R-CNN classification network to obtain a video expression classification matrix;
It is understood that the Faster R-CNN classification network includes an RDN (residual dilated network) and an RPN (region proposal network).
First, the face recognition feature map is fed into the RDN. The preprocessed video image is input to the first convolutional layer (Conv1) for preliminary feature extraction; an image of size 400 × 400 is used here as an example. Conv1 is constructed with 3 × 3 convolution kernels, 64 of them. The feature map output by Conv1 then enters Conv2, whose kernel size matches Conv1 and whose kernel count is set to 128. The feature map output by Conv2 enters Conv3, with kernel size 3 × 3 and 256 kernels. Next, the feature map output by Conv3 enters RDN4, with kernel size 3 × 3 and 512 kernels; dilated convolution is introduced with dilation parameters d = 1 and d = 2. The feature map output by RDN4 then enters RDN5, again with 3 × 3 kernels and 512 of them, also using dilated convolution with dilation parameters d = 2 and d = 4, which preserves the detection of micro-expression features. The RDN uses a deeper residual network structure to improve the algorithm's robustness, and accumulates shallow and deep features so that micro-expressions are preserved through the successive convolutions, producing more accurate output information. The detection performance of the network can thus be greatly improved with essentially no increase over the computation of the original model.
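A PyTorch sketch of the layer configuration described above; the padding choices and the simplified residual accumulation are assumptions made to keep the sketch well-formed:

```python
import torch
import torch.nn as nn

class RDNBackbone(nn.Module):
    """Conv1: 64 3x3 kernels; Conv2: 128; Conv3: 256;
    RDN4: 512 kernels with dilation d=1 and d=2;
    RDN5: 512 kernels with dilation d=2 and d=4."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, 3, padding=1)
        self.conv2 = nn.Conv2d(64, 128, 3, padding=1)
        self.conv3 = nn.Conv2d(128, 256, 3, padding=1)
        self.rdn4a = nn.Conv2d(256, 512, 3, padding=1, dilation=1)
        self.rdn4b = nn.Conv2d(512, 512, 3, padding=2, dilation=2)
        self.rdn5a = nn.Conv2d(512, 512, 3, padding=2, dilation=2)
        self.rdn5b = nn.Conv2d(512, 512, 3, padding=4, dilation=4)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):                     # e.g. x: (1, 3, 400, 400)
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = self.act(self.conv3(x))
        r = self.act(self.rdn4a(x))
        r = r + self.act(self.rdn4b(r))       # residual accumulation keeps
        r = r + self.act(self.rdn5a(r))       # micro-expression detail through
        r = r + self.act(self.rdn5b(r))       # the successive convolutions
        return r
```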
The result of the preliminary feature extraction in the previous step is input into the pre-trained RPN, which after recognition and classification outputs the video expression classification matrix (V_dis, V_hap, V_qui), whose three elements represent the probabilities of three emotions appearing in the video data: dysphoria, pleasure, and calm. Specifically, the RPN performs target detection and precise localization on the preliminary features to obtain candidate boxes; the RoI pooling layer in the RPN applies max pooling to the candidate boxes; and finally a feature set comprising several feature vectors of the same dimensionality is output.
It should be noted that the RPN model is trained as follows. The objective function of the network is the binary cross-entropy function (binary_crossentropy) and the optimization method is Adam, with the learning rate set to 0.001, the exponential decay rate for the mean of the gradient set to 0.9, the exponential decay rate for the uncentered variance of the gradient set to 0.999, and the batch size set to 200. The data is split into training, validation, and test sets in a fixed proportion; after each of multiple training rounds the validation set is tested, and the model with the best result is saved and used to test the test set, whose result is the result of the whole training.
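The training configuration above, expressed as a PyTorch sketch; the rpn_head module is a placeholder, while the hyperparameters follow the text:

```python
import torch
import torch.nn as nn

rpn_head = nn.Linear(512, 1)                  # placeholder for the RPN model

criterion = nn.BCEWithLogitsLoss()            # binary cross-entropy objective
optimizer = torch.optim.Adam(
    rpn_head.parameters(),
    lr=0.001,                                 # Adam learning rate
    betas=(0.9, 0.999),                       # decay rates for the gradient mean
)                                             # and uncentered variance

batch_size = 200
# Split data into training / validation / test sets in a fixed proportion;
# after each training round, evaluate on the validation set and keep the
# best-performing model for the final test-set evaluation.
```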
S504: and obtaining a video characteristic identification result according to the video expression classification matrix.
It can be understood that the video expression classification matrix can represent the probability of appearance of three emotions, namely, dysphoria, pleasure and calmness, in emotion recognition output corresponding to the customer service facial expression, and a facial expression classification result is obtained. Preferably, the facial expression classification results may be further classified into categories of calm, happy, surprised, sad, fear, anger, disgust, and the like.
From the above description, the video customer service quality analysis method provided by the application can perform video feature recognition on the customer service video by using the face recognition model obtained by pre-training.
Referring to fig. 6, analyzing the video service quality according to the voice feature recognition result and the video feature recognition result includes:
s601: inputting the audio emotion classification matrix and the video expression classification matrix into a service quality analysis model which is constructed in advance to obtain a plurality of first evaluation scores;
It can be understood that the audio emotion classification matrix Uq and the video expression classification matrix Vq are input into the pre-constructed service quality analysis model, which calculates a first evaluation score for each specific emotional feature classification. In particular, each first evaluation score characterizes the quality of service embodied by one specific emotional feature classification during the customer service session. These emotional feature classifications include, but are not limited to: the audio happy, audio nervous, audio angry, and audio disgust emotion classifications, and the video happy, video nervous, video angry, and video disgust emotion classifications. Each emotional feature classification thus yields one first evaluation score, giving M first evaluation scores in total.
S602: determining a second evaluation score according to each first evaluation score and the corresponding evaluation weight;
It is understood that the M first evaluation scores may correspond to M evaluation weights respectively, and the second evaluation score may be determined from each first evaluation score and its corresponding evaluation weight. In particular, the second evaluation score characterizes the quality of service of the customer service over a particular time period. In the present embodiment, the second evaluation score is obtained as a weighted sum of all the first evaluation scores:
Os = Σ (i = 1 to M) ki · Owi
where Owi is the ith first evaluation score, ki its evaluation weight, and Os the second evaluation score.
Obviously, the higher the second evaluation score for the service period, the higher the customer service quality; conversely, the lower the score, the worse the service quality.
S603: and obtaining a video customer service quality analysis result according to the second evaluation score.
It can be understood that, through the above process, the quality of service of the customer service can be evaluated. When the second evaluation score falls within the threshold ranges respectively corresponding to "poor", "general", "better", and "very good", the video customer service quality analysis result may respectively correspond to "poor", "general", "better", and "very good".
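A sketch of steps S602–S603; the weighted sum follows the formula above, while the threshold bands for the rating labels are assumptions (the text does not fix them):

```python
def second_evaluation_score(first_scores, weights):
    """Os = sum over i of k_i * O_wi for the M first evaluation scores."""
    return sum(k * o for k, o in zip(weights, first_scores))

def quality_label(score, bands=((0.25, "poor"), (0.5, "general"),
                                (0.75, "better"), (1.01, "very good"))):
    """Map the second evaluation score into its threshold range."""
    for upper, label in bands:
        if score < upper:
            return label
    return bands[-1][1]

Os = second_evaluation_score([0.6, -0.15, 0.3], [0.5, 0.2, 0.3])
print(quality_label(Os))
```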
From the above description, the video customer service quality analysis method provided by the present application can analyze the video customer service quality according to the voice feature recognition result and the video feature recognition result.
Referring to fig. 7, the step of pre-building a quality of service analysis model includes:
s701: determining the weight of each classification in the audio emotion classification matrix and the video expression classification matrix;
s702: and determining a service quality analysis model according to the final influence factor and the weight.
It can be understood that the weight of each category in the audio emotion classification matrix and the video expression classification matrix can be determined according to the importance degree of each category in the audio emotion classification matrix and the video expression classification matrix on the video customer service quality analysis result.
Each first evaluation score is calculated by a formula of the following form:
Owi = Wi · xA · K^(TA − TD) · (1 / XA)
wherein Owi represents the ith first evaluation score, with 1 ≤ i ≤ M, i a positive integer running from 1 to M. Wi denotes the weight of the type to which the ith specific emotion feature belongs; for example, the audio happy emotion classification feature weight may be set to 0.6, the audio nervous emotion classification feature weight to −0.15, the video expression happy weight to 0.3, the video expression angry weight to −0.3, and the video expression disgust weight to −0.2. xA = 1 denotes that the ith specific feature action occurred, and xA = 0 that it did not occur.
TA denotes the time at which the specific characteristic behavior occurs, and TD represents a dynamic reference time smaller than TA, set by a person skilled in the art on a case-by-case basis, for example as the start time of each customer service connection. K is a time weighting factor with K > 1, set for example to any value from 1.5 to 3.
XA represents the number of occurrences of the specific emotional characteristic behavior during the reference time period before TA; the reference time period is, for example, 1 minute, and the present application is not limited thereto. The formula includes the final influence factor 1/XA, so that the larger XA is, the smaller Owi becomes, and the smaller XA is, the larger Owi becomes. This reduces the influence of rigid, unchanging emotional behavior on the evaluation during the customer service session and encourages the agent to respond to customer demands promptly and adapt in time.
From the above formula, Wi characterizes the influence of the different emotional states, represented by the different types of specific emotional feature classifications, on the customer service quality. Since K > 1, for the same type of emotional feature, the more distant the occurrence time is from the dynamic reference time, the greater its impact on the service quality evaluation. Furthermore, the final influence factor reduces the share of historical behavior in the evaluation.
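A sketch of the first-score computation under the form given above; the exponential time factor is an assumption consistent with the stated behavior, and the inputs are illustrative:

```python
def first_evaluation_score(W_i: float, x_A: int, T_A: float, T_D: float,
                           X_A: int, K: float = 2.0) -> float:
    """O_wi = W_i * x_A * K**(T_A - T_D) / X_A: type weight, occurrence
    indicator, time factor (K > 1), and final influence factor 1/X_A."""
    if x_A == 0 or X_A == 0:
        return 0.0
    return W_i * x_A * (K ** (T_A - T_D)) / X_A

# Example weights from the text: audio happy 0.6, audio nervous -0.15,
# video happy 0.3, video angry -0.3, video disgust -0.2
score = first_evaluation_score(W_i=0.6, x_A=1, T_A=3.0, T_D=0.0, X_A=2)
```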
From the above description, the video customer service quality analysis method provided by the application can construct a service quality analysis model.
Referring to fig. 8, the video customer service quality analysis method provided by the present application further includes:
s801: performing character conversion on the audio sample to obtain character contents corresponding to the audio sample;
s802: and comparing the text content with a preset violation vocabulary to obtain a violation result.
It is understood that the audio samples collected during the service period may be converted into text content, and the text content matched against the violation words in the violation vocabulary library; the violation words may include "complaint", "dissatisfied", and the like. If the matching succeeds, a violation is marked and the violation time and violation type are recorded.
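A sketch of this step; the speech-to-text call is a stand-in for any ASR engine, and the English vocabulary entries are illustrative:

```python
import time

VIOLATION_VOCABULARY = {"complaint", "dissatisfied"}   # preset violation words

def check_violations(audio_sample, speech_to_text) -> list:
    """Convert an audio sample to text and match it against the preset
    violation vocabulary; record time and type for each hit."""
    text = speech_to_text(audio_sample)                 # character conversion
    hits = [w for w in VIOLATION_VOCABULARY if w in text]
    return [{"time": time.time(), "type": "vocabulary", "word": w}
            for w in hits]
```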
From the above description, the video customer service quality analysis method provided by the application can find the illegal vocabulary appearing in the audio sample and mark the illegal behavior.
Based on the same inventive concept, the embodiment of the present application further provides a video customer service quality analysis apparatus, which can be used to implement the method described in the foregoing embodiments, as described below. Because the principle by which the video customer service quality analysis apparatus solves the problem is similar to that of the video customer service quality analysis method, the implementation of the apparatus can refer to the implementation of the method, and repeated parts are not restated. As used hereinafter, the term "unit" or "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware, are also possible and contemplated.
Referring to fig. 9, in order to analyze video customer service quality according to customer service voice characteristics and customer service video characteristics, the present application provides a video customer service quality analysis apparatus, including:
a speech recognition unit 901, configured to perform speech feature recognition on the customer service speech by using a feedback neural network model obtained through pre-training;
the video identification unit 902 is used for performing video feature identification on the customer service video by using a face identification model obtained by pre-training;
and the quality analysis unit 903 is used for analyzing the video customer service quality according to the voice feature recognition result and the video feature recognition result.
Referring to fig. 10, the video customer service quality analysis apparatus further includes:
the audio dividing unit 1001 is configured to divide the acquired customer service voice data into a plurality of audio samples according to a set duration;
a voiceprint recognition unit 1002, configured to perform voiceprint recognition on the multiple audio samples, and extract multiple corresponding audio feature data;
the identity recognition unit 1003 is configured to compare a first audio feature data of the multiple audio feature data with pre-stored customer service sound feature data to obtain an identity feature recognition result.
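A minimal sketch of this audio division and identity comparison is given below. The fixed segment duration, the cosine-similarity comparison, the threshold value, and the random vectors standing in for real voiceprint embeddings are all illustrative assumptions; the application does not prescribe a particular voiceprint model.

```python
import numpy as np

def split_audio(samples, sample_rate, segment_seconds=5.0):
    """Divide raw customer service audio into fixed-duration samples."""
    seg_len = int(sample_rate * segment_seconds)
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify_identity(probe_embedding, enrolled_embedding, threshold=0.75):
    """Compare the first audio sample's voiceprint embedding against the
    pre-stored customer service voiceprint feature data."""
    return cosine_similarity(probe_embedding, enrolled_embedding) >= threshold

# Example with random stand-ins for real voiceprint embeddings.
rng = np.random.default_rng(0)
audio = rng.standard_normal(16000 * 12)           # 12 s of audio at 16 kHz
segments = split_audio(audio, sample_rate=16000)  # 5 s (or shorter) samples
enrolled = rng.standard_normal(128)               # stored agent voiceprint
probe = enrolled + 0.05 * rng.standard_normal(128)
print(len(segments), verify_identity(probe, enrolled))  # -> 3 True
```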
Referring to fig. 11, the speech recognition unit 901 includes:
an audio emotion matrix generation module 1101, configured to input the audio feature data into a pre-trained feedback neural network model to obtain an audio emotion classification matrix;
and the voice feature recognition module 1102 is configured to compare the audio emotion classification result in the audio emotion classification matrix with a preset threshold value by using a boundary detection algorithm, so as to obtain the voice feature recognition result.
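By way of illustration, the following sketch compares an audio emotion classification matrix against a preset threshold, which is one way to read the boundary-detection step; the class labels, the threshold value, and the per-row layout of the matrix are assumptions.

```python
import numpy as np

def detect_emotion_boundaries(emotion_matrix, labels, threshold=0.6):
    """Boundary-detection sketch: each row of the audio emotion
    classification matrix holds per-class scores for one audio sample;
    classes whose score reaches the preset threshold are reported."""
    results = []
    for t, row in enumerate(emotion_matrix):
        detected = [labels[j] for j, score in enumerate(row) if score >= threshold]
        results.append((t, detected))
    return results

labels = ["calm", "pleased", "impatient", "angry"]  # illustrative classes
matrix = np.array([[0.70, 0.20, 0.05, 0.05],        # sample 0 -> calm
                   [0.10, 0.10, 0.65, 0.15]])       # sample 1 -> impatient
print(detect_emotion_boundaries(matrix, labels))
# -> [(0, ['calm']), (1, ['impatient'])]
```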
Referring to fig. 12, the video customer service quality analysis apparatus further includes:
a frame vector generating unit 1201, configured to perform framing and windowing processing on the acquired audio sample data to obtain speech frame feature vectors;
a segment vector generating unit 1202, configured to perform segmentation processing on the audio sample data to obtain speech segment feature vectors;
an emotion cognitive feature extraction unit 1203, configured to extract emotion cognitive features according to the speech frame feature vectors and the speech segment feature vectors;
and a neural network model generating unit 1204, configured to input the emotion cognitive features into an original feedback neural network for training, so as to obtain the feedback neural network model.
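As a sketch of the framing, windowing, and segmentation steps, the following assumes 25 ms frames with a 10 ms hop, a Hamming window, and mean pooling of frames into segment vectors; these concrete choices are illustrative, as the application does not fix them.

```python
import numpy as np

def frame_and_window(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Split an audio sample into overlapping frames and apply a Hamming
    window to each, yielding one speech frame feature vector per frame."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hamming(frame_len)
    frames = [signal[i:i + frame_len] * window
              for i in range(0, len(signal) - frame_len + 1, hop_len)]
    return np.array(frames)

def segment_vectors(frames, frames_per_segment=100):
    """Pool consecutive speech frame vectors into speech segment feature
    vectors; mean pooling is an illustrative choice."""
    return np.array([frames[i:i + frames_per_segment].mean(axis=0)
                     for i in range(0, len(frames), frames_per_segment)])

rng = np.random.default_rng(1)
audio = rng.standard_normal(16000 * 3)   # 3 s of placeholder audio at 16 kHz
frames = frame_and_window(audio, 16000)  # per-frame feature vectors
segments = segment_vectors(frames)       # per-segment feature vectors
print(frames.shape, segments.shape)      # -> (298, 400) (3, 400)
```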
Referring to fig. 13, the video recognition unit 902 includes:
the face feature recognition module 1301 is configured to input the acquired video data into a face recognition model obtained through pre-training to obtain face recognition features;
a face feature enhancement module 1302, configured to perform feature enhancement extraction on the face recognition features by using a feature enhancement classifier, so as to obtain a face recognition feature map;
the video expression matrix generation module 1303 is used for inputting the face recognition feature map into a Faster R-CNN classification network to obtain a video expression classification matrix;
and the video feature identification module 1304 is configured to obtain the video feature identification result according to the video expression classification matrix.
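The following sketch mirrors the shape of this pipeline with stand-in functions; the real face recognition model, feature enhancement classifier, and Faster R-CNN classification network are replaced by placeholders, and the expression classes are assumed for illustration.

```python
import numpy as np

EXPRESSIONS = ["neutral", "smiling", "frowning"]  # illustrative classes

def recognize_faces(frame):
    """Stand-in for the pre-trained face recognition model: returns a
    face recognition feature map for one video frame (random here)."""
    return np.random.default_rng(0).standard_normal((7, 7, 64))

def enhance_features(feature_map):
    """Stand-in for the feature enhancement classifier; a plain ReLU is
    used only as a placeholder for the real enhancement extraction."""
    return np.maximum(feature_map, 0.0)

def classify_expression(feature_map):
    """Stand-in for the Faster R-CNN classification network: maps the
    enhanced feature map to one row of the video expression matrix."""
    logits = feature_map.mean(axis=(0, 1))[: len(EXPRESSIONS)]
    scores = np.exp(logits) / np.exp(logits).sum()  # softmax over classes
    return scores

frame = np.zeros((224, 224, 3))  # one placeholder video frame
row = classify_expression(enhance_features(recognize_faces(frame)))
print(dict(zip(EXPRESSIONS, np.round(row, 3))))
```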
Referring to fig. 14, the quality analysis unit 903 includes:
a first evaluation score generation module 1401, configured to input the audio emotion classification matrix and the video expression classification matrix into a service quality analysis model that is constructed in advance, so as to obtain a plurality of first evaluation scores;
a second evaluation score generation module 1402, configured to determine a second evaluation score according to each first evaluation score and the evaluation weight corresponding to the first evaluation score;
a quality analysis result generating module 1403, configured to obtain a video customer service quality analysis result according to the second evaluation score.
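A minimal sketch of the score aggregation follows; a weighted average is assumed for combining the first evaluation scores into the second evaluation score, and the example scores and weights are illustrative.

```python
def second_evaluation_score(first_scores, weights):
    """Combine the first evaluation scores output by the service quality
    analysis model into one second evaluation score; a weighted average
    over the corresponding evaluation weights is assumed here."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(first_scores, weights)) / total_weight

# Example: illustrative first evaluation scores and evaluation weights.
first_scores = [82.0, 90.0, 75.0]
weights = [0.5, 0.3, 0.2]
print(second_evaluation_score(first_scores, weights))  # -> 83.0
```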
Referring to fig. 15, the video customer service quality analysis apparatus further includes:
a classification weight determining unit 1501, configured to determine weights of each classification in the audio emotion classification matrix and the video expression classification matrix;
a service quality analysis model generating unit 1502, configured to determine a service quality analysis model according to the final impact factor and the weight.
Referring to fig. 16, the video customer service quality analysis apparatus further includes:
a text conversion unit 1601, configured to perform text conversion on the audio sample to obtain text content corresponding to the audio sample;
and a violation result generating unit 1602, configured to compare the text content with a preset violation vocabulary, and obtain a violation result.
In order to analyze video customer service quality according to customer service voice features and customer service video features at the hardware level, the present application provides an embodiment of an electronic device that implements all or part of the video customer service quality analysis method. The electronic device specifically includes the following:
a processor (Processor), a memory (Memory), a communication interface (Communications Interface) and a bus; the processor, the memory, and the communication interface communicate with one another through the bus; the communication interface is used to implement information transmission between the video customer service quality analysis apparatus and related equipment such as a core service system, a user terminal, and a related database. The electronic device may be a desktop computer, a tablet computer, a mobile terminal, or the like, but the embodiment is not limited thereto. In this embodiment, the electronic device may be implemented with reference to the embodiments of the video customer service quality analysis method and of the video customer service quality analysis apparatus, the contents of which are incorporated herein; repeated descriptions are omitted.
It is understood that the user terminal may include a smart phone, a tablet electronic device, a network set-top box, a portable computer, a desktop computer, a Personal Digital Assistant (PDA), an in-vehicle device, a smart wearable device, and the like. The smart wearable device may include smart glasses, a smart watch, a smart bracelet, and the like.
In practical applications, part of the video customer service quality analysis method may be executed on the electronic device side as described above, or all operations may be completed in the client device. The choice may be made according to the processing capability of the client device, restrictions of the user's usage scenario, and the like; the present application does not limit this. If all operations are completed in the client device, the client device may further include a processor.
The client device may have a communication module (i.e., a communication unit), and may be in communication connection with a remote server to implement data transmission with the server. The server may include a server on the side of the task scheduling center, and in other implementation scenarios, the server may also include a server on an intermediate platform, for example, a server on a third-party server platform that is communicatively linked to the task scheduling center server. The server may include a single computer device, or may include a server cluster formed by a plurality of servers, or a server structure of a distributed apparatus.
Fig. 17 is a schematic block diagram of a system configuration of an electronic device 9600 according to an embodiment of the present application. As shown in fig. 17, the electronic device 9600 can include a central processor 9100 and a memory 9140; the memory 9140 is coupled to the central processor 9100. Notably, this fig. 17 is exemplary; other types of structures may also be used in addition to or in place of the structure to implement telecommunications or other functions.
In one embodiment, the video customer service quality analysis method function may be integrated into the central processor 9100. The central processor 9100 may be configured to control as follows:
s101: carrying out voice feature recognition on customer service voice by using a feedback neural network model obtained by pre-training;
s102: carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
s103: and analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
From the above description, the video customer service quality analysis method provided by the application can analyze the service process of the customer service by combining the audio features and the video features by using the voice recognition technology and the face recognition technology, so as to evaluate the video customer service quality in real time, and further scientifically and efficiently improve the industry service quality and the customer satisfaction.
In another embodiment, the video customer service quality analysis apparatus may be configured separately from the central processor 9100. For example, it may be configured as a chip connected to the central processor 9100, with the functions of the video customer service quality analysis method realized under the control of the central processor.
As shown in fig. 17, the electronic device 9600 may further include: a communication module 9110, an input unit 9120, an audio processor 9130, a display 9160, and a power supply 9170. It is noted that the electronic device 9600 also does not necessarily include all of the components shown in fig. 17; in addition, the electronic device 9600 may further include components not shown in fig. 17, which can be referred to in the related art.
As shown in fig. 17, the central processor 9100, sometimes referred to as a controller or operational control, can include a microprocessor or other processor device and/or logic device; the central processor 9100 receives input and controls the operation of the various components of the electronic device 9600.
The memory 9140 can be, for example, one or more of a buffer, a flash memory, a hard drive, a removable medium, a volatile memory, a non-volatile memory, or another suitable device. It may store relevant information as well as the programs that process it, and the central processor 9100 can execute the programs stored in the memory 9140 to realize information storage or processing.
The input unit 9120 provides input to the central processor 9100. The input unit 9120 is, for example, a key or a touch input device. Power supply 9170 is used to provide power to electronic device 9600. The display 9160 is used for displaying display objects such as images and characters. The display may be, for example, an LCD display, but is not limited thereto.
The memory 9140 can be a solid-state memory, e.g., read-only memory (ROM), random access memory (RAM), a SIM card, or the like. It may also be a memory that retains information even when power is off, that can be selectively erased and provided with more data, an example of which is sometimes called an EPROM or the like. The memory 9140 could also be some other type of device. The memory 9140 includes a buffer memory 9141 (sometimes referred to as a buffer) and may include an application/function storage portion 9142, which is used to store application programs and function programs, or procedures executed by the central processor 9100 for operating the electronic device 9600.
The memory 9140 can also include a data store 9143, the data store 9143 being used to store data, such as contacts, digital data, pictures, sounds, and/or any other data used by an electronic device. The driver storage portion 9144 of the memory 9140 may include various drivers for the electronic device for communication functions and/or for performing other functions of the electronic device (e.g., messaging applications, contact book applications, etc.).
The communication module 9110 is a transmitter/receiver 9110 that transmits and receives signals via an antenna 9111. The communication module (transmitter/receiver) 9110 is coupled to the central processor 9100 to provide input signals and receive output signals, which may be the same as in the case of a conventional mobile communication terminal.
Based on different communication technologies, a plurality of communication modules 9110, such as a cellular network module, a bluetooth module, and/or a wireless lan module, may be disposed in the same electronic device. The communication module (transmitter/receiver) 9110 is also coupled to a speaker 9131 and a microphone 9132 via an audio processor 9130 to provide audio output via the speaker 9131 and receive audio input from the microphone 9132, thereby implementing ordinary telecommunications functions. The audio processor 9130 may include any suitable buffers, decoders, amplifiers and so forth. In addition, the audio processor 9130 is also coupled to the central processor 9100, thereby enabling recording locally through the microphone 9132 and enabling locally stored sounds to be played through the speaker 9131.
An embodiment of the present application further provides a computer-readable storage medium capable of implementing all the steps of the video customer service quality analysis method of the foregoing embodiments whose execution subject is the server or the client. The computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of that method; for example, when the processor executes the computer program, the following steps are implemented:
s101: carrying out voice feature recognition on customer service voice by using a feedback neural network model obtained by pre-training;
s102: carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
s103: and analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
From the above description, the video customer service quality analysis method provided by the application can analyze the service process of the customer service by combining the audio features and the video features by using the voice recognition technology and the face recognition technology, so as to evaluate the video customer service quality in real time, and further scientifically and efficiently improve the industry service quality and the customer satisfaction.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The principle and the implementation mode of the invention are explained by applying specific embodiments in the invention, and the description of the embodiments is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (14)

1. A video customer service quality analysis method is characterized by comprising the following steps:
carrying out voice feature recognition on customer service voice by using a feedback neural network model obtained by pre-training;
carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
and analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
2. The video customer service quality analysis method according to claim 1, further comprising:
dividing the obtained customer service voice data into a plurality of audio samples according to a set time length;
performing voiceprint recognition on the plurality of audio samples, and extracting a plurality of corresponding audio characteristic data;
and comparing the first audio characteristic data in the plurality of audio characteristic data with pre-stored customer service sound characteristic data to obtain an identity characteristic identification result.
3. The method of claim 2, wherein the performing speech feature recognition on the customer service speech by using the pre-trained feedback neural network model comprises:
inputting the audio characteristic data into a feedback neural network model obtained by pre-training to obtain an audio emotion classification matrix;
and comparing the audio emotion classification result in the audio emotion classification matrix with a preset threshold value by using a boundary detection algorithm to obtain the voice feature recognition result.
4. The method of claim 3, wherein the step of training the feedback neural network model comprises:
performing framing and windowing processing on the acquired audio sample data to obtain speech frame feature vectors;
performing segmentation processing on the audio sample data to obtain speech segment feature vectors;
extracting emotion cognitive features according to the speech frame feature vectors and the speech segment feature vectors;
and inputting the emotion cognitive features into an original feedback neural network for training to obtain the feedback neural network model.
5. The method for analyzing the quality of the video customer service according to claim 1, wherein the performing the video feature recognition on the customer service video by using the face recognition model obtained by the pre-training comprises:
inputting the acquired video data into a face recognition model obtained by pre-training to obtain face recognition characteristics;
carrying out feature enhancement extraction on the face recognition features by using a feature enhancement classifier to obtain a face recognition feature map;
inputting the face recognition characteristic graph into a Faster R-CNN classification network to obtain a video expression classification matrix;
and obtaining the video feature identification result according to the video expression classification matrix.
6. The method of claim 5, wherein analyzing the video quality of service based on the speech feature recognition result and the video feature recognition result comprises:
inputting the audio emotion classification matrix and the video expression classification matrix into a service quality analysis model which is constructed in advance to obtain a plurality of first evaluation scores;
determining a second evaluation score according to each first evaluation score and the corresponding evaluation weight;
and obtaining a video customer service quality analysis result according to the second evaluation score.
7. The method of claim 6, wherein the step of pre-building a quality of service analysis model comprises:
determining the weight of each classification in the audio emotion classification matrix and the video expression classification matrix;
and determining a service quality analysis model according to the final influence factor and the weight.
8. The video customer service quality analysis method according to claim 2, further comprising:
performing character conversion on the audio sample to obtain character contents corresponding to the audio sample;
and comparing the text content with a preset violation vocabulary to obtain a violation result.
9. A video customer service quality analysis device, comprising:
the voice recognition unit is used for carrying out voice feature recognition on the customer service voice by utilizing a feedback neural network model obtained by pre-training;
the video recognition unit is used for carrying out video feature recognition on the customer service video by using a face recognition model obtained by pre-training;
and the quality analysis unit is used for analyzing the video customer service quality according to the voice characteristic recognition result and the video characteristic recognition result.
10. The video customer service quality analysis device of claim 9, further comprising:
the audio dividing unit is used for dividing the acquired customer service voice data into a plurality of audio samples according to set duration;
the voiceprint recognition unit is used for carrying out voiceprint recognition on the plurality of audio samples and extracting a plurality of corresponding audio characteristic data;
and the identity recognition unit is used for comparing the first audio characteristic data in the plurality of audio characteristic data with pre-stored customer service sound characteristic data to obtain an identity characteristic recognition result.
11. The video customer service quality analysis device of claim 10, further comprising:
the frame vector generating unit is used for performing framing and windowing processing on the acquired audio sample data to obtain speech frame feature vectors;
the segment vector generating unit is used for performing segmentation processing on the audio sample data to obtain speech segment feature vectors;
the emotion cognitive feature extraction unit is used for extracting emotion cognitive features according to the speech frame feature vectors and the speech segment feature vectors;
and the neural network model generating unit is used for inputting the emotion cognitive features into an original feedback neural network for training to obtain the feedback neural network model.
12. The video customer service quality analysis device of claim 11, further comprising:
the classification weight determining unit is used for determining the weight of each classification in the audio emotion classification matrix and the video expression classification matrix;
and the service quality analysis model generation unit is used for determining a service quality analysis model according to the final influence factor and the weight.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the video customer service quality analysis method according to any one of claims 1 to 8 are implemented when the processor executes the program.
14. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the video service quality analysis method according to any one of claims 1 to 8.
CN202110174207.0A 2021-02-09 2021-02-09 Video customer service quality analysis method and device Pending CN112966568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110174207.0A CN112966568A (en) 2021-02-09 2021-02-09 Video customer service quality analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110174207.0A CN112966568A (en) 2021-02-09 2021-02-09 Video customer service quality analysis method and device

Publications (1)

Publication Number Publication Date
CN112966568A true CN112966568A (en) 2021-06-15

Family

ID=76284273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110174207.0A Pending CN112966568A (en) 2021-02-09 2021-02-09 Video customer service quality analysis method and device

Country Status (1)

Country Link
CN (1) CN112966568A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113869773A (en) * 2021-10-13 2021-12-31 北京卓思天成数据咨询股份有限公司 Method and device for measuring satisfaction degree of hidden passenger
CN114554015A (en) * 2022-02-25 2022-05-27 马上消费金融股份有限公司 Call center system and communication establishing method
CN114677650A (en) * 2022-05-25 2022-06-28 武汉卓鹰世纪科技有限公司 Intelligent analysis method and device for pedestrian illegal behaviors of subway passengers
CN116614431A (en) * 2023-07-19 2023-08-18 中国电信股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN116614431B (en) * 2023-07-19 2023-10-03 中国电信股份有限公司 Data processing method, device, electronic equipment and computer readable storage medium
CN117726231A (en) * 2023-12-20 2024-03-19 万物信通(广州)通信信息技术有限公司 Video customer service quality analysis method

Similar Documents

Publication Publication Date Title
CN112966568A (en) Video customer service quality analysis method and device
CN108305641B (en) Method and device for determining emotion information
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN109767765A (en) Talk about art matching process and device, storage medium, computer equipment
CN112465935A (en) Virtual image synthesis method and device, electronic equipment and storage medium
Seng et al. Video analytics for customer emotion and satisfaction at contact centers
CN110556130A (en) Voice emotion recognition method and device and storage medium
CN110610534B (en) Automatic mouth shape animation generation method based on Actor-Critic algorithm
CN109658923A (en) Voice quality detecting method, equipment, storage medium and device based on artificial intelligence
CN111081280B (en) Text-independent speech emotion recognition method and device and emotion recognition algorithm model generation method
US10755704B2 (en) Information processing apparatus
CN109119069B (en) Specific crowd identification method, electronic device and computer readable storage medium
CN114372701A (en) Method and device for evaluating customer service quality, storage medium and equipment
CN113268994B (en) Intention identification method and device based on capsule network
CN112233698A (en) Character emotion recognition method and device, terminal device and storage medium
CN112749869A (en) Adaptive job vacancy matching system and method
CN112837669B (en) Speech synthesis method, device and server
CN115293132B (en) Dialog of virtual scenes a treatment method device, electronic apparatus, and storage medium
CN114218488A (en) Information recommendation method and device based on multi-modal feature fusion and processor
Bhosale et al. Deep encoded linguistic and acoustic cues for attention based end to end speech emotion recognition
CN111344717A (en) Interactive behavior prediction method, intelligent device and computer-readable storage medium
CN109961152B (en) Personalized interaction method and system of virtual idol, terminal equipment and storage medium
CN114429767A (en) Video generation method and device, electronic equipment and storage medium
CN107545898B (en) Processing method and device for distinguishing speaker voice

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination