CN114218427A - Voice quality inspection analysis method, device, equipment and medium

Info

Publication number: CN114218427A
Application number: CN202111518554.7A
Authority: CN
Inventors: 刘攀伟
Original assignee: Ping An Bank Co Ltd
Current assignee: Ping An Bank Co Ltd
Priority date: 2021-12-13
Filing date: 2021-12-13
Publication date: 2022-03-22
Anticipated expiration: 2041-12-13
Also published as: CN114218427B

Abstract

The invention relates to the technical field of voice processing and discloses a voice quality inspection analysis method, device, equipment and medium. The method comprises: acquiring reservation client data and a business service list; predicting a customer service identifier; establishing a connection with a server through a Native connection method, sending the reservation client data to the server and adding the reservation client data to the customer service list; when an audio stream file is received from the server, performing audio coding conversion with an advanced audio coding algorithm to obtain an audio file, and causing the server to notify a chest card to clear the space; performing audio quality inspection on the reservation client data and the audio file through a quality inspection detection model to obtain a quality inspection result; and inputting the quality inspection result and each historical quality inspection result into a quality inspection clustering model and performing graph clustering analysis to obtain a quality inspection analysis result. The invention thereby improves the accuracy of the quality inspection result and automatically analyzes the quality inspection analysis result for deficient business items.

Description

Voice quality inspection analysis method, device, equipment and medium

Technical Field

The invention relates to the technical field of semantic analysis in artificial intelligence, and in particular to a voice quality inspection analysis method, device, equipment and medium.

Background

At present, staff in bank halls mostly serve clients without any effective means of supervision. Client experience suffers when a staff member's wording, tone or explanation of professional business knowledge is inadequate, and neither the shortcomings of on-site service capability nor the direction in which to improve it can be discovered from conversations with clients. The quality of on-site customer service therefore needs to be inspected so that financial institutions such as banks can adjust their service quality and develop the related business. Existing quality inspection generally relies on reports written by the staff themselves or on manual summary and analysis, so the quality inspection results have low reliability and the quality inspection analysis results are not objective, truthful or accurate.

Disclosure of Invention

The invention provides a voice quality inspection analysis method, a device, equipment and a medium, which realize the real-time reception of an audio file returned by a chest card and the quality inspection of the audio file, improve the accuracy of a quality inspection result, automatically analyze the quality inspection analysis result of insufficient business items and improve the service quality of subsequent customer service.

A voice quality inspection analysis method includes:

acquiring reservation client data and a business service list, wherein the reservation client data comprises a reservation identifier, client information and business items, and the business service list comprises a customer service identifier and a customer service list associated with the customer service identifier;

predicting, from the business service list, the customer service identifier matched with the client information and the business items, establishing a connection with a server corresponding to the customer service identifier through a Native connection method, sending the reservation client data to the server and adding the reservation client data to the customer service list associated with the customer service identifier, so that the server notifies a chest card associated with the customer service identifier to display the reservation identifier and start recording;

when receiving an audio stream file from the server, performing audio coding conversion on the received audio stream file by using an advanced audio coding algorithm to obtain an audio file associated with the reservation client data, removing the reservation client data from the customer service list, and causing the server to notify the chest card to clear the space;

performing audio quality inspection based on conversation emotion on the reservation client data and the audio file through a quality inspection detection model to obtain a quality inspection result corresponding to the business items;

and inputting the quality inspection result and each historical quality inspection result into a quality inspection clustering model, and carrying out graph clustering analysis through the quality inspection clustering model to obtain a quality inspection analysis result.

A voice quality inspection analysis apparatus comprising:

the acquisition module is used for acquiring reservation client data and a business service list, wherein the reservation client data comprises a reservation identifier, client information and business items, and the business service list comprises a customer service identifier and a customer service list associated with the customer service identifier;

the prediction module is used for predicting, from the business service list, the customer service identifier matched with the client information and the business items, establishing a connection with a server corresponding to the customer service identifier through a Native connection method, sending the reservation client data to the server and adding the reservation client data to the customer service list associated with the customer service identifier, so that the server notifies a chest card associated with the customer service identifier to display the reservation identifier and start recording;

the conversion module is used for, when receiving an audio stream file from the server, performing audio coding conversion on the received audio stream file by using an advanced audio coding algorithm to obtain an audio file associated with the reservation client data, removing the reservation client data from the customer service list, and causing the server to notify the chest card to clear the space;

the quality inspection module is used for performing audio quality inspection based on conversation emotion on the reservation client data and the audio file through a quality inspection detection model to obtain a quality inspection result corresponding to the business items;

and the analysis module is used for inputting the quality inspection results and each historical quality inspection result into a quality inspection clustering model and carrying out graph clustering analysis through the quality inspection clustering model to obtain quality inspection analysis results.

A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above voice quality inspection analysis method when executing the computer program.

A computer-readable storage medium storing a computer program which, when executed by a processor, carries out the steps of the above voice quality inspection analysis method.

The invention acquires reservation client data and a business service list; predicts, from the business service list, the customer service identifier matched with the client information and the business items; establishes a connection with the server corresponding to the customer service identifier through a Native connection method; and sends the reservation client data to the server and adds it to the customer service list associated with the customer service identifier, so that the server notifies the chest card associated with the customer service identifier to display the reservation identifier and start recording. When an audio stream file is received from the server, audio coding conversion is performed on it with an advanced audio coding algorithm to obtain an audio file associated with the reservation client data, the reservation client data is removed from the customer service list, and the server notifies the chest card to clear the space. Audio quality inspection based on conversation emotion is performed on the reservation client data and the audio file through a quality inspection detection model to obtain a quality inspection result corresponding to the business items. The quality inspection result and each historical quality inspection result are input into a quality inspection clustering model, and graph clustering analysis is performed through the quality inspection clustering model to obtain a quality inspection analysis result. In this way, the audio file returned by the chest card is received in real time and inspected, and the accuracy of the quality inspection result is improved by combining the text matching degree of the customer service with the emotional response of the client. Graph clustering analysis automatically produces the quality inspection analysis result for deficient business items, which improves the quality of subsequent customer service and the satisfaction of the clients.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.

FIG. 1 is a schematic diagram of an application environment of a voice quality inspection analysis method according to an embodiment of the present invention;

FIG. 2 is a flow chart of a voice quality inspection analysis method according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating the step S10 of the voice quality inspection analysis method according to an embodiment of the present invention;

FIG. 4 is a schematic block diagram of a voice quality inspection analyzer according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a computer device in an embodiment of the invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The voice quality inspection analysis method provided by the invention can be applied to the application environment shown in fig. 1, wherein a client (computer equipment or terminal) communicates with a server through a network. The client (computer device or terminal) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

In an embodiment, as shown in fig. 2, a voice quality inspection analysis method is provided, which mainly includes the following steps S10-S50:

S10, acquiring reservation client data and a business service list, wherein the reservation client data comprises a reservation identifier, client information and business items, and the business service list comprises a customer service identifier and a customer service list associated with the customer service identifier.

The reservation client data is the business-handling data associated with a client who has made a reservation. It comprises a reservation identifier, client information and business items. The reservation identifier is the unique identifier returned to the client when the reservation is made, the client information is basic information about the client, and the business items are the categories of business the client wishes to handle. The business service list comprises a customer service identifier and a customer service list associated with the customer service identifier; it is the set of client lists of all customer service agents currently on duty.

In an embodiment, as shown in fig. 3, before the step S10, that is, before acquiring the reservation client data and the business service list, the method includes:

S11, receiving a collected video; the collected video is a video collected by a video acquisition device and sent by the client.

The collected video is the video captured by the video acquisition device as clients enter and leave. The video acquisition device is installed near the entrance of the business handling area and monitors clients passing through the business handling doorway. It may capture a segment of video at regular intervals as the collected video and transmit it to the server through a preset interface. Alternatively, the server collects the open or closed state of the business handling doorway and transmits it to the video acquisition device through the preset interface, and the video acquisition device captures according to the state it detects through that interface: capture starts when the open state of the doorway is detected and stops when the closed state is detected, and the video recorded between the start of a capture and the next adjacent stop is taken as the collected video.
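The door-state-driven capture described above can be sketched as pairing each "open" event with the next "closed" event. This is only an illustrative sketch: the event format `(timestamp, state)`, the state strings and the function name `capture_intervals` are assumptions that do not appear in the source.

```python
from typing import Iterable, List, Tuple


def capture_intervals(events: Iterable[Tuple[float, str]]) -> List[Tuple[float, float]]:
    """Pair each 'open' event with the next 'closed' event (hypothetical event format)."""
    intervals: List[Tuple[float, float]] = []
    start = None
    for timestamp, state in events:
        if state == "open" and start is None:
            start = timestamp                      # doorway opened: start acquisition
        elif state == "closed" and start is not None:
            intervals.append((start, timestamp))   # doorway closed: stop acquisition
            start = None
    return intervals


# Example doorway states as they might be read from the preset interface.
door_events = [(0.0, "closed"), (3.5, "open"), (9.0, "closed"),
               (12.0, "open"), (12.5, "open"), (20.0, "closed")]
clips = capture_intervals(door_events)
print(clips)
```

Each returned pair marks one collected video: the span between a start of capture and the next adjacent stop.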

The collected video is received through the preset interface, which is the interface used for transmission between the server and the video acquisition device.

S12, performing framing processing on the collected video to obtain a plurality of frame images.

The collected video is a set of single-frame images played in time sequence. Framing processing splits the input video into single-frame images, taking either every frame or one frame per preset interval. Framing the collected video yields a plurality of frame images, each of which is one frame of the collected video.
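A minimal sketch of framing at a preset interval is given below. To stay self-contained it treats the video as an already decoded sequence of frames; in practice a video library would supply them. The name `split_frames` and the `interval` parameter are assumptions used only for illustration.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def split_frames(video: Sequence[Frame], interval: int = 1) -> List[Frame]:
    """Return every `interval`-th frame; interval=1 keeps every frame."""
    if interval < 1:
        raise ValueError("interval must be >= 1")
    return [video[i] for i in range(0, len(video), interval)]


# Stand-in for a decoded video: ten placeholder frames.
video = [f"frame_{i}" for i in range(10)]
every_frame = split_frames(video)
every_third = split_frames(video, interval=3)
print(every_third)
```

A larger interval reduces the number of images passed to the recognition step at the cost of possibly missing a person who is visible only briefly.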

S13, performing human body attribute recognition and face recognition on each frame image through a human body attribute detection model to obtain a recognition result for each frame image, wherein the recognition result comprises a client identifier and an access type corresponding to the client identifier.

The human body attribute detection model is a trained target detection network that recognizes the human bodies in an input image and identifies the access type and client identifier of each human body. The model may be trained as follows. Historical sample images are collected; each sample image contains several human bodies, and each human body is associated with a tag group comprising a human body region tag, an access tag and a client tag. The sample images are input into a human body attribute detection model containing initial parameters. The model performs single-human-body recognition on each sample image to identify the human body region of each single human body, extracts the movement features and face features of each human body region, and identifies the access type and client identifier of the region from the extracted features. A loss function is then applied to calculate a region loss value between each human body region and its human body region tag, an access loss value between the access type of the region and its access tag, and an identifier loss value between the client identifier of the region and its client tag. A total loss value is determined from the region loss value, the access loss value and the identifier loss value. While the total loss value has not reached the convergence condition, the initial parameters of the model are iteratively updated and the single-human-body recognition step is executed again; training stops once the total loss value reaches the convergence condition, which yields the trained human body attribute detection model.
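The source states only that the total loss value is determined from the three component losses and does not give the combining rule. One common choice, assumed here purely for illustration, is a weighted sum in which the weights λ are hypothetical hyperparameters:

```latex
\mathcal{L}_{\text{total}}
  = \lambda_{\text{region}}\,\mathcal{L}_{\text{region}}
  + \lambda_{\text{access}}\,\mathcal{L}_{\text{access}}
  + \lambda_{\text{id}}\,\mathcal{L}_{\text{id}}
```

Under this assumption, setting all three weights to 1 reduces the total loss to the plain sum of the region, access and identifier loss values.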

The human body attribute detection model comprises a target detection network, a human body attribute branch network and a face branch network. Human body attribute recognition and face recognition are performed on each frame image as follows. First, the target detection network identifies the region of each single human body in the frame image, and the image of each region is extracted to obtain a plurality of human body images. Second, the human body attribute branch network extracts the movement features of each human body image, that is, features indicating whether the photographed human body is moving with its front or its back toward the camera, and the access type of the human body image is identified from the extracted movement features. Third, the face branch network performs face recognition on the human body image to obtain its client identifier; the client identifier is either the unique identifier of a real client stored at the server or a new client identifier assigned to a new client not yet stored at the server. Finally, a correspondence is established between the client identifier and the access type of each human body image, so that each human body image has one client identifier and one access type, and the correspondences for all human body images in the frame image are collected to form the recognition result. The recognition result gives the access type and client identifier of every human body in a frame image, so that it can be determined which human bodies in the frame image are existing or new clients entering or leaving through the door.

In an embodiment, the step S13, that is, performing human body attribute recognition and face recognition on each frame image to obtain a recognition result for each frame image, the recognition result comprising a client identifier and an access type corresponding to the client identifier, includes:

performing human body recognition on the frame images to obtain a plurality of human body images.

The process of recognizing human bodies in the frame images may be implemented by a target detection network, a network model that identifies the coordinate region of each single human body in a frame image. The network structure of the target detection network may be set as required, for example Fast R-CNN, SSD or YOLO; preferably, it is CenterNet. The target detection network first scales the input frame image to a preset size, that is, it applies an image scaling technique that brings the long side and the short side of the image to the preset size with zero padding. The scaled frame image is then input into the target detection network, and human body features are extracted by the ResNet50 network in the CenterNet-based target detection network. The target detection network detects a human body as a point: the center point of the target region represents the human body target, and the offset of the center point and the width and height (size) of the human body target are predicted to obtain the actual human body region. Human body features are features specific to a human body, such as the head, hair, hands, face, trunk, clothes, legs and feet. The extracted human body features are then up-sampled, that is, deconvolved; the up-sampling may be performed three times to obtain a prediction feature map.

Finally, the prediction feature map is passed through three branch networks: a heatmap prediction network, a length-width prediction network and a region center offset prediction network. The heatmap prediction network performs human body prediction on the prediction feature map: it predicts the center point of the target region of each target object, calculates the radius of a Gaussian circle and, taking the center point as the circle center, lets the values decrease outward along the calculated radius according to a Gaussian function, which yields the heatmap corresponding to the prediction feature map. The length-width prediction network predicts the length and width regions of the human bodies to obtain the length-width map of the target objects corresponding to the prediction feature map. The region center offset prediction network predicts the offsets of the target objects to obtain the center offset value of each human body corresponding to the prediction feature map. The region of each human body is determined from the heatmap, the length-width map and the center offset value, which gives the human body images in the frame image.
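The center-point representation described above can be illustrated with a small pure-Python sketch: a Gaussian heatmap that peaks at a center point, and a decoder that combines the heatmap peak, a center offset and a width/height into a bounding box. The choice `sigma = radius / 3`, the output stride of 4, the convention that the size is given in input-image pixels, and the function names are all assumptions made for illustration; the source does not specify them.

```python
import math
from typing import List, Tuple


def gaussian_heatmap(height: int, width: int,
                     center: Tuple[int, int], radius: float) -> List[List[float]]:
    """Heatmap that is 1.0 at `center` (row, col) and decays as a Gaussian."""
    cy, cx = center
    sigma = radius / 3.0  # assumption: not stated in the source
    return [[math.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def decode_box(heatmap: List[List[float]], offset: Tuple[float, float],
               size: Tuple[float, float], stride: int = 4
               ) -> Tuple[float, float, float, float]:
    """Combine heatmap peak + (dx, dy) offset + (w, h) into (x1, y1, x2, y2)."""
    cells = ((y, x) for y in range(len(heatmap)) for x in range(len(heatmap[0])))
    peak_y, peak_x = max(cells, key=lambda p: heatmap[p[0]][p[1]])
    cx = (peak_x + offset[0]) * stride   # refine the peak, then map to input pixels
    cy = (peak_y + offset[1]) * stride
    w, h = size                          # assumed to be in input-image pixels
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


heatmap = gaussian_heatmap(8, 8, center=(3, 5), radius=2.0)
box = decode_box(heatmap, offset=(0.5, 0.25), size=(16.0, 24.0))
print(box)
```

A real detector would keep every local peak above a score threshold rather than the single global maximum used here, so that several people in one frame each produce a box.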

extracting the movement features of the human body image through a human body attribute branch network, and identifying the access type of the human body image according to the extracted movement features; the human body attribute detection model comprises the human body attribute branch network and a face branch network.

The human body attribute branch network is a trained network model that identifies the access type of the human body in an input image. Its network structure may be set as required, for example ResNet, CNN or VGG; preferably, it is VGG16. The human body attribute branch network extracts the movement features of the human body image by convolving the human body image; the movement features indicate whether the photographed human body is moving with its front or its back toward the camera. Binary classification is then performed on the feature map obtained by convolution to produce a probability distribution over the access types, which are entering and exiting, and the access type with the highest probability is recorded as the access type of the human body image.
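The final classification step can be sketched as a softmax over two scores followed by picking the most probable access type. The two-element logit vector stands in for the output of the branch network, and the label strings and function names are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

ACCESS_TYPES = ("enter", "exit")  # hypothetical label names


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify_access(logits: Sequence[float]) -> Tuple[str, float]:
    """Return the access type with the highest probability and that probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return ACCESS_TYPES[best], probs[best]


label, prob = classify_access([2.0, 0.5])
print(label, round(prob, 3))
```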

performing face recognition on the human body image through the face branch network to obtain the client identifier of the human body image.

The face branch network is a trained network model that recognizes the face in an input image. Its network structure may be set as required, for example ResNet, CNN or YOLO; preferably, it is YOLO v2. The face branch network extracts the face features of the human body image by convolving the human body image; the face features are features of the face such as the eyes, mouth, nose, eyebrows and hair. A face region is extracted from the feature map obtained by convolution, and the client identifier is determined by similarity matching of the face region against the face images of historical clients. The similarity matching may use cosine similarity, which measures how similar the face feature maps of two images are. The client identifier of the historical face image whose similarity is the highest and reaches a preset threshold is recorded as the client identifier of the human body image. If no face image reaches the preset threshold, the human body image belongs to a new client, and a client identifier is assigned to it according to the naming format for new clients.
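The threshold-based matching described above can be sketched with plain feature vectors. The gallery contents, the threshold of 0.8 and the new-client naming format `NEW-0003` are hypothetical; the source does not give concrete values.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def identify_client(face: Sequence[float], gallery: Dict[str, Sequence[float]],
                    threshold: float = 0.8, new_prefix: str = "NEW-") -> str:
    """Return the best-matching historical client ID, or a new ID below the threshold."""
    best_id, best_score = None, -1.0
    for client_id, features in gallery.items():
        score = cosine_similarity(face, features)
        if score > best_score:
            best_id, best_score = client_id, score
    if best_id is not None and best_score >= threshold:
        return best_id
    # No historical face is similar enough: treat as a new client (hypothetical format).
    return f"{new_prefix}{len(gallery) + 1:04d}"


gallery = {"C001": [1.0, 0.0, 0.0], "C002": [0.0, 1.0, 0.0]}
known = identify_client([0.9, 0.1, 0.0], gallery)
unknown = identify_client([0.5, 0.5, 0.7071], gallery)
print(known, unknown)
```

Raising the threshold reduces the chance of confusing two clients but causes more returning clients to be treated as new.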

The face images of historical clients are the face images stored at the server for users who have transacted business in the past.

establishing a correspondence between the client identifier and the access type of the human body image.

The invention thus performs human body recognition on the frame images to obtain a plurality of human body images; extracts the movement features of each human body image through the human body attribute branch network and identifies the access type of the human body image from the extracted movement features; performs face recognition on the human body image through the face branch network to obtain the client identifier of the human body image; and establishes the correspondence between the client identifier and the access type of the human body image. The access type and client identifier of each human body in a frame image are therefore identified automatically, without manual identification, so that clients entering through the business doorway can be distinguished quickly and the accuracy and efficiency of identification are improved.

S14, acquiring all the client identifiers whose access type is entering, performing deduplication on the acquired client identifiers to obtain deduplicated client identifiers, and searching a reservation database for the reservation client data corresponding to the deduplicated client identifiers.

The client identifiers whose access type is entering are selected from all the recognition results, and repeated items are removed so that each client identifier is kept only once, which gives the deduplicated client identifiers. The reservation database at the server is then searched for the reservation client data corresponding to the deduplicated client identifiers. The reservation database stores the data associated with client identifiers that have made reservations, and the reservation client data is the data a reserved client needs in order to handle the reserved business.
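The filtering, deduplication and lookup in step S14 can be sketched as follows. The shape of the recognition results (one list of `(client_id, access_type)` pairs per frame), the dictionary used as a reservation database and its field names are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def entering_clients(results: Iterable[Iterable[Tuple[str, str]]]) -> List[str]:
    """Collect client IDs whose access type is 'enter', once each, in first-seen order."""
    seen: Dict[str, None] = {}
    for frame_result in results:
        for client_id, access_type in frame_result:
            if access_type == "enter":
                seen.setdefault(client_id, None)
    return list(seen)


def find_reservations(client_ids: Iterable[str],
                      reservation_db: Dict[str, dict]) -> Dict[str, dict]:
    """Look up reservation client data for the IDs that have a reservation."""
    return {cid: reservation_db[cid] for cid in client_ids if cid in reservation_db}


recognition_results = [
    [("C001", "enter"), ("C002", "exit")],   # frame 1
    [("C001", "enter"), ("C003", "enter")],  # frame 2: C001 appears again
]
reservation_db = {"C001": {"reservation_id": "R-17", "business_item": "loan"}}

deduplicated = entering_clients(recognition_results)
reserved = find_reservations(deduplicated, reservation_db)
print(deduplicated, reserved)
```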

The invention thus receives the collected video, which is collected by a video acquisition device and sent by the client; performs framing processing on the collected video to obtain a plurality of frame images; performs human body attribute recognition and face recognition on each frame image through the human body attribute detection model to obtain the recognition result of each frame image; acquires all the client identifiers whose access type is entering and deduplicates them to obtain the deduplicated client identifiers; and searches the reservation database for the reservation client data corresponding to the deduplicated client identifiers.

S20, predicting, from the business service list, the customer service identifier matched with the client information and the business items, establishing a connection with the server corresponding to the customer service identifier through a Native connection method, sending the reservation client data to the server and adding the reservation client data to the customer service list associated with the customer service identifier, so that the server notifies the chest card associated with the customer service identifier to display the reservation identifier and start recording.

A customer service matrix is generated from the client information and the business items. The customer service matrix and the business service list are input into a service distribution prediction model, which performs matching prediction on them and predicts the optimal customer service identifier. A connection with the server corresponding to the customer service identifier is then established through the Native connection method, that is, in the manner of a Native application connecting to an H5 page: a connection is initiated to the H5 page of the server, the reservation client data is sent to that H5 page, and the received reservation client data is added to the customer service list on the H5 page with which the connection is established.

The server notifies the chest card associated with the customer service identifier to display the reservation identifier and start recording as follows. The server connects to the chest card over Bluetooth. Once the Bluetooth connection between the server and the chest card is detected, the server notifies the chest card to display the reservation identifier and, using a segmented recording mode, instructs the chest card to start recording. When a recording-finished response fed back for the start instruction is detected, the server obtains the list of recorded files and downloads the audio stream file corresponding to the file list.

In an embodiment, in the step S20, the causing the server to notify the chest card associated with the customer service identifier to display the reservation identifier and start recording includes:

and establishing Bluetooth connection between the server and the chest card by using an asymmetric encryption algorithm according to the Bluetooth code and the connection key associated with the customer service identifier.

Understandably, each customer service identifier is associated with one Bluetooth code and one connection key, so the Bluetooth code and the connection key associated with the customer service identifier can be obtained. The Bluetooth code is the unique identifier assigned to the Bluetooth device of the chest card worn by the customer service agent corresponding to the customer service identifier; it may be the Media Access Control address (MAC address) of the Bluetooth device of the chest card. The connection key is the key required to connect to the Bluetooth device. The Bluetooth connection process is as follows: the server encrypts pairing data with the public key corresponding to the connection key to obtain encrypted pairing data and sends the encrypted pairing data by broadcast; each chest card that receives the encrypted pairing data decrypts it based on its connection key to obtain decrypted pairing data and feeds the decrypted pairing data back to the server; the server determines the chest card that returns the correctly decrypted pairing data to be the chest card of the customer service identifier and establishes a Bluetooth connection with it.

The asymmetric encryption algorithm uses two keys, a public key and a private key, which form a pair: data encrypted with the public key can be decrypted only with the corresponding private key, and data encrypted with the private key can be decrypted only with the corresponding public key.
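The pairing exchange described above can be sketched as follows. This is a minimal illustration only: it uses textbook RSA with tiny primes (which is not secure), it assumes the connection key plays the role of the private key, and names such as `KeyPair`, `make_keypair`, and `find_matching_badges` are hypothetical and do not come from the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KeyPair:
    n: int  # modulus, part of both the public and the private key
    e: int  # public exponent
    d: int  # private exponent (assumed here to be the badge's connection key)


def make_keypair(p: int, q: int, e: int = 17) -> KeyPair:
    """Build a toy textbook-RSA key pair from two small primes."""
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)  # modular inverse (Python 3.8+)
    return KeyPair(n=p * q, e=e, d=d)


def encrypt(message: int, key: KeyPair) -> int:
    """Encrypt with the public part (n, e) of the key pair."""
    return pow(message, key.e, key.n)


def decrypt(cipher: int, key: KeyPair) -> int:
    """Decrypt with the private part (n, d) of the key pair."""
    return pow(cipher, key.d, key.n)


def find_matching_badges(pairing_data: int, target: KeyPair,
                         badges: Dict[str, KeyPair]) -> List[str]:
    """Broadcast encrypted pairing data and collect the badges that answer correctly."""
    cipher = encrypt(pairing_data, target)  # server side: public key only
    matches = []
    for badge_id, key in badges.items():
        # Each badge decrypts the broadcast with its own private key.
        if decrypt(cipher, key) == pairing_data:
            matches.append(badge_id)
    return matches
```

A real deployment would rely on a vetted cryptographic library and on the pairing mechanisms of the Bluetooth stack rather than on hand-written arithmetic; the sketch only shows why only the holder of the matching private key can return the correct pairing data.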

And after the Bluetooth connection between the server and the chest card is detected, the server informs the chest card to display the reservation identification.

Understandably, after the Bluetooth connection between the server and the chest card is detected, the server sends a display instruction containing the reservation identifier to the connected chest card, and on receiving the display instruction the chest card shows the reservation identifier in its display area or on its display, so that the customer can quickly find the customer service agent who will serve them.

And sending a start instruction to the chest card through the server by using a segmented recording mode, so that the chest card starts recording.

Understandably, after the chest card displays the reservation identifier, the server sends a start instruction for starting recording; the start instruction specifies the segmented recording mode and causes the chest card to start its recording function. In the segmented recording mode, the storage of the Bluetooth device is divided into numbered partitions and each recording occupies its own run of partitions: when a partition is full, recording automatically jumps to the partition with the next serial number, and if the last partition is not full when recording ends, the next recording still starts in the partition following that partly filled partition. This ensures that a given range of partitions stores exactly one recording. After the customer service agent corresponding to the customer service identifier has finished the conversation with the customer corresponding to the customer identifier, the agent touches the end button of the chest card, which automatically triggers a recording-end response and sends it to the server.

And when the recording-end response fed back for the start instruction is detected, acquiring the file list of the finished recording through the server.

Understandably, the recording-end response includes the start instruction, the partition at which recording started in response to the start instruction, and the partition at which storage ended when the recording-end response was triggered. The server takes the former as the starting partition and the latter as the ending partition, generates an acquisition instruction from the starting partition and the ending partition, and sends it to the chest card. On receiving the acquisition instruction, the chest card generates the file list from the partitions between the starting partition and the ending partition; the file list is a list of the storage spaces (partitions) that hold the recording content stored between the starting partition and the ending partition. The chest card sends the file list to the server, and the server receives it, thereby obtaining the file list of the finished recording.
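The segmented recording mode and the derivation of the file list can be illustrated with a small model. This is a sketch under stated assumptions: partitions are numbered from 0, all have the same capacity, and the class `SegmentedRecorder` and the function `build_file_list` are hypothetical names introduced only for illustration.

```python
from typing import List, Optional, Tuple


class SegmentedRecorder:
    """Hypothetical model of a chest card whose storage is split into numbered partitions."""

    def __init__(self, partition_count: int, partition_capacity: int) -> None:
        self.partition_count = partition_count
        self.partition_capacity = partition_capacity
        self.used = [0] * partition_count          # bytes stored per partition
        self.last_partition: Optional[int] = None  # last partition touched by a recording

    def record(self, length: int) -> Tuple[int, int]:
        """Store `length` bytes and return (starting partition, ending partition)."""
        # A new recording starts in the partition after the last one used,
        # even if that last partition was only partly filled.
        start = 0 if self.last_partition is None else self.last_partition + 1
        current, remaining = start, length
        while True:
            if current >= self.partition_count:
                raise RuntimeError("storage exhausted; the chest card must clear its space")
            stored = min(remaining, self.partition_capacity)
            self.used[current] = stored
            remaining -= stored
            if remaining == 0:
                break
            current += 1  # partition full: jump to the next serial number
        self.last_partition = current
        return start, current

    def clear(self) -> None:
        """Free all partitions, as when the server tells the chest card to clear its space."""
        self.used = [0] * self.partition_count
        self.last_partition = None


def build_file_list(start_partition: int, end_partition: int) -> List[int]:
    """List every partition from the starting to the ending partition, inclusive."""
    return list(range(start_partition, end_partition + 1))
```

For example, with partitions of 100 bytes, a 250-byte recording occupies partitions 0 to 2, and the next recording begins at partition 3 even though partition 2 is only half full.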

And downloading the audio stream file corresponding to the file list through the server, and acquiring the audio stream file.

Understandably, in the downloading process the server sends a read instruction containing the file list and a read code to the chest card; the chest card executes the read instruction on receipt and sends the audio stream file read from the storage listed in the file list to the server, so that the server obtains the audio stream file. The audio stream file is the audio data in PCM (Pulse Code Modulation) format produced by the recording.

The invention thus establishes the Bluetooth connection between the server and the chest card by applying an asymmetric encryption algorithm according to the Bluetooth code and the connection key associated with the customer service identifier; after the Bluetooth connection is detected, the server notifies the chest card to display the reservation identifier; a start instruction is sent to the chest card through the server using a segmented recording mode so that the chest card starts recording; when the recording-end response fed back for the start instruction is detected, the file list of the finished recording is obtained through the server; and the audio stream file corresponding to the file list is downloaded through the server. Establishing the Bluetooth connection with an asymmetric encryption algorithm improves connection security, and the segmented recording mode stores the audio stream files in the storage area of the chest card in an orderly manner for download by the server, so that accurate audio stream files are obtained.

S30, when receiving the audio stream file from the server, using the advanced audio coding algorithm to perform audio coding conversion on the received audio stream file to obtain the audio file associated with the reserved client data, removing the reserved client data from the client service list, and enabling the server to inform the chest card to clear the space.

Understandably, the Advanced Audio Coding (AAC) algorithm compresses and encodes audio data in PCM format into audio data in AAC format. Audio data in PCM format is uncompressed raw audio data, whereas audio data in AAC format is lossy compressed data; the audio file is a file in AAC format. Removing the reserved client data from the client service list indicates that the client with the reservation identifier has been served, and the server notifies the chest card to clear its space so that sufficient space is available for the next recording.
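One possible way to perform the PCM-to-AAC conversion is to call the `ffmpeg` command-line tool. The sketch below is an assumption, not the encoder named in the description: it presumes that `ffmpeg` is installed, that the raw PCM is signed 16-bit little-endian, and that the sample rate and channel count shown are the ones used by the chest card; the function names are hypothetical.

```python
import subprocess
from typing import List


def build_aac_command(pcm_path: str, aac_path: str,
                      sample_rate: int = 16000, channels: int = 1) -> List[str]:
    """Build an ffmpeg command that encodes raw PCM into an AAC file."""
    return [
        "ffmpeg", "-y",           # overwrite the output file if it exists
        "-f", "s16le",            # input format: raw signed 16-bit little-endian PCM (assumed)
        "-ar", str(sample_rate),  # input sample rate in Hz (assumed)
        "-ac", str(channels),     # number of input channels (assumed)
        "-i", pcm_path,
        "-c:a", "aac",            # encode with ffmpeg's AAC encoder
        aac_path,
    ]


def convert_pcm_to_aac(pcm_path: str, aac_path: str) -> None:
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_aac_command(pcm_path, aac_path), check=True)
```

Raw PCM carries no header, so the input format, sample rate, and channel count must be supplied explicitly; they have to match what the recording device actually produced.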

And S40, performing audio quality inspection based on conversation emotion on the reservation client data and the audio file through a quality inspection detection model to obtain a quality inspection result corresponding to the business item.

Understandably, the conversation-emotion-based audio quality inspection is a quality inspection process that performs emotion recognition on the text of the conversation to identify emotion categories and compares the text of the conversation with standard words, phrases, and term explanations. In the audio quality inspection process, a voice segmentation algorithm is applied to perform role segmentation on the audio file, and voice recognition is performed on the role-segmented audio file to obtain a customer service text file and a client text file. Applying a context semantic recognition algorithm, business keyword comparison is performed on the customer service text file through a text quality inspection model to obtain a customer service quality inspection result, and emotion semantic recognition is performed on the client text file through an emotion detection model to obtain a client emotion result. The customer service quality inspection result and the client emotion result are then integrated to obtain the quality inspection result.

The context semantic recognition algorithm is an algorithm that recognizes words according to a forward semantic prediction mode and a reverse semantic prediction mode.

In one embodiment, the step S40 of performing audio quality inspection based on conversation emotion on the reserved client data and the audio file through a quality inspection detection model to obtain a quality inspection result corresponding to the business items includes:

and performing role segmentation processing on the audio file by using a voice segmentation algorithm, and performing voice recognition on the audio file subjected to the role segmentation processing to obtain a customer service text file and a customer text file.

Understandably, the voice segmentation algorithm segments the audio file into a plurality of recording segments and then performs role identification on each recording segment to distinguish the recording segments of the customer service agent from those of the client. In the segmentation processing, segmentation points in the audio file are detected with a BIC (Bayesian Information Criterion) algorithm, and the audio between adjacent segmentation points is filtered with a VAD (Voice Activity Detection) method to obtain the recording segments. The VAD method performs VAD detection on the audio between two segmentation points: if a voice endpoint is detected, the audio is left unchanged; if no voice endpoint is detected, the audio between the two segmentation points is removed. The segmentation processing thus yields a plurality of recording segments that contain speech, removes the silent intervals, and keeps only the useful recording segments. After the audio file has been divided into recording segments, an audio sample corresponding to the customer service identifier is obtained; the audio sample is previously recorded speech of the customer service agent. A role identification model compares each recording segment with the audio sample to obtain the similarity between them; recording segments whose similarity is greater than or equal to a preset similarity threshold are marked as the customer service role, and the remaining recording segments are marked as the client role. The role identification model is a trained model that calculates the similarity between an input audio segment and an input audio sample and judges from the similarity whether the segment belongs to the customer service agent or to the client. It extracts the voiceprint features of the recording segment and of the audio sample respectively and compares them to obtain the similarity between the audio sample and the recording segment. The voiceprint features are features of the sound wave spectrum of a person's voice, and the preset similarity threshold is a preset threshold that the similarity must meet, for example 92% or 95%.
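The threshold-based role marking can be sketched as follows. The description does not state how the similarity is computed, so cosine similarity between voiceprint feature vectors is an assumption here, as are the function names and the label strings; extraction of the voiceprint features themselves is outside the sketch.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two feature vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def label_roles(segment_features: Sequence[Sequence[float]],
                agent_sample_feature: Sequence[float],
                threshold: float = 0.92) -> List[str]:
    """Mark each recording segment as customer service or client by similarity."""
    labels = []
    for feature in segment_features:
        similarity = cosine_similarity(feature, agent_sample_feature)
        # Segments at or above the preset threshold belong to the customer service agent.
        labels.append("customer_service" if similarity >= threshold else "client")
    return labels
```

The default threshold of 0.92 mirrors the 92% example given above; in practice it would be tuned on labelled recordings.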

The voice recognition performed on the role-segmented audio file proceeds as follows: all recording segments marked as the customer service role are spliced together, and the spliced segments are converted to text with a speech recognition technology (Automatic Speech Recognition, ASR, a technology that converts human speech into text) to obtain the customer service text file; likewise, all recording segments marked as the client role are spliced together and converted to text with the speech recognition technology to obtain the client text file.

And acquiring a text quality inspection model corresponding to the business items in the reserved client data.

Understandably, the reserved client data includes the business items, which are the businesses that the reserved client needs to handle. One business item corresponds to one text quality inspection model, and the text quality inspection model is a trained model that identifies the keywords of the corresponding business item in an input text and inspects the quality of the text by comparison.

And comparing the service keywords of the customer service text file through the acquired text quality inspection model to obtain a customer service quality inspection result.

Understandably, the text quality inspection model performs word segmentation on the customer service text file, that is, it divides the text into words of the smallest unit to obtain a plurality of word segmentation units, and then performs business keyword recognition on each word segmentation unit to identify the keywords related to the business item. The business keyword comparison compares the identified keywords with the business template words stored in the text quality inspection model to determine the coverage rate of the identified keywords over all the business template words. The business template words are templates of the characters or words related to the business corresponding to the business item, and the coverage rate obtained by the comparison is recorded as the customer service quality inspection result.
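The coverage rate can be illustrated with a short sketch. It assumes that word segmentation has already produced a list of tokens and that a keyword counts as identified when it appears among the business template words; the function name and the sample words are hypothetical.

```python
from typing import Iterable, Set


def keyword_coverage(tokens: Iterable[str], template_words: Set[str]) -> float:
    """Fraction of business template words that occur among the segmented tokens."""
    if not template_words:
        return 0.0
    # Keywords identified in the text are the tokens that are also template words.
    identified = set(tokens) & template_words
    return len(identified) / len(template_words)
```

For instance, if the template words are "interest", "rate", "term", and "fee" and the agent's text mentions the first three, the coverage rate is 0.75.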

And performing emotion recognition on the client text file through an emotion detection model to obtain a client emotion result.

Understandably, the emotion detection model is a trained neural network model for recognizing emotion of an input text, and the emotion semantic recognition process is to perform word vector labeling, part of speech labeling and tone labeling on the client text file to obtain labeling information; then, performing emotion semantic recognition on the tagged information by using a context semantic recognition algorithm to obtain an emotion semantic result; performing emotion intonation recognition on the labeled information to obtain an emotion intonation result; and determining the emotion result of the client according to the emotion semantic result and the emotion intonation result, wherein the emotion result of the client reflects the emotion of the client in the service process.

In an embodiment, the performing emotion semantic recognition on the client text file through an emotion detection model to obtain a client emotion result includes:

and performing word vector labeling, part of speech labeling and tone labeling on the client text file through the emotion detection model to obtain labeling information associated with the client text file.

Understandably, the word vector labeling is a labeling process that uses the Word2vec technique (a word embedding technique) to convert each character or word into its corresponding vector in a preset dictionary; performing the word vector labeling on the client text file yields word vector labeling information. The part-of-speech labeling is a labeling process that labels each character or word with its part of speech; performing it on the client text file yields part-of-speech labeling information. The tone labeling is a labeling process that labels each character with its tone (level or oblique); performing it on every character in the client text file yields tone labeling information. The word vector labeling information, the part-of-speech labeling information, and the tone labeling information are recorded as the labeling information.

And performing emotion semantic recognition on the labeled information through the emotion detection model by using a context semantic recognition algorithm to obtain an emotion semantic result.

Understandably, the context semantic recognition algorithm recognizes the emotion category of the semantics of each word vector by combining the word vectors in the forward direction and the reverse direction with their parts of speech. In the emotion semantic recognition, the context semantic recognition algorithm is applied to each item of word vector labeling information and its corresponding part-of-speech labeling information in the labeling information to recognize the probability distribution over the emotion categories to which it belongs; the emotion categories of all word vectors are then clustered to obtain the emotion category with the highest probability, which is recorded as the emotion semantic result.
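The final aggregation step can be sketched as below. The description does not specify how the per-word distributions are clustered, so this sketch substitutes a simple sum of probabilities per emotion category followed by taking the largest total; the function name and the category labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Mapping, Sequence


def aggregate_emotion(word_distributions: Sequence[Mapping[str, float]]) -> str:
    """Return the emotion category with the highest summed probability."""
    if not word_distributions:
        raise ValueError("at least one per-word distribution is required")
    totals: Dict[str, float] = defaultdict(float)
    for distribution in word_distributions:
        for category, probability in distribution.items():
            totals[category] += probability  # accumulate evidence per category
    return max(totals, key=totals.get)
```

Summation is only one plausible reading of the clustering step; a trained model might weight words differently or vote on the per-word top category instead.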

And performing emotion intonation recognition on the labeled information through the emotion detection model to obtain an emotion intonation result.

Understandably, the emotion intonation recognition is a recognition process that convolves over the sequence of tone labeling information in the labeling information, extracts the intonation features of each sentence, and determines the emotion category from the extracted intonation features, thereby obtaining the emotion intonation result. The intonation features are the emotion-related regularities reflected by the tones within a sentence, and the emotion intonation result reflects the emotion category of the client text file in the intonation dimension.

And determining the emotion result of the client according to the emotion semantic result and the emotion intonation result.

Understandably, combining the emotion semantic result and the emotion intonation result to obtain the emotion result of the client.

And recording the customer service quality inspection result and the client emotion result as the quality inspection result.

Understandably, the customer service quality inspection result and the client emotion result are paired one to one and merged into a one-dimensional array, which is determined as the quality inspection result. The quality inspection result reflects both the coverage rate achieved by the customer service agent and the reaction of the client.

The invention thus performs role segmentation on the audio file with a voice segmentation algorithm and performs voice recognition on the role-segmented audio file to obtain a customer service text file and a client text file; acquires the text quality inspection model corresponding to the business items in the reserved client data; performs business keyword comparison on the customer service text file through the acquired text quality inspection model to obtain a customer service quality inspection result; performs emotion recognition on the client text file through an emotion detection model to obtain a client emotion result; and records the customer service quality inspection result and the client emotion result as the quality inspection result. In this way, the voice segmentation algorithm automatically distinguishes the roles in the audio file and separates out the customer service text file and the client text file, the corresponding text quality inspection model is acquired automatically, business keyword comparison and emotion semantic recognition are carried out with a context semantic recognition algorithm, and the quality inspection result is output to provide data for the subsequent quality inspection analysis.

And S50, inputting the quality inspection results and each historical quality inspection result into a quality inspection clustering model, and carrying out graph clustering analysis through the quality inspection clustering model to obtain quality inspection analysis results.

Understandably, the quality inspection clustering model is a trained model for identifying, in a service graph, the business items in which service is deficient. In the graph clustering analysis, graph nodes of the service graph are established for the quality inspection result and each historical quality inspection result based on the corresponding business items, node values and side lengths are assigned to the graph nodes based on the quality inspection results, and graph clustering is then performed on the service graph to obtain the quality inspection analysis result, which identifies the deficient business items.

The invention thus acquires the reserved client data and the business service list; predicts from the business service list the customer service identifier that matches the customer information and the business items; establishes a connection with the server corresponding to the customer service identifier through a Native connection method; sends the reserved client data to the server and adds it to the client service list associated with the customer service identifier, so that the server notifies the chest card associated with the customer service identifier to display the reservation identifier and start recording; when an audio stream file is received from the server, performs audio coding conversion on it with an advanced audio coding algorithm to obtain the audio file associated with the reserved client data, removes the reserved client data from the client service list, and causes the server to notify the chest card to clear its space; performs audio quality inspection based on conversation emotion on the reserved client data and the audio file through a quality inspection detection model to obtain a quality inspection result corresponding to the business items; and inputs the quality inspection result and each historical quality inspection result into a quality inspection clustering model, which performs graph clustering analysis to obtain a quality inspection analysis result. By receiving the audio file returned by the chest card in real time, inspecting it, and combining the text matching degree of the customer service agent with the emotional response of the client, the accuracy of the quality inspection result is improved; and the graph clustering analysis automatically identifies the business items in which service is deficient, which helps improve the quality of subsequent customer service and the satisfaction of the clients.

In an embodiment, the step S50, namely, inputting the quality inspection result and each historical quality inspection result into a quality inspection clustering model, and performing graph clustering analysis by using the quality inspection clustering model to obtain a quality inspection analysis result, includes:

and establishing the quality inspection result and a graph node of the service graph of each historical quality inspection result based on the service items corresponding to the quality inspection result.

Understandably, a historical quality inspection result is a quality inspection result generated in the past, preferably one generated between 00:00 of the current day and the current time. The business items corresponding to the quality inspection result and to each historical quality inspection result are classified in the business item dimension, the quality inspection results belonging to the same business item category are connected to the central point of that business item, and the connected quality inspection results serve as graph nodes, thereby constructing the service graph.

And assigning values to the graph nodes based on the customer service quality inspection results in the quality inspection results to obtain the node values of the graph nodes.

Understandably, according to the customer service quality inspection result in the quality inspection result corresponding to each graph node, each graph node is given a value mapped from that customer service quality inspection result, which serves as the node value of the graph node. For example, the percentage of the customer service quality inspection result is mapped to a level in the range of 1 to 10, for example, 82% is mapped into 8, and 69% is mapped into 9.

And performing node measurement on each graph node based on the client emotion result in the quality inspection result, and determining the side length of each graph node.

Understandably, according to the client emotion result in the quality inspection result corresponding to each graph node, each graph node is given metric values mapped from the client emotion result, and the metric values determine the side length of the graph node. For example, if the client emotion result is [joy-joy], the metric values are [10-10] and the side length is 20 (by addition) or 100 (by multiplication); if the client emotion result is [anger-peace], the metric values are [1-5] and the side length is 6 (by addition) or 5 (by multiplication).
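The side-length computation in these examples can be written out directly. The metric values for joy, peace, and anger are taken from the examples above; the table name `EMOTION_METRIC`, the function name, and the `mode` parameter are hypothetical, and other emotion categories would need their own values.

```python
from math import prod
from typing import Sequence

# Metric values taken from the examples in the text; other categories are not specified.
EMOTION_METRIC = {"joy": 10, "peace": 5, "anger": 1}


def side_length(emotions: Sequence[str], mode: str = "add") -> int:
    """Combine the metric values of a client emotion result into a side length."""
    values = [EMOTION_METRIC[emotion] for emotion in emotions]
    if mode == "add":
        return sum(values)
    if mode == "multiply":
        return prod(values)
    raise ValueError("mode must be 'add' or 'multiply'")
```

Whichever combination rule is chosen should be applied consistently to every graph node, since the side lengths are later compared across nodes.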

And carrying out graph clustering on the service map added with the node value and the side length to obtain the quality inspection analysis result.

Understandably, once the node value and the side length of every graph node have been set, graph clustering is performed on the service graph: the node value and the side length of each graph node are multiplied, mean clustering is applied to the products of the graph nodes attached to the same central point to obtain the value of each central point, and the business item corresponding to the central point with the smallest value is recorded as the quality inspection analysis result, that is, the business item identified as deficient.
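A minimal sketch of this computation follows, reading "mean clustering" as the arithmetic mean of the products per central point, which is an assumption. The service graph is represented as a mapping from business item to a list of (node value, side length) pairs; the function name and the sample business items are hypothetical.

```python
from statistics import mean
from typing import Dict, Mapping, Sequence, Tuple


def find_deficient_item(
    service_graph: Mapping[str, Sequence[Tuple[float, float]]]
) -> Tuple[str, Dict[str, float]]:
    """Return the business item with the lowest central-point value and all values."""
    center_values = {
        # Value of a central point: mean of node value x side length over its nodes.
        item: mean(node_value * length for node_value, length in nodes)
        for item, nodes in service_graph.items()
        if nodes
    }
    if not center_values:
        raise ValueError("the service graph contains no graph nodes")
    deficient = min(center_values, key=center_values.get)
    return deficient, center_values
```

With two nodes (8, 20) and (6, 6) under one business item and (9, 100) and (7, 20) under another, the central-point values are 98 and 520, so the first business item is reported as deficient.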

The invention realizes that the quality inspection result and the graph node of the service graph of each historical quality inspection result are established based on the service items corresponding to the quality inspection result; assigning values to the graph nodes based on the customer quality inspection results in the quality inspection results to obtain node values of the graph nodes; performing node measurement on each graph node based on a client emotion result in a quality inspection result, and determining the side length of each graph node; and carrying out graph clustering on the service graph added with the node value and the side length to obtain the quality inspection analysis result, so that the quality inspection analysis result can be automatically identified through establishing the service graph, assigning values and node measurement to graph nodes and carrying out graph clustering, and the accuracy and reliability of the quality inspection analysis result output are improved.

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.

In an embodiment, a voice quality inspection analysis device is provided, and the voice quality inspection analysis device corresponds one to one to the voice quality inspection analysis method in the above embodiments. As shown in fig. 4, the voice quality inspection analysis device includes an obtaining module 11, a prediction module 12, a conversion module 13, a quality inspection module 14, and an analysis module 15. The functional modules are explained in detail as follows:

an obtaining module 11, configured to obtain reservation client data and a service list, where the reservation client data includes a reservation identifier, client information, and service items, and the service list includes a customer service identifier and a customer service list associated with the customer service identifier;

the prediction module 12 is configured to predict, from the service list, the customer service identifier matching both the customer information and the service items, establish a connection with the server corresponding to the customer service identifier through a Native connection method, send the reserved client data to the server, add the reserved client data to the customer service list associated with the customer service identifier, and cause the server to notify the chest card associated with the customer service identifier to display the reservation identifier and start recording;

a conversion module 13, configured to, when receiving an audio stream file from the server, perform audio coding conversion on the received audio stream file by using an advanced audio coding algorithm to obtain an audio file associated with the reserved client data, remove the reserved client data from the client service list, and cause the server to notify the chest card to clear its space;

a quality inspection module 14, configured to perform audio quality inspection based on conversation emotion on the reserved client data and the audio file through a quality inspection detection model, so as to obtain a quality inspection result corresponding to the service items;

and the analysis module 15 is used for inputting the quality inspection results and each historical quality inspection result into a quality inspection clustering model, and performing graph clustering analysis through the quality inspection clustering model to obtain quality inspection analysis results.

For specific limitations of the voice quality inspection analysis device, reference may be made to the limitations of the voice quality inspection analysis method above, which are not repeated here. All or part of the modules in the voice quality inspection analysis device can be implemented by software, hardware, or a combination thereof. The modules can be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, a computer device is provided, which may be a client or a server, and its internal structure diagram may be as shown in fig. 5. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a readable storage medium and an internal memory. The readable storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the readable storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a voice quality inspection analysis method.

In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the voice quality inspection analysis method in the above embodiments is implemented.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the voice quality inspection analysis method in the above-described embodiments.

It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

It will be apparent to those skilled in the art that, for convenience and brevity of description, the above division into functional units and modules is given only as an example. In practical applications, the functions may be allocated to different functional units and modules as needed; that is, the internal structure of the device may be divided into different functional units or modules to perform all or part of the functions described above.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims

1. A voice quality inspection analysis method is characterized by comprising the following steps:

2. The voice quality inspection analysis method of claim 1, wherein, before the obtaining of the reserved client data and the business service list, the method comprises:

receiving a captured video, the captured video being captured by a video capture device and sent by a client;

splitting the captured video into frames to obtain a plurality of frame images;

performing human body attribute recognition and face recognition on each frame image through a human body attribute detection model to obtain a recognition result of each frame image, wherein the recognition result comprises a client identifier and an entry and exit type corresponding to the client identifier;

and acquiring all client identifiers corresponding to the entry and exit type, deduplicating the acquired client identifiers to obtain deduplicated client identifiers, and looking up the reserved client data corresponding to the deduplicated client identifiers in a reservation database.

3. The voice quality inspection analysis method according to claim 2, wherein the performing human body attribute recognition and face recognition on each frame image to obtain a recognition result of each frame image, the recognition result including a client identifier and an entry and exit type corresponding to the client identifier, comprises:

performing human body recognition on the frame images to obtain a plurality of human body images;

extracting movement features of each human body image through a human body attribute branch network, and identifying the entry and exit type of the human body image according to the extracted movement features, wherein the human body attribute detection model comprises the human body attribute branch network and a face branch network;

performing face recognition on the human body image through the face branch network to obtain the client identifier of the human body image;

and establishing a correspondence between the client identifier and the entry and exit type of the human body image.

4. The voice quality inspection analysis method of claim 1, wherein the instructing the server to notify the chest card associated with the customer service identifier to display the reservation identifier and start recording comprises:

applying an asymmetric encryption algorithm, according to the Bluetooth code and the connection key associated with the customer service identifier, so that the server establishes a Bluetooth connection with the chest card;

after detecting that the server has established the Bluetooth connection with the chest card, causing the server to notify the chest card to display the reservation identifier;

sending, through the server, a start instruction to the chest card in a segmented recording mode, so that the chest card starts recording;

upon detecting a recording-finished response fed back in reply to the start instruction, acquiring, through the server, the file list produced by the recording;

5. The voice quality inspection analysis method according to claim 1, wherein the performing of conversation-emotion-based audio quality inspection on the reserved client data and the audio file through the quality inspection detection model to obtain a quality inspection result corresponding to the business item comprises:

performing role segmentation processing on the audio file by using a voice segmentation algorithm, and performing voice recognition on the audio file subjected to the role segmentation processing to obtain a customer service text file and a customer text file;

acquiring a text quality inspection model corresponding to the business item in the reserved client data;

comparing the service keywords of the customer service text file through the obtained text quality inspection model to obtain a customer service quality inspection result;

performing emotion recognition on the client text file through an emotion detection model to obtain a client emotion result;

6. The voice quality inspection analysis method of claim 5, wherein the performing of emotion recognition on the client text file through an emotion detection model to obtain a client emotion result comprises:

performing word vector labeling, part-of-speech labeling, and tone labeling on the client text file through the emotion detection model to obtain labeling information associated with the client text file;

performing emotion semantic recognition on the labeling information through the emotion detection model, using a context semantic recognition algorithm, to obtain an emotion semantic result;

performing emotion intonation recognition on the labeling information through the emotion detection model to obtain an emotion intonation result;

7. The voice quality inspection analysis method according to claim 1, wherein the inputting of the quality inspection result and each historical quality inspection result into a quality inspection clustering model, and the performing of graph clustering analysis through the quality inspection clustering model to obtain a quality inspection analysis result, comprises:

establishing, in a service graph, a graph node for the quality inspection result and for each historical quality inspection result, based on the service items corresponding to the quality inspection results;

assigning values to the graph nodes based on the customer service quality inspection results in the quality inspection results to obtain node values of the graph nodes;

performing node measurement on each graph node based on the client emotion results in the quality inspection results, and determining the edge length of each graph node;

8. A voice quality inspection analysis device, comprising:

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the voice quality inspection analysis method according to any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, implements the voice quality inspection analysis method according to any one of claims 1 to 7.
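As an informal illustration only, and forming no part of the claims, the keyword comparison recited in claim 5 can be sketched as follows. The sketch assumes that the text quality inspection model selected for a business item reduces to a list of required phrases; the `KEYWORDS_BY_ITEM` table, the function name `check_service_text`, and the hit-rate measure are hypothetical and chosen for the example.

```python
# Hypothetical keyword lists, one per business item. A real text quality
# inspection model would likely be richer than a plain phrase list.
KEYWORDS_BY_ITEM = {
    "loan": ["interest rate", "repayment", "risk"],
    "card": ["annual fee", "credit limit"],
}


def check_service_text(business_item, service_text):
    """Compare a customer service transcript with the required keywords.

    Returns the share of required keywords that appear in the transcript
    (case-insensitive substring match) and the list of missing keywords.
    """
    keywords = KEYWORDS_BY_ITEM[business_item]
    lowered = service_text.lower()
    missing = [kw for kw in keywords if kw not in lowered]
    return {
        "business_item": business_item,
        "hit_rate": (len(keywords) - len(missing)) / len(keywords),
        "missing": missing,
    }
```

For example, `check_service_text("loan", "The interest rate is 4%, and repayment is monthly.")` reports `"risk"` as missing, which a reviewer could treat as a gap in the customer service script for that business item.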