CN112966082A - Audio quality inspection method, device, equipment and storage medium

Info

Publication number: CN112966082A
Application number: CN202110253354.7A
Authority: CN (China)
Legal status: Pending
Prior art keywords: audio, text, model, quality inspection, customer service
Inventors: 赵情恩, 曾新贵, 熊新雷, 陈蓉, 肖岩
Applicant and current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Other languages: Chinese (zh)

Classifications

    • G PHYSICS; G06 COMPUTING, CALCULATING OR COUNTING; G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/35 Information retrieval of unstructured textual data: clustering; classification
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3343 Query execution using phonetics
    • G06F16/3344 Query execution using natural language analysis
    • G06F40/284 Natural language analysis: lexical analysis, e.g. tokenisation or collocates
    • G06F40/30 Semantic analysis

Abstract

The application discloses an audio quality inspection method, device, equipment and storage medium, relating to artificial intelligence fields such as speech recognition, natural language processing and deep learning. One embodiment of the method comprises: acquiring dialogue audio that records a conversation between a customer and a customer service; performing voice separation on the dialogue audio to obtain a first audio and a second audio, each containing only one speaker; performing speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; determining the roles of the first text and the second text, and selecting the text corresponding to the customer service; and performing semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio. This embodiment enables fully automatic audio quality inspection.

Description

Audio quality inspection method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the field of computers, in particular to the field of artificial intelligence such as voice recognition, natural language processing and deep learning, and particularly relates to an audio quality inspection method, device, equipment and storage medium.
Background
Quality inspection in a call center mainly aims to check the working quality of customer service and effectively improve its overall level and quality. The quality inspector is a standard post in a call center, responsible for supervising service, finding problems, summarizing experience, proposing suggestions and overseeing improvement.
Typically, a quality inspector randomly samples a portion of the large volume of conversation audio between customers and customer service, listens to the recordings, and scores the conversation content according to a given scoring-rule template.
Disclosure of Invention
The embodiment of the application provides an audio quality inspection method, an audio quality inspection device, audio quality inspection equipment and a storage medium.
In a first aspect, an embodiment of the present application provides an audio quality inspection method, including: acquiring dialogue audio, wherein the dialogue audio records a conversation between a customer and a customer service; performing voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio each contain only one speaker; performing speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; determining the roles of the first text and the second text, and selecting the text corresponding to the customer service; and performing semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In a second aspect, an embodiment of the present application provides an audio quality inspection apparatus, including: an acquisition module configured to acquire dialogue audio, wherein the dialogue audio records a conversation between a customer and a customer service; a separation module configured to perform voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio each contain only one speaker; a recognition module configured to perform speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; a determination module configured to determine the roles of the first text and the second text and select the text corresponding to the customer service; and a classification module configured to perform semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described in any one of the implementations of the first aspect.
In a fifth aspect, the present application provides a computer program product, which includes a computer program that, when executed by a processor, implements the method as described in any implementation manner of the first aspect.
According to the audio quality inspection method, device, equipment and storage medium provided by the embodiments of the application, voice separation is first performed on the acquired dialogue audio to obtain a first audio and a second audio; speech recognition is then performed on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; next, the roles of the first text and the second text are determined and the text corresponding to the customer service is selected; finally, semantic classification is performed on the text corresponding to the customer service to obtain the quality inspection result of the dialogue audio, so that fully automatic audio quality inspection can be realized.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of an audio quality inspection method according to the present application;
FIG. 3 is a flow chart of yet another embodiment of an audio quality inspection method according to the present application;
FIG. 4 is an application scenario diagram of an audio quality inspection method that can implement an embodiment of the present application;
FIG. 5 is a schematic diagram of an embodiment of an audio quality inspection device according to the present application;
fig. 6 is a block diagram of an electronic device for implementing the audio quality inspection method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the audio quality inspection method or audio quality inspection apparatus of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104 to send or receive data such as recorded conversation audio. Various client applications, such as a recording application and an audio quality inspection application, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102 and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above electronic devices and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. This is not particularly limited here.
The server 105 may provide various services. For example, the server 105 may analyze and process conversation audio acquired from the terminal apparatuses 101, 102, 103, and generate a processing result (e.g., a quality inspection result of the conversation audio).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the audio quality inspection method provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the audio quality inspection apparatus is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an audio quality inspection method according to the present application is shown. The audio quality inspection method comprises the following steps:
Step 201, acquiring dialogue audio.
In the present embodiment, the executing body of the audio quality inspection method (e.g., the server 105 shown in fig. 1) may acquire the dialogue audio, i.e., audio that records a conversation between a customer and a customer service.
Typically, when a call center receives an incoming call from a customer, the call can be automatically assigned to a customer service. When the customer establishes the call with the customer service, the terminal device of the customer service (for example, the terminal devices 101, 102, 103 shown in fig. 1) may start its recording function to record the conversation between the customer and the customer service until the call ends, yielding the dialogue audio. Businesses that sell products (e.g., physical goods, virtual services, etc.) typically set up a call center to provide pre-sale and after-sale services for their products. To improve the service quality of customer service, an enterprise needs to perform quality inspection on the recorded dialogue audio; based on the quality inspection results, favorable practices are refined and promoted, while unfavorable practices are supervised and corrected. For rapidly developing enterprises, the traffic volume of the call center keeps increasing, and inspecting the full volume of dialogue audio would require a huge workload. To improve quality inspection efficiency, a portion of the dialogue audio is sampled proportionally for inspection. For example, with an average dialogue duration of about 6 minutes, 1%-2% of the full set of dialogue audio may be randomly sampled for quality inspection.
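The proportional random-sampling step described above can be sketched as follows; the function name `sample_for_inspection`, the 2% default rate and the fixed seed are illustrative assumptions, not details from the patent:

```python
import random

def sample_for_inspection(audio_ids, rate=0.02, seed=42):
    """Randomly draw a fraction of the dialogue-audio IDs for quality inspection."""
    rng = random.Random(seed)          # fixed seed only for reproducibility of the sketch
    k = max(1, int(len(audio_ids) * rate))  # inspect at least one recording
    return rng.sample(audio_ids, k)

# 1000 recordings sampled at 2% -> 20 recordings picked for inspection
picked = sample_for_inspection([f"call_{i:04d}.wav" for i in range(1000)], rate=0.02)
```

In practice the sampled IDs would then be fed into the separation/recognition/classification pipeline described in the following steps.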
Step 202, performing voice separation on the dialogue audio to obtain a first audio and a second audio.
In this embodiment, the executing body may perform voice separation on the dialogue audio to obtain the first audio and the second audio, each of which contains only one speaker.
Since the dialogue audio records a conversation between the customer and the customer service, it usually contains two speakers. Because the voiceprints of different speakers differ, voice separation based on voiceprints can split out a first audio and a second audio that each contain only one speaker, either the customer or the customer service. For example, the first audio may be the customer's audio and the second audio the customer service's audio.
It should be noted that voice separation can cut out audio containing only one speaker, but it cannot by itself identify which specific speaker the audio contains.
Step 203, performing speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio.
In this embodiment, the executing body may perform speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio.
Specifically, speech recognition technology may convert the spoken content in the first audio and the second audio into corresponding words, yielding the first text and the second text. The first text contains the words spoken in the first audio; the second text contains the words spoken in the second audio.
Step 204, determining the roles of the first text and the second text, and selecting the text corresponding to the customer service.
In this embodiment, the executing body may perform role determination on the first text and the second text, label the role corresponding to each, and then select the text corresponding to the customer service.
Specifically, the content of the first text and the second text may be analyzed to determine their corresponding roles. For example, the role corresponding to a text that contains a welcome phrase or a closing phrase is typically the customer service. As another example, a text containing mostly questions about a product typically corresponds to the customer, while a text containing mostly answers about a product typically corresponds to the customer service.
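The keyword-cue heuristic above can be sketched as follows; the cue phrases and the function name are illustrative assumptions (a later embodiment of the patent replaces this heuristic with a trained role determination model):

```python
def guess_service_text(text_a, text_b):
    """Pick which of two transcripts more likely belongs to the customer service,
    based on simple keyword cues: greetings and closings are usually spoken
    by the agent, not the customer. Ties default to text_a (an assumption)."""
    agent_cues = ["welcome", "thank you for calling", "how may i help",
                  "have a nice day"]
    def score(text):
        t = text.lower()
        return sum(t.count(cue) for cue in agent_cues)
    return text_a if score(text_a) >= score(text_b) else text_b
```

A model-based classifier would replace `score` with learned features, but the selection logic (keep the transcript labeled as customer service) stays the same.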
Step 205, performing semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In this embodiment, the executing body may perform semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio. The quality inspection result characterizes the service quality of the customer service in the conversation.
In some optional implementations of this embodiment, a set of bonus categories and a set of deduction categories may be preset. Semantic classification of the text corresponding to the customer service then determines at least one bonus category and at least one deduction category to which the text belongs, from which the quality inspection result of the dialogue audio is obtained. A bonus category is a positive category worth promoting; a deduction category is a negative category that needs correction. Taking quality inspection of service-flow compliance and language compliance as an example, the set of bonus categories may include speaking the welcome phrase according to the standard, speaking the closing phrase according to the standard, confirming customer information, calming an emotionally upset complaining customer, and the like. The set of deduction categories may include using forbidden service phrases, passively recommending products, aggressively recommending products, misleading or deceiving customers, and the like.
The quality inspection result of the dialogue audio is then determined based on the at least one bonus category and at least one deduction category to which the text corresponding to the customer service belongs. For example, these categories may be used directly as the quality inspection result: the bonus categories can be refined and promoted, and the deduction categories supervised and corrected, thereby improving the service quality of customer service. As another example, each bonus category in the set may be labeled with a corresponding bonus score, and each deduction category with a corresponding deduction score. The difference between the sum of the bonus scores of the matched bonus categories and the sum of the deduction scores of the matched deduction categories can then be computed and used as the quality inspection result of the dialogue audio. Generally, the larger the difference, the higher the service quality of the customer service in the conversation; the smaller the difference, the lower the service quality. Dialogue audio with higher service quality can be refined and promoted, while dialogue audio with lower service quality is supervised and corrected, thereby improving the service quality of customer service.
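Under the score-difference variant just described, the computation can be sketched as follows; the category names and point values are illustrative assumptions, not values given in the patent:

```python
# Illustrative bonus/deduction tables; real deployments would tune these.
BONUS_SCORES = {"standard_welcome": 2, "standard_closing": 2,
                "confirmed_customer_info": 3, "calmed_complaint": 5}
DEDUCTION_SCORES = {"forbidden_phrase": 5, "aggressive_recommendation": 3,
                    "misleading_customer": 10}

def inspection_score(bonus_cats, deduction_cats):
    """Quality inspection result = sum of bonus points minus sum of deduction points."""
    bonus = sum(BONUS_SCORES[c] for c in bonus_cats)
    deduction = sum(DEDUCTION_SCORES[c] for c in deduction_cats)
    return bonus - deduction

# Welcome and closing spoken to standard, one aggressive recommendation: 2 + 2 - 3 = 1
```

A higher score then indicates higher service quality in that conversation, matching the "larger difference, higher quality" rule above.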
In some optional implementations of this embodiment, the executing body may further aggregate the quality inspection results of multiple dialogue audios of the same customer service and track that agent's service quality over time, forming historical service data for the agent. The aggregated results can also be used for the agent's performance assessment.
According to the audio quality inspection method provided by this embodiment, voice separation is first performed on the acquired dialogue audio to obtain a first audio and a second audio; speech recognition is then performed on them to obtain the corresponding first text and second text; next, role determination selects the text corresponding to the customer service; finally, semantic classification of that text yields the quality inspection result of the dialogue audio, so that fully automatic audio quality inspection can be realized. Whether in voice separation, speech recognition, role determination or the final quality inspection, labor cost is greatly reduced: audio can be analyzed quickly and problems can be located accurately, guaranteeing stable and efficient customer-service working quality. Compared with manual audio quality inspection, the method reduces the time consumed by quality inspection, improves its efficiency, lowers its cost, increases its accuracy and eliminates its subjectivity. It can support a large amount of complex quality inspection work, adapting to the rapid growth of an enterprise.
With further reference to fig. 3, a flow 300 of yet another embodiment of an audio quality inspection method according to the present application is shown. The audio quality inspection method comprises the following steps:
Step 301, acquiring dialogue audio.
In this embodiment, the specific operation of step 301 has been described in detail in step 201 of the embodiment shown in fig. 2 and is not repeated here.
Step 302, inputting the dialogue audio into a pre-trained human voice separation model to obtain a first audio and a second audio.
In this embodiment, the executing body of the audio quality inspection method (e.g., the server 105 shown in fig. 1) may input the dialogue audio into a pre-trained human voice separation model to obtain the first audio and the second audio. Segmenting the audio of different roles with a voice separation model reduces the cost of manual listening, discrimination and labeling. The human voice separation model may include, but is not limited to, AI models such as x-vector with Agglomerative Hierarchical Clustering (Xvector-AHC), a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM), obtained by training a neural network on a set of training samples. The training samples may be sample dialogue audios labeled with their speakers.
In some optional implementations of this embodiment, the human voice separation model is Xvector-AHC, which comprises an x-vector network and AHC clustering. The corresponding voice separation steps may include:
first, the dialogue audio is divided into a plurality of audio pieces.
Typically, the dialogue audio is segmented uniformly. For example, a 10-second dialogue audio segmented every 500 milliseconds yields 20 audio segments.
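The uniform segmentation step can be sketched as follows (window boundaries in milliseconds; the 500 ms default matches the example above):

```python
def uniform_segments(duration_ms, win_ms=500):
    """Split a recording of duration_ms into consecutive fixed-length windows.
    The last, possibly shorter, remainder window is kept."""
    bounds = []
    start = 0
    while start < duration_ms:
        end = min(start + win_ms, duration_ms)
        bounds.append((start, end))
        start = end
    return bounds

# A 10-second recording split every 500 ms -> 20 segments
assert len(uniform_segments(10_000)) == 20
```

Each `(start, end)` pair would then be used to slice the waveform before feature extraction.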
Then, the audio segments are each input into the x-vector network to obtain a feature for each segment.
The network structure of a typical x-vector model comprises, in order, frame-level layers, a statistics pooling layer, segment-level layers and an activation (softmax) layer. Here, the activation layer of the trained neural network is removed, and the x-vector feature output by the segment-level layer serves as the feature of the audio segment.
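As a worked illustration of the statistics pooling operation between the frame level and segment level, the following pure-Python sketch concatenates per-dimension mean and standard deviation over all frames, turning a variable-length frame sequence into a fixed-size vector (real x-vector systems apply this to TDNN activations, not raw features):

```python
from math import sqrt

def stats_pooling(frames):
    """Statistics pooling: concatenate the per-dimension mean and (population)
    standard deviation over all frames of a segment."""
    dims = len(frames[0])
    n = len(frames)
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
            for d in range(dims)]
    return means + stds  # fixed size: 2 * dims, regardless of n
```

Because the output size is independent of the number of frames, segments of any duration map to vectors the segment-level layers can consume.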
Then, the features of the audio segments are clustered using AHC, and the category of each segment is determined from the clustering result.
In general, hierarchical clustering algorithms fall into two kinds according to how they cluster: top-down (divisive) and bottom-up (agglomerative). A bottom-up algorithm initially treats each sample as a separate cluster, then merges clusters successively until only one remains, eventually producing a tree structure whose root is a cluster containing all sample points and whose leaves are clusters of single samples. Clusters are merged based on a distance metric: in each iteration the two closest clusters are merged into one. Here, each sample is the feature of one audio segment. The root of the tree obtained by clustering the segment features with AHC has two child nodes; the features of segments under the same child node are similar, while those under different child nodes differ. Thus, the segments corresponding to one child node belong to one category and those corresponding to the other child node belong to the other category.
Finally, the audio segments of the same category are combined to obtain the first audio and the second audio.
Generally, segments of the same category are combined in the order in which they occur in the dialogue audio. The segments corresponding to one child node are combined into the first audio, and those corresponding to the other child node into the second audio.
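The merge-until-two-clusters procedure and the final recombination can be sketched in plain Python on toy one-dimensional features; real systems cluster high-dimensional x-vector features, so both the scalar features and average-linkage distance here are illustrative simplifications:

```python
def ahc_two_clusters(features):
    """Bottom-up agglomerative clustering of scalar segment features into two
    clusters (average linkage). Returns a cluster label (0 or 1) per segment."""
    clusters = [[i] for i in range(len(features))]
    def dist(a, b):  # average-linkage distance between two clusters
        return abs(sum(features[i] for i in a) / len(a)
                   - sum(features[j] for j in b) / len(b))
    while len(clusters) > 2:
        # find and merge the closest pair of clusters
        pairs = [(dist(clusters[a], clusters[b]), a, b)
                 for a in range(len(clusters)) for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a].extend(clusters.pop(b))
    labels = [0] * len(features)
    for i in clusters[1]:
        labels[i] = 1
    return labels

def merge_by_speaker(segments, labels):
    """Concatenate segments of the same cluster in their original time order."""
    first = [s for s, lab in zip(segments, labels) if lab == 0]
    second = [s for s, lab in zip(segments, labels) if lab == 1]
    return first, second
```

With voiceprint-like features, the two returned lists correspond to the first audio and the second audio.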
Step 303, inputting the first audio and the second audio into a pre-trained speech recognition model respectively to obtain a first text and a second text.
In this embodiment, the executing body may input the first audio and the second audio into a pre-trained speech recognition model to obtain the first text and the second text, respectively. Recognizing the audio content with an end-to-end speech recognition model greatly improves the efficiency of content acquisition. The speech recognition model may include, but is not limited to, AI models such as LSTM-CTC (Long Short-Term Memory with Connectionist Temporal Classification), a Gaussian Mixture Model (GMM) or a Hidden Markov Model (HMM), obtained by training a neural network on a set of training samples. Here each training sample may include a sample audio and its corresponding sample text.
In some optional implementations of this embodiment, the speech recognition model is LSTM-CTC, which comprises an LSTM network and a CTC layer. The corresponding speech recognition steps include:
first, the first audio and the second audio are respectively input to the LSTM, and the characteristics of the first audio and the second audio are obtained.
The LSTM is a time-cycle neural network, can avoid the problems of gradient extinction and gradient explosion in a common cycle neural network, and has the main core idea that: channels called "cell state" (state) are used throughout the time sequence. Information is removed or added to the cellular state by designing the structure of the "gate". Among them, there are three gates in the LSTM, which are the "forgetting gate", the "input gate", and the "output gate", respectively.
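As a worked illustration of the three gates, a single-unit LSTM step can be written in plain Python; the scalar weight dictionary and its key names are illustrative assumptions, not parameters from the patent:

```python
from math import exp, tanh

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One step of a single-unit LSTM cell. w holds scalar weights/biases for the
    forget (f), input (i) and output (o) gates and the candidate value (g)."""
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])  # forget gate
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])  # input gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])  # output gate
    g = tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])     # candidate state
    c = f * c_prev + i * g   # cell state: old information kept/forgotten, new added
    h = o * tanh(c)          # hidden state exposed to the next layer
    return h, c
```

The cell state `c` is the "channel" running through the time sequence, and the three sigmoids are exactly the forget, input and output gates named above.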
Then, the features of the first audio and the second audio are each input into the CTC layer to obtain the first text and the second text.
CTC is used mainly to solve the alignment problem between input features and output labels.
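The alignment behavior of CTC can be illustrated by its greedy-decoding collapse rule, sketched below; the "-" blank symbol is an assumed convention for the example:

```python
def ctc_collapse(frame_labels, blank="-"):
    """CTC decoding rule: collapse consecutive repeated labels, then drop blanks.
    This is how a long per-frame label sequence is aligned to a shorter transcript."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# A frame sequence like "hh-e-ll-llo" collapses to the transcript "hello"
```

Note that the blank separates genuine repeated characters (the two l's of "hello") from frame-level repeats of one character, which is the core of the alignment problem CTC solves.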
Step 304, inputting the first text and the second text into a pre-trained role determination model to obtain the role corresponding to each text, and selecting the text corresponding to the customer service.
In this embodiment, the executing body may input the first text and the second text into a pre-trained role determination model to obtain the role corresponding to the first text and the role corresponding to the second text, and then select the text corresponding to the customer service. Determining the role with a model gives a better effect and higher robustness than keyword matching. The role determination model may include, but is not limited to, AI models such as TextCNN (text-level convolutional neural network), CharCNN (character-level convolutional neural network), RCNN (recurrent convolutional neural network), Transformer, ELMo (Embeddings from Language Models) or BERT (Bidirectional Encoder Representations from Transformers), obtained by training a neural network on a set of training samples. The training samples may be sample texts labeled with their roles.
Step 305: input the text corresponding to the customer service into a pre-trained semantic classification model to obtain a quality inspection result.
In this embodiment, the execution subject may input the text corresponding to the customer service into a pre-trained semantic classification model to obtain a quality inspection result. Using a semantic classification model for content classification achieves the quality inspection effect. The semantic classification model may include, but is not limited to, AI models such as BERT, ELMO, TextCNN, CharCNN, RCNN, and Transformer, each obtained by training a neural network with a training sample set. The training samples in the training sample set may include sample customer service texts annotated with quality inspection results.
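One way a semantic classifier's output could be mapped to a quality inspection result is sketched below; the label set ("ok", "rude", "missing_greeting") and the threshold are hypothetical, not taken from the patent:

```python
def quality_result(probs, labels, threshold=0.5):
    """Map semantic-classification probabilities for an agent transcript
    to a quality-inspection verdict. Any label other than "ok" whose
    probability clears the threshold is flagged as a violation."""
    flagged = {lab: p for lab, p in zip(labels, probs)
               if lab != "ok" and p >= threshold}
    return {"pass": not flagged, "violations": flagged}
```

A dialogue passes inspection only when no violation category is confidently predicted; otherwise the flagged categories (and their scores) localize the problem for a human reviewer.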
In some optional implementations of this embodiment, the semantic classification model may be BERT. BERT is a bidirectional Transformer model that finely characterizes the semantic relations between contexts, so a better semantic classification result is obtained, achieving the purpose of quality inspection.
According to the audio quality inspection method provided by this embodiment of the application, the audio of different roles is segmented by the human voice separation model, reducing the cost of manually listening, distinguishing, and marking; the audio content is then recognized by an end-to-end speech recognition model, greatly improving the efficiency of content acquisition; the role is then determined by the role determination model, which gives better results and higher robustness than keyword matching; finally, the semantic classification model performs content classification, achieving the quality inspection effect.
This embodiment of the application provides an intelligent quality inspection mode that takes AI technology as its application core and uses AI to replace standardized work. Massive amounts of dialogue audio are monitored according to the quality inspection work content, scored according to set rules, standardized analysis documents are generated, and problematic dialogue audio is accurately located. Moreover, all dialogue audio can be quality-inspected, with no blind spots, in a short time. The requirements of quality inspectors, such as fairness, quality control, and combination with business knowledge, are met. Compared with manual quality inspection, intelligent quality inspection delivers stable and efficient work quality and reduces the workload of quality inspectors in analyzing basic data. Therefore, full-coverage and real-time quality inspection with AI technology has obvious advantages over manual quality inspectors.
For ease of understanding, fig. 4 shows a diagram of an application scenario of the audio quality inspection method according to an embodiment of the present application. As shown in fig. 4, the dialogue audio is first input into the XVector-AHC for voice separation to obtain the first audio and the second audio. Then, the first audio and the second audio are respectively input into the LSTM-CTC for speech recognition to obtain the first text corresponding to the first audio and the second text corresponding to the second audio. Then, the first text and the second text are respectively input into the TextCNN for role determination, and the text corresponding to the customer service is selected. Finally, the text corresponding to the customer service is input into the BERT for intelligent quality inspection to obtain the quality inspection result.
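The fig. 4 pipeline can be sketched end to end; the four models are passed in as callables (XVector-AHC, LSTM-CTC, TextCNN, BERT in the patent), and every name and signature below is illustrative rather than a real API:

```python
def inspect_dialogue(dialogue_audio, separator, asr, role_clf, semantic_clf):
    """End-to-end sketch of the fig. 4 pipeline with the four models
    injected as callables; names and signatures are assumptions."""
    first_audio, second_audio = separator(dialogue_audio)  # voice separation
    texts = [asr(first_audio), asr(second_audio)]          # speech recognition
    roles = [role_clf(t) for t in texts]                   # role determination
    agent_text = texts[roles.index("customer_service")]    # keep agent's side
    return semantic_clf(agent_text)                        # quality inspection
```

Structuring the flow this way keeps each stage swappable, which matches the claims: any listed alternative (GMM, HMM, CharCNN, ...) can replace a stage without touching the rest of the pipeline.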
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of an audio quality inspection apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied to various electronic devices.
As shown in fig. 5, the audio quality inspection apparatus 500 of the present embodiment may include: an acquisition module 501, a separation module 502, a recognition module 503, a determination module 504, and a classification module 505. The acquisition module 501 is configured to acquire a conversation audio, wherein the conversation audio records a conversation between a client and a customer service; the separation module 502 is configured to perform voice separation on the conversation audio to obtain a first audio and a second audio, where the first audio and the second audio each contain only one speaker; the recognition module 503 is configured to perform speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; the determination module 504 is configured to perform role determination on the first text and the second text and select the text corresponding to the customer service; and the classification module 505 is configured to perform semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the conversation audio.
In the audio quality inspection apparatus 500 of the present embodiment, for the specific processing of the acquisition module 501, the separation module 502, the recognition module 503, the determination module 504, and the classification module 505, and the technical effects thereof, reference may be made to the related descriptions of steps 201 to 205 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the separation module 502 includes: a separation submodule configured to input the dialogue audio into a pre-trained human voice separation model to obtain the first audio and the second audio, wherein the human voice separation model comprises one of the following: voiceprint model-agglomerative hierarchical clustering XVector-AHC, Gaussian mixture model GMM, hidden Markov model HMM.
In some optional implementations of this embodiment, the human voice separation model is an Xvector-AHC, and the Xvector-AHC includes an Xvector and an AHC; and the separation sub-module is further configured to: divide the dialogue audio into a plurality of audio segments; respectively input the plurality of audio segments into the XVector to obtain features of the plurality of audio segments; cluster the features of the plurality of audio segments using the AHC, and determine the categories of the plurality of audio segments based on the clustering result; and combine audio segments of the same category to obtain the first audio and the second audio.
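The clustering-and-merging step of the separation sub-module can be sketched with SciPy's agglomerative hierarchical clustering; the per-segment embeddings stand in for x-vectors, and the function name and two-cluster assumption are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def separate_two_speakers(segment_embeddings, segments):
    """Cluster per-segment speaker embeddings (e.g. x-vectors) into two
    groups with agglomerative hierarchical clustering (AHC), then gather
    the raw audio segments of each group."""
    X = np.asarray(segment_embeddings)
    Z = linkage(X, method="average", metric="cosine")  # build the AHC tree
    labels = fcluster(Z, t=2, criterion="maxclust")    # cut into two clusters
    first = [seg for seg, lab in zip(segments, labels) if lab == 1]
    second = [seg for seg, lab in zip(segments, labels) if lab == 2]
    return first, second
```

In a real diarization system the merged segments would then be concatenated in time order to form the first audio and the second audio.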
In some optional implementations of this embodiment, the recognition module 503 is further configured to: respectively input the first audio and the second audio into a pre-trained speech recognition model to obtain the first text and the second text, wherein the speech recognition model comprises one of the following: long short-term memory network-connectionist temporal classification LSTM-CTC, GMM, HMM.
In some optional implementations of this embodiment, the determination module 504 is further configured to: respectively input the first text and the second text into a pre-trained role determination model to obtain the role corresponding to the first text and the role corresponding to the second text, wherein the role determination model comprises one of the following: text-level convolutional neural network TextCNN, character-level convolutional neural network CharCNN, region convolutional neural network RCNN, Transformer, deep contextualized word representation model ELMO, and Bidirectional Encoder Representations from Transformers BERT.
In some optional implementations of this embodiment, the classification module 505 is further configured to: inputting a text corresponding to customer service into a pre-trained semantic classification model to obtain a quality inspection result, wherein the semantic classification model comprises one of the following items: BERT, ELMO, TextCNN, CharCNN, RCNN, Transformer.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 6 illustrates a schematic block diagram of an example electronic device 600 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 can also be stored. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
A number of components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, a mouse, or the like; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 601 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 601 performs the various methods and processes described above, such as the audio quality inspection method. For example, in some embodiments, the audio quality inspection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into the RAM 603 and executed by the computing unit 601, one or more steps of the audio quality inspection method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the audio quality inspection method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (15)

1. An audio quality inspection method, comprising:
obtaining conversation audio, wherein the conversation audio records a conversation between a client and a customer service;
carrying out voice separation on the conversation audio to obtain a first audio and a second audio, wherein the first audio and the second audio each comprise only one speaker;
performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio;
judging roles of the first text and the second text, and selecting a text corresponding to the customer service;
and carrying out semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the conversation audio.
2. The method of claim 1, wherein the separating the dialogue audio into the first audio and the second audio comprises:
inputting the dialogue audio into a pre-trained human voice separation model to obtain the first audio and the second audio, wherein the human voice separation model comprises one of the following items: voiceprint model-agglomerative hierarchical clustering XVector-AHC, Gaussian mixture model GMM, hidden Markov model HMM.
3. The method of claim 2, wherein the human voice separation model is an Xvector-AHC, the Xvector-AHC comprising an Xvector and an AHC; and
the inputting the dialogue audio into a pre-trained human voice separation model to obtain the first audio and the second audio comprises:
dividing the dialog audio into a plurality of audio segments;
respectively inputting the audio segments into an XVector to obtain the characteristics of the audio segments;
clustering features of the plurality of audio clips by using AHC, and determining categories of the plurality of audio clips based on clustering results;
and combining the audio clips of the same category to obtain the first audio and the second audio.
4. The method of claim 1, wherein the performing speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio comprises:
inputting the first audio and the second audio into a pre-trained speech recognition model respectively to obtain the first text and the second text, wherein the speech recognition model comprises one of the following items: long short-term memory network-connectionist temporal classification LSTM-CTC, GMM, HMM.
5. The method of claim 1, wherein the role determination of the first text and the second text comprises:
inputting the first text and the second text into a pre-trained role determination model respectively to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role determination model comprises one of the following items: text-level convolutional neural network TextCNN, character-level convolutional neural network CharCNN, region convolutional neural network RCNN, Transformer, deep contextualized word representation model ELMO, and Bidirectional Encoder Representations from Transformers BERT.
6. The method of claim 1, wherein the semantically classifying the text corresponding to the customer service to obtain the quality inspection result of the dialogue audio comprises:
inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain the quality inspection result, wherein the semantic classification model comprises one of the following items: BERT, ELMO, TextCNN, CharCNN, RCNN, Transformer.
7. An audio quality inspection apparatus comprising:
an acquisition module configured to acquire conversation audio, wherein the conversation audio records a conversation between a customer and a customer service;
the separation module is configured to perform voice separation on the conversation audio to obtain a first audio and a second audio, wherein the first audio and the second audio each contain only one speaker;
the recognition module is configured to perform voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio;
the judging module is configured to perform role judgment on the first text and the second text and select a text corresponding to the customer service;
and the classification module is configured to perform semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
8. The apparatus of claim 7, wherein the separation module comprises:
a separation submodule configured to input the dialogue audio into a pre-trained human voice separation model, resulting in the first audio and the second audio, wherein the human voice separation model comprises one of: voiceprint model-agglomerative hierarchical clustering XVector-AHC, Gaussian mixture model GMM, hidden Markov model HMM.
9. The apparatus of claim 8, wherein the vocal separation model is an Xvector-AHC, the Xvector-AHC comprising an Xvector and an AHC; and
the separation submodule is further configured to:
dividing the dialog audio into a plurality of audio segments;
respectively inputting the audio segments into an XVector to obtain the characteristics of the audio segments;
clustering features of the plurality of audio clips by using AHC, and determining categories of the plurality of audio clips based on clustering results;
and combining the audio clips of the same category to obtain the first audio and the second audio.
10. The apparatus of claim 7, wherein the identification module is further configured to:
inputting the first audio and the second audio into a pre-trained speech recognition model respectively to obtain the first text and the second text, wherein the speech recognition model comprises one of the following items: long short-term memory network-connectionist temporal classification LSTM-CTC, GMM, HMM.
11. The apparatus of claim 7, wherein the determination module is further configured to:
inputting the first text and the second text into a pre-trained role determination model respectively to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role determination model comprises one of the following items: text-level convolutional neural network TextCNN, character-level convolutional neural network CharCNN, region convolutional neural network RCNN, Transformer, deep contextualized word representation model ELMO, and Bidirectional Encoder Representations from Transformers BERT.
12. The apparatus of claim 7, wherein the classification module is further configured to:
inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain the quality inspection result, wherein the semantic classification model comprises one of the following items: BERT, ELMO, TextCNN, CharCNN, RCNN, Transformer.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
15. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-6.
CN202110253354.7A 2021-03-05 2021-03-05 Audio quality inspection method, device, equipment and storage medium Pending CN112966082A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110253354.7A CN112966082A (en) 2021-03-05 2021-03-05 Audio quality inspection method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112966082A true CN112966082A (en) 2021-06-15

Family

ID=76276927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110253354.7A Pending CN112966082A (en) 2021-03-05 2021-03-05 Audio quality inspection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112966082A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436625A (en) * 2021-06-25 2021-09-24 安徽淘云科技股份有限公司 Man-machine interaction method and related equipment thereof
CN113628627A (en) * 2021-08-11 2021-11-09 广东电网有限责任公司广州供电局 Electric power industry customer service quality inspection system based on structured voice analysis
CN113709313A (en) * 2021-08-31 2021-11-26 平安科技(深圳)有限公司 Intelligent quality inspection method, device, equipment and medium for customer service call data
CN113836344A (en) * 2021-09-30 2021-12-24 广州艾美网络科技有限公司 Personalized song file generation method and device and music singing equipment
CN114299957A (en) * 2021-11-29 2022-04-08 北京百度网讯科技有限公司 Voiceprint separation method and device, electronic equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110335621A (en) * 2019-05-28 2019-10-15 深圳追一科技有限公司 Method, system and the relevant device of audio processing
CN111368130A (en) * 2020-02-26 2020-07-03 深圳前海微众银行股份有限公司 Quality inspection method, device and equipment for customer service recording and storage medium
CN111709630A (en) * 2020-06-08 2020-09-25 深圳乐信软件技术有限公司 Voice quality inspection method, device, equipment and storage medium
CN112364661A (en) * 2020-11-11 2021-02-12 北京大米科技有限公司 Data detection method and device, readable storage medium and electronic equipment


Similar Documents

Publication Publication Date Title
CN110377911B (en) Method and device for identifying intention under dialog framework
WO2021174757A1 (en) Method and apparatus for recognizing emotion in voice, electronic device and computer-readable storage medium
CN112966082A (en) Audio quality inspection method, device, equipment and storage medium
CN111460111A (en) Evaluating retraining recommendations for automatic conversation services
CN112951275B (en) Voice quality inspection method and device, electronic equipment and medium
CN111709630A (en) Voice quality inspection method, device, equipment and storage medium
CN109410986B (en) Emotion recognition method and device and storage medium
CN111177350A (en) Method, device and system for forming dialect of intelligent voice robot
US11630958B2 (en) Determining topic labels for communication transcripts based on a trained generative summarization model
Kopparapu Non-linguistic analysis of call center conversations
CN111177186A (en) Question retrieval-based single sentence intention identification method, device and system
CN113094578A (en) Deep learning-based content recommendation method, device, equipment and storage medium
JP6208794B2 (en) Conversation analyzer, method and computer program
CN113380238A (en) Method for processing audio signal, model training method, apparatus, device and medium
CN111427996B (en) Method and device for extracting date and time from man-machine interaction text
Jia et al. A deep learning system for sentiment analysis of service calls
US20230230585A1 (en) System and method for generating wrap up information
CN113505606B (en) Training information acquisition method and device, electronic equipment and storage medium
CN110580899A (en) Voice recognition method and device, storage medium and computing equipment
CN113850290B (en) Text processing and model training method, device, equipment and storage medium
CN115831125A (en) Speech recognition method, device, equipment, storage medium and product
CN110910905A (en) Mute point detection method and device, storage medium and electronic equipment
CN112735432B (en) Audio identification method, device, electronic equipment and storage medium
CN113763968A (en) Method, apparatus, device, medium and product for recognizing speech
CN112836053A (en) Man-machine conversation emotion analysis method and system for industrial field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination