CN116206593A - Voice quality inspection method, device and equipment - Google Patents

Voice quality inspection method, device and equipment

Info

Publication number
CN116206593A
CN116206593A
Authority
CN
China
Prior art keywords
dialect
voice
determining
emotion
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111453798.1A
Other languages
Chinese (zh)
Inventor
钟天宇
丁俊勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Shanghai ICT Co Ltd
CM Intelligent Mobility Network Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Shanghai ICT Co Ltd
CM Intelligent Mobility Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Shanghai ICT Co Ltd, CM Intelligent Mobility Network Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN202111453798.1A priority Critical patent/CN116206593A/en
Publication of CN116206593A publication Critical patent/CN116206593A/en
Pending legal-status Critical Current


Classifications

    • G10L15/005: Speech recognition; Language recognition
    • G10L15/18: Speech recognition; Speech classification or search using natural language modelling
    • G10L25/18: Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G10L25/60: Speech or voice analysis techniques specially adapted for measuring the quality of voice signals
    • G10L25/63: Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Child & Adolescent Psychology (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a voice quality inspection method and device, and relates to the technical field of speech recognition. The method comprises the following steps: acquiring voice data to be detected, wherein the voice data to be detected comprises a voice sample set with dialect features and a voice signal set to be detected that are input by a user; performing data preprocessing on the voice sample set and determining multi-modal features of the processed voice sample set; determining a speech emotion recognition model according to the multi-modal features; and inputting the voice signal set to be detected into the speech emotion recognition model to determine the emotion state of the user. Compared with existing speech emotion recognition schemes, this technical solution can recognize emotion in multiple dialects and also performs well on local Mandarin with a dialect accent.

Description

Voice quality inspection method, device and equipment
Technical Field
The present disclosure relates to the field of speech recognition technologies, and in particular, to a voice quality inspection method, apparatus, and device.
Background
At present, telephone service is an indispensable part of all kinds of business services, and the services of many traditional industries have shifted from counter service to telephone service. In telephone service, manual service has the advantages of direct and convenient communication, fast handling, and a low learning cost for the customers handling business, and therefore accounts for a large proportion of telephone service. At the same time, many problems occur in the manual telephone service process, such as poor service quality and poor service attitude of telephone agents, as well as abusive language from telephone customers, so voice quality inspection of telephone service dialogues and messages has become an important link in telephone service.
The current voice quality inspection modes mainly comprise the following: 1) The more traditional manual mode: a certain number of samples are randomly drawn from all voice customer service recordings by manual screening, and voices with service problems are found by manual inspection; 2) Text content extraction: the voice dialogue is transcribed, text is then extracted by machine learning and similar methods, and the service quality of the telephone customer service is judged from the extracted text; 3) Keyword matching, illustrated by the sketch after this paragraph: the speech dialogue is generally transcribed and then matched against a pre-constructed sensitive-word dictionary, and a call in which more than a given number of sensitive words appear is judged to show a poor service attitude. At present, apart from industries with older systems or a small customer service volume that still rely on manual inspection, most systems adopt an automatic voice quality inspection mode.
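For illustration only (this code is not part of the original filing), a minimal sketch of the sensitive-word matching in mode 3), assuming a hypothetical sensitive-word dictionary and threshold:

SENSITIVE_WORDS = {"sensitive_word_1", "sensitive_word_2"}   # hypothetical dictionary contents

def flag_poor_attitude(transcript: str, threshold: int = 3) -> bool:
    """Flag a transcribed call when it contains more sensitive words than allowed."""
    hits = sum(transcript.count(word) for word in SENSITIVE_WORDS)
    return hits > threshold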
The prior art has the following problems: 1) A speech recognition result is required at the same time, and the network structure is complex; 2) The effect cannot be guaranteed when the speech recognition result is not very accurate; 3) Dialects or dialogue audio with heavy accents cannot be processed.
Disclosure of Invention
The embodiments of the application provide a voice quality inspection method, device and equipment, which are used to solve the problem that dialects or dialogue audio with heavy accents cannot be processed in the prior art.
In order to solve the technical problems, the application adopts the following technical scheme:
the embodiment of the application provides a voice quality inspection method, which comprises the following steps:
acquiring voice data to be detected; the voice data to be detected comprises a voice sample set with dialect characteristics and a voice signal set to be detected, which are input by a user;
performing data preprocessing on the voice sample set, and determining multi-modal features of the processed voice sample set; the multi-modal features include: an input feature vector, an embedding output feature and a preset feature set; the embedding output feature is determined by encoding with a dialect embedding model and is used for indicating each dialect in the voice sample set;
determining a speech emotion recognition model according to the multi-modal features;
and inputting the voice signal set to be detected into the voice emotion recognition model, and determining the emotion state of the user.
Optionally, inputting the set of to-be-detected voice signals to the voice emotion recognition model, determining the emotion state of the user includes:
According to a preset embedding algorithm and the input feature vector, determining an embedding output feature of a first target dialect in the voice sample set; the first target dialect is one dialect in the set of speech samples;
determining the dialect type of the voice signal set to be detected according to the voice signal set to be detected and the embedding output feature of the first target dialect;
determining an emotion probability vector of a second target dialect in the voice signal set to be detected according to the dialect type and the voice emotion recognition model; wherein the speech emotion recognition model comprises a mixed dialect emotion model and a mandarin emotion model; the second target dialect is one dialect in the voice signal set to be detected;
and determining the emotion result of the voice signal set to be detected according to the emotion probability vector of the second target dialect.
Optionally, determining, according to the dialect category and the speech emotion recognition model, an emotion probability vector of a second target dialect in the speech signal set to be detected includes:
according to the formula:
[Formula image in the original filing: the emotion probability vector P is calculated from P_i, P_0, L_0 and L_i defined below.]
determining an emotion probability vector of a second target dialect in the voice signal set to be detected;
Wherein P is the emotion probability vector of the second target dialect; P_i is the emotion probability vector obtained from the mixed dialect emotion model; P_0 is the emotion probability vector obtained from the Mandarin emotion model; L_0 is the Mandarin centroid distance of the voice sample set; and L_i is the dialect centroid distance of the voice sample set.
Optionally, determining, according to the to-be-detected voice signal set and the embedding output feature of the first target dialect, a dialect category of the to-be-detected voice signal set includes:
determining a first centroid vector of the first target dialect and a discrimination threshold of the first target dialect according to the embedding output feature of the first target dialect;
determining a second centroid vector of a second target dialect in the to-be-detected voice signal set according to the to-be-detected voice signal set and the preset embedding algorithm;
if the distance between the second centroid vector and the corresponding first centroid vector is smaller than the discrimination threshold, determining that the voice signal set to be detected is a mixed dialect type; otherwise, determining the voice signal set to be detected as the Mandarin type.
Optionally, the determining, according to a preset embedding algorithm and the input feature vector, an embedding output feature of the first target dialect in the voice sample set includes:
determining an encoded feature vector according to the dialect embedding model and the preset embedding algorithm;
and performing feature stitching on the encoded feature vector and the input feature vector to determine an embedding output feature of the first target dialect in the voice sample set.
Optionally, constructing and training the dialect embedding model includes:
constructing the dialect embedding model; the input value of the dialect embedding model is the input feature vector, and the output value of the dialect embedding model is the one-hot code corresponding to the dialect to which the voice sample set belongs;
and taking a triplet loss function as the loss function, and training the dialect embedding model according to the voice sample set.
Optionally, the determining a speech emotion recognition model according to the multi-modal feature includes:
performing feature splicing on the input feature vector, the embedding output feature and a preset feature set to determine a spliced feature vector;
determining a voice emotion fusion vector according to the spliced feature vector and a preset recurrent neural network;
and inputting the voice emotion fusion vector to two full-connection layers, calculating the confidence level of emotion types by using a preset softmax function, and determining a voice emotion recognition model.
Optionally, in the step of acquiring the voice data to be detected, acquiring a voice sample set with dialect features includes:
collecting dialect voice dialogue data, and removing voice data without emotion in the dialect voice dialogue data;
unifying the audio format of the filtered dialect voice dialogue data to determine unified audio data;
dividing the unified audio data into a plurality of audio files limited to a preset duration;
and marking the category of the emotion according to all the audio files within the preset duration.
Optionally, determining an input feature vector in the multimodal features of the processed speech sample set includes:
denoising all the marked audio files within a preset time length in a preset filtering mode to obtain denoised audio files;
and determining the input feature vector after preprocessing the voice sample set by a convolution recurrent neural network for the noise-reduced audio file.
Optionally, determining the input feature vector after preprocessing the voice sample set by using a convolutional recurrent neural network for the audio file after noise reduction, including:
performing short-time Fourier transform on the noise-reduced audio file, and determining a processed frequency spectrum signal;
Determining a mel frequency spectrum according to multiplication of the frequency spectrum signal and a mel filter;
and inputting the Mel frequency spectrum into the convolutional recurrent neural network, and determining the input feature vector after preprocessing the voice sample set.
The embodiment of the application also provides a voice quality inspection device, which comprises:
the acquisition module is used for acquiring voice data to be detected; the voice data to be detected comprises a voice sample set with dialect characteristics and a voice signal set to be detected, which are input by a user;
the first determining module is used for performing data preprocessing on the voice sample set and determining the multi-modal features of the processed voice sample set; the multi-modal features include: an input feature vector, an embedding output feature and a preset feature set; the embedding output feature is determined by encoding with a dialect embedding model and is used for indicating each dialect in the voice sample set;
the second determining module is used for determining a voice emotion recognition model according to the multi-modal characteristics;
and the third determining module is used for inputting the voice signal set to be detected into the voice emotion recognition model and determining the emotion state of the user.
The embodiment of the application also provides voice quality inspection equipment, which comprises: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor implements the voice quality inspection method as claimed in any one of the preceding claims.
The embodiment of the application also provides a readable storage medium, wherein a program is stored on the readable storage medium, and the program realizes the voice quality inspection method according to any one of the above when being executed by a processor.
The beneficial effects of this application are:
according to the technical scheme, the emotion state of the user is determined by recognizing the dialect speech data input by the user, so that the purpose of dialect voice quality inspection is achieved. Compared with existing speech emotion recognition schemes, this technical solution can recognize emotion in multiple dialects and also performs well on local Mandarin with a dialect accent.
Drawings
Fig. 1 shows a flow chart of a voice quality inspection method according to an embodiment of the present application;
fig. 2 shows one of schematic block diagrams of a voice quality inspection device according to an embodiment of the present application;
FIG. 3 is a second schematic block diagram of a voice quality testing apparatus according to an embodiment of the present disclosure;
Fig. 4 shows a third block diagram of a voice quality inspection device according to an embodiment of the present application.
Detailed Description
In order to make the technical problems, technical solutions and advantages to be solved by the present application more apparent, the following detailed description will be given with reference to the accompanying drawings and the specific embodiments. In the following description, specific details such as specific configurations and components are provided merely to facilitate a thorough understanding of embodiments of the present application. It will therefore be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the application. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
In various embodiments of the present application, it should be understood that the sequence numbers of the following processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Aiming at the problem that dialects or dialogue audios with heavy accents cannot be processed in the prior art, the application provides a voice quality inspection method, device and equipment.
As shown in fig. 1, an alternative embodiment of the present invention provides a voice quality inspection method, including:
step 100, obtaining voice data to be detected; the voice data to be detected comprises a voice sample set with dialect characteristics and a voice signal set to be detected, which are input by a user;
here, a set of speech samples with dialect features is obtained, the set of speech samples comprising a dialect data set and a dialect dialogue data set with emotion, and track separation and sentence segmentation are performed on the voice sample set. Step 200, data preprocessing is performed on the voice sample set, and multi-modal features of the processed voice sample set are determined; the multi-modal features include: an input feature vector, an embedding output feature and a preset feature set; the embedding output feature is determined by encoding with a dialect embedding model and is used for indicating each dialect in the voice sample set;
In this embodiment, the data preprocessing of the set of voice samples may first determine the input feature vector; an embedding output feature of a first target dialect in the voice sample set is then determined according to a preset embedding algorithm and the input feature vector; the first target dialect is one dialect in the set of speech samples.
It should be noted that the GeMAPS feature set contains 62 features in total, obtained as high-level statistics functions (HSFs) computed on the basis of low-level descriptors (LLDs); an extraction sketch follows this paragraph. The 18 LLDs consist of 6 frequency-related features, 3 energy/amplitude-related features, and 9 spectral features.
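As an illustrative sketch only (the patent does not name a feature extraction toolkit), the GeMAPS functionals could be extracted with the openSMILE Python wrapper, for example:

import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.GeMAPSv01b,       # 62 functionals (HSFs) over the 18 LLDs
    feature_level=opensmile.FeatureLevel.Functionals,
)
gemaps_features = smile.process_file("sample.wav")     # one row of 62 features per file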
Step 300, determining a speech emotion recognition model according to the multi-modal characteristics;
here, a speech emotion recognition model is trained based on the input feature vector, the emmbedding output feature, and a set of preset features. In this embodiment, three feature sources are used as inputs to the speech emotion recognition model, respectively, a spectrogram of the input feature vector, a preset feature set (GeMAPS feature set), and the emmbedding output feature (speech passes through the language code of the previously trained emmbedding model).
Step 400, inputting the voice signal set to be detected to the voice emotion recognition model, and determining the emotion state of the user.
In this embodiment, a set of speech signals to be detected is input to the trained speech emotion recognition model, and a speech recognition result is determined. According to the technical scheme, the recognition result is determined by training a classification network model on top of the embedding encoding, so that speech with dialects can be recognized and the diversity of speech recognition is increased.
Before speech recognition starts, the speech signal is first preprocessed. Preprocessing is completed in three stages: analog-to-digital conversion, endpoint detection, and framing. After the speech signal is framed, it can be analyzed in detail. Framing refers to cutting a complete speech signal into a number of equal-length segments; each cut-out piece of speech is called a frame. Framing is typically implemented with a moving window function (see the sketch after this paragraph), and adjacent frames have an overlapping portion. Typically, a frame is 25 milliseconds long, and the overlap between every two frames is 10 milliseconds. After framing of the voice signal is completed, the characteristic parameters of the speech signal are extracted. The dialects include, but are not limited to: the Northeast dialect, Wu dialect, Hunan dialect, Min dialect, Hakka dialect, Guangdong dialect, and Mandarin.
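A minimal framing sketch following the figures stated above (25-millisecond frames with a 10-millisecond overlap); the Hamming window choice is an assumption:

import numpy as np

def frame_signal(signal: np.ndarray, sample_rate: int,
                 frame_ms: float = 25.0, overlap_ms: float = 10.0) -> np.ndarray:
    """Cut a 1-D speech signal into overlapping, windowed frames."""
    frame_len = int(sample_rate * frame_ms / 1000)            # e.g. 400 samples at 16 kHz
    hop = int(sample_rate * (frame_ms - overlap_ms) / 1000)   # shift between frame starts
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    window = np.hamming(frame_len)                            # window choice is an assumption
    return (np.stack([signal[i * hop: i * hop + frame_len] * window
                      for i in range(n_frames)])
            if n_frames else np.empty((0, frame_len)))        # shape: (n_frames, frame_len)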
Here, the emotion states include, but are not limited to: happiness, anger, sadness, fear, surprise, and neutral.
In summary, by recognizing the obtained dialect speech data input by the user, the emotion state of the user is determined, achieving the purpose of dialect voice quality inspection.
Specifically, the step 400 includes: step 410 to step 440;
wherein, in step 410, according to a preset embedding algorithm and the input feature vector, an embedding output feature of a first target dialect in the voice sample set is determined; the first target dialect is one dialect in the set of speech samples;
in this embodiment, the preset embedding algorithm is the preset embedding encoding. For dialect discrimination, a classification network model is trained on top of the preset embedding encoding: the Mel spectrum is fed into the encoder, the output is the one-hot code corresponding to the dialect class to which the audio belongs, and the weights of the layer before the output layer of the embedding network are taken as the encoding of the language. A certain number of encodings are extracted to obtain a center point, which is spliced with the input feature vectors and fed into the classification network model. Here, a weighted summation approach is used to obtain the embedding center point of a dialect.
Optionally, the step 410 includes:
step 411, determining an encoded feature vector according to the dialect embedding model and the preset embedding algorithm; here, the preset embedding algorithm is the preset embedding encoding.
And step 412, performing feature stitching on the encoded feature vector and the input feature vector, and determining an embedding output feature of the first target dialect in the voice sample set.
In this embodiment, the scheme of performing embedding encoding on different dialects is adopted, so that the technical effect that the dialect embedding model can effectively perform multi-language recognition is achieved.
Specifically, building and training a dialect embedding model includes:
constructing a dialect embedding model; the input value of the dialect embedding model is the input feature vector, and the output value of the dialect embedding model is the one-hot code corresponding to the dialect to which the voice sample set belongs;
In this embodiment, the input value of the dialect embedding model is the Mel spectrum of the input feature vector, the output is the one-hot code corresponding to the dialect class to which the audio belongs, and the weights of the layer before the output layer of the embedding network are used as the encoding of the language.
And taking a triplet loss function as the loss function, and training the dialect embedding model according to the voice sample set.
In this embodiment, the triplet loss is used as the loss function for training the dialect embedding model, so that the distance between data with different labels in a specified dimension can be effectively increased, guaranteeing that different dialects are sufficiently separated and preparing for the subsequent emotion weighting.
The triplet loss formula is L = max(d(a, p) - d(a, n) + margin, 0), where d(a, p) is the distance from the anchor sample to the positive sample and d(a, n) is the distance from the anchor sample to the negative sample. The intention is to make the distance to the positive sample smaller than the distance to the negative sample by at least the margin, that is, to pull a and p closer and push a and n farther apart. Here, 80% of the data in the speech sample set is used as the training set and 20% as the test set; the test set of each language is encoded, and the centroid of each dialect after encoding is obtained: after the test-set data of each dialect are encoded, the mean value of each dimension is used as the centroid in that dimension. Meanwhile, the discrimination threshold of the language is set according to the distances from the encoded data to the centroid; specifically, the Euclidean distance within which 95% of the whole test set falls from the centroid is used as the threshold (a sketch of the triplet loss and threshold computation follows).
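An illustrative sketch of the triplet loss and of the per-dialect centroid and 95% distance threshold described above; array shapes and names are assumptions:

import numpy as np

def triplet_loss(anchor: np.ndarray, positive: np.ndarray, negative: np.ndarray,
                 margin: float = 0.2) -> float:
    """L = max(d(a, p) - d(a, n) + margin, 0), with Euclidean distances."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

def dialect_centroid_and_threshold(embeddings: np.ndarray, percentile: float = 95.0):
    """embeddings: (n_samples, dim) encodings of one dialect's test set."""
    centroid = embeddings.mean(axis=0)                      # per-dimension mean
    dists = np.linalg.norm(embeddings - centroid, axis=1)   # Euclidean distance to centroid
    threshold = np.percentile(dists, percentile)            # 95% of samples fall within it
    return centroid, threshold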
Step 420, determining a dialect class of the set of to-be-detected voice signals according to the set of to-be-detected voice signals and the embedded output characteristics of the first target dialect;
step 430, determining an emotion probability vector of a second target dialect in the set of speech signals to be detected according to the dialect type and the speech emotion recognition model; wherein the speech emotion recognition model comprises a mixed dialect emotion model and a mandarin emotion model; the second target dialect is one dialect in the voice signal set to be detected;
it should be noted that the first target dialect is one dialect in the voice sample set, namely, a dialect in the training set; the second target dialect is one dialect in the set of speech signals to be detected.
Step 440 determines an emotion result of the set of to-be-detected speech signals according to the emotion probability vector of the second target dialect.
In this embodiment, the dialect type of the voice signal set to be detected is determined first. When the result is the mixed dialect type, that is, the speech is judged to be one of the dialects, the output is that of the trained mixed dialect emotion model, and the final result is the probability of each emotion together with the emotion of highest probability. If the result is the Mandarin type, which covers Mandarin with a local accent or entirely Mandarin speech, the output of the Mandarin emotion model is used.
Specifically, step 430 includes:
according to the formula:
[Formula image in the original filing: the emotion probability vector P is calculated from P_i, P_0, L_0 and L_i defined below.]
determining an emotion probability vector of a second target dialect in the voice signal set to be detected;
wherein P is the emotion probability vector of the second target dialect; P_i is the emotion probability vector obtained from the mixed dialect emotion model; P_0 is the emotion probability vector obtained from the Mandarin emotion model; L_0 is the Mandarin centroid distance of the voice sample set; and L_i is the dialect centroid distance of the voice sample set.
In this embodiment, the emotion represented by the speech is determined by the emotion label of the maximum value in the emotion probability vector P, and the proportions of the other emotions are also given, so that the user can conveniently make a judgment.
In many scenes, the speaker does not use dialect purely or use mandarin purely, which causes great trouble to many systems which can only support a single language, so the speech emotion recognition model adds speech emotion result weighting, namely effectively weighting the speech emotion obtained in the model, so that the speech emotion classification tends to be accurate.
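The exact weighting formula appears only as an image in the original filing; the sketch below assumes, for illustration, a distance-weighted average in which a model's probabilities count more when the speech lies closer to that model's centroid (a smaller L gives a larger weight):

import numpy as np

def weighted_emotion_probs(p_dialect: np.ndarray, p_mandarin: np.ndarray,
                           l_dialect: float, l_mandarin: float) -> np.ndarray:
    """Combine P_i (mixed dialect model) and P_0 (Mandarin model) using L_i and L_0."""
    w_dialect, w_mandarin = 1.0 / l_dialect, 1.0 / l_mandarin   # assumed inverse-distance weights
    # the emotion with the largest entry in the returned vector is taken as the final label
    return (w_dialect * p_dialect + w_mandarin * p_mandarin) / (w_dialect + w_mandarin)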
Specifically, the step 420 includes:
Determining a first centroid vector of the first target dialect and a discrimination threshold of the first target dialect according to the embedded output characteristics of the first target dialect;
determining a second centroid vector of a second target dialect in the to-be-detected voice signal set according to the to-be-detected voice signal set and the preset embedding algorithm; here, the preset embedding algorithm is preset embedding coding;
if the distance between the second centroid vector and the corresponding first centroid vector is smaller than the discrimination threshold, determining that the voice signal set to be detected is a mixed dialect type; otherwise, determining the voice signal set to be detected as the Mandarin type.
It should be noted that the centroid vectors of the 8 languages (the Northeast dialect, Wu dialect, Hunan dialect, Min dialect, Hakka dialect, Guangdong dialect and Mandarin) can be obtained by the dialect embedding model, and these vectors form a matrix X, where the first row X_0 of X represents the centroid vector of Mandarin after embedding. Meanwhile, a language discrimination threshold L can be obtained: when the minimum distance between a segment of speech (after embedding) and the centroids is smaller than L, the input speech is determined to belong to the dialect of the nearest centroid; otherwise, it is Mandarin with a local accent. The Euclidean distance is used in the distance calculation.
In this embodiment, the dialect type of the to-be-detected voice signal set is determined by determining a first centroid vector of the first target dialect, a discrimination threshold of the first target dialect, and a second centroid vector of the second target dialect in the to-be-detected voice signal set, and determining a relationship between the first centroid vector and the second centroid vector.
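An illustrative nearest-centroid decision sketch based on the description above, where X holds the per-dialect centroid vectors (row 0 being Mandarin) and L is the discrimination threshold; names are assumptions:

import numpy as np

def classify_dialect(embedding: np.ndarray, X: np.ndarray, L: float) -> int:
    """Return the row index of the nearest centroid, or 0 (Mandarin) if none is within L."""
    dists = np.linalg.norm(X - embedding, axis=1)   # Euclidean distance to every centroid
    nearest = int(np.argmin(dists))
    return nearest if dists[nearest] < L else 0     # row 0 of X is the Mandarin centroid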
Optionally, the step 300 includes:
the input feature vector, the embedded output feature and a preset feature set are subjected to feature stitching, and the stitched feature vector is determined;
determining a voice emotion fusion vector according to the spliced feature vector and a preset recurrent neural network;
and inputting the voice emotion fusion vector to two full-connection layers, calculating the confidence level of emotion types by using a preset softmax function, and determining a voice emotion recognition model.
Extracting features of the spectrogram with a CNN (convolutional neural network), unfolding the extracted features along the time dimension, splicing them with the GeMAPS features and the embedding output features, using an RNN (recurrent neural network) to make full use of the temporal connections of the features, and finally adding a DNN (deep neural network) structure and a preset classifier to train the speech emotion recognition model; the preset classifier is preferably a softmax classifier.
Specifically, in order to better adapt to scenes where Mandarin and dialects are mixed, two models are trained: one is trained on all dialect emotion data mixed together, and the other is trained on Mandarin alone, which works well because of the large amount of Mandarin data. The specific steps, sketched in code after this paragraph, are as follows: 1. Three types of features are extracted from the preprocessed audio data set: a Mel spectrogram is generated, the GeMAPS feature set is extracted, and the corresponding vectors are generated by the previously trained embedding network. 2. After the spectrogram features pass through the CNN (convolutional neural network), they are unfolded and spliced with the GeMAPS feature set and the embedding features. 3. The spliced features are input into the RNN (recurrent neural network). 4. After a 2-layer fully connected network, a softmax classifier is attached, and the output is the corresponding emotion label.
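An illustrative sketch (in PyTorch, with layer sizes chosen as assumptions) of steps 1 to 4 above: CNN over the Mel spectrogram, splicing with the GeMAPS and dialect embedding vectors, an RNN over time, and two fully connected layers with a softmax output:

import torch
import torch.nn as nn

class SpeechEmotionNet(nn.Module):
    def __init__(self, n_mels=64, gemaps_dim=62, embed_dim=128, n_emotions=6):
        super().__init__()
        # CNN over the Mel spectrogram (pools along frequency only, keeps the time axis)
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d((2, 1)),
        )
        cnn_out = 32 * (n_mels // 4)
        self.rnn = nn.GRU(cnn_out + gemaps_dim + embed_dim, 128, batch_first=True)
        self.fc = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, n_emotions))

    def forward(self, mel, gemaps, dialect_embed):
        # mel: (B, 1, n_mels, T); gemaps: (B, gemaps_dim); dialect_embed: (B, embed_dim)
        feats = self.cnn(mel)                               # (B, 32, n_mels // 4, T)
        B, C, F, T = feats.shape
        feats = feats.permute(0, 3, 1, 2).reshape(B, T, C * F)
        extra = torch.cat([gemaps, dialect_embed], dim=1)   # splice GeMAPS and embedding features
        extra = extra.unsqueeze(1).expand(-1, T, -1)        # repeat along the time axis
        out, _ = self.rnn(torch.cat([feats, extra], dim=2)) # RNN over the spliced sequence
        logits = self.fc(out[:, -1])                        # last time step, then 2 fully connected layers
        return torch.softmax(logits, dim=1)                 # confidence of each emotion class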
Optionally, the step 100 of obtaining a set of voice samples with dialect features includes:
collecting dialect voice dialogue data, and removing voice data without emotion in the dialect voice dialogue data;
unifying the audio format of the filtered dialect voice dialogue data to determine unified audio data;
dividing the unified audio data into a plurality of audio files limited to a preset duration;
and marking the category of the emotion according to all the audio files within the preset duration.
In this embodiment, dialect dialogue data is preferably collected from telephone dialogues; for ease of processing, a single speaker uses only one dialect within a single dialogue. Further, after the unified audio data is determined, each dialogue is divided to avoid a single overly long piece of speech, that is, it is divided into a plurality of audio files limited to a preset duration. Sentence splitting is mainly based on silence detection (VAD); the most common limit is a duration of no more than 10 seconds, and the split point closest to 10 seconds is taken (see the sketch after this paragraph). Finally, the emotion category of each audio file within the preset duration is labeled, the dialect of the sentence data is indicated with a number, and the collection of the dialect encoding data set is completed.
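An illustrative sketch of the segmentation rule above, assuming the VAD sentence boundaries are given in seconds; sentences are merged greedily so that each clip stays within the 10-second limit:

def split_into_clips(sentence_bounds, max_len=10.0):
    """sentence_bounds: ordered list of (start, end) pairs, in seconds, produced by VAD."""
    clips, clip_start, clip_end = [], None, None
    for start, end in sentence_bounds:
        if clip_start is None:
            clip_start, clip_end = start, end
        elif end - clip_start <= max_len:        # still within the 10-second limit
            clip_end = end
        else:                                    # cut at the sentence boundary closest to 10 s
            clips.append((clip_start, clip_end))
            clip_start, clip_end = start, end
    if clip_start is not None:
        clips.append((clip_start, clip_end))
    return clips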
Optionally, determining the input feature vector of step 200 includes:
denoising all the marked audio files within a preset time length in a preset filtering mode to obtain denoised audio files;
And determining the input feature vector after preprocessing the voice sample set by a convolution recurrent neural network for the noise-reduced audio file.
In this embodiment, because noise is generally present in the speech signal during acquisition, and because the speech signal is unevenly distributed in the frequency domain, some preprocessing is needed: all labeled audio files within the preset duration are denoised with a preset filtering method to obtain the denoised audio files. In terms of speech noise reduction and enhancement, the available methods include, but are not limited to: LMS (least mean squares) adaptive filtering, basic spectral subtraction, Wiener filtering, and the like; specifically, Kalman filtering is preferred for speech enhancement in the present invention.
It should be noted that the Kalman filter is a linear minimum mean square estimator for the states of a discrete linear system. Its basic idea is to predict first and then correct, estimating the system through a time update equation and an observation update equation; in speech system modeling, an estimate of the actual speech signal is ultimately obtained. Its advantages are that not all past observations are required for the estimation, making it suitable for real-time processing, and that the computation is recursive, making it suitable for computer implementation. Other preprocessing steps include pre-emphasis for the uneven distribution across frequencies, as well as framing, windowing, endpoint detection, and short-time analysis of the speech signal, because the entire speech cannot be processed at once.
In this embodiment, the input feature vector after preprocessing of the speech sample set is determined for the denoised audio file by a convolutional recurrent neural network (CRNN). The audio signal is first converted into a spectrogram as the input source of the deep learning network for feature extraction. The main advantages are that the network input retains most of the information in the speech, the tedious procedure of secondary screening of hand-crafted feature sets is avoided, and a good recognition effect is obtained.
Specifically, determining the input feature vector after preprocessing the voice sample set by a convolution recurrent neural network for the noise-reduced audio file, including:
performing short-time Fourier transform on the noise-reduced audio file, and determining a processed frequency spectrum signal;
determining a mel frequency spectrum according to multiplication of the frequency spectrum signal and a mel filter;
and inputting the Mel frequency spectrum into the convolutional recurrent neural network, and determining the input feature vector after preprocessing the voice sample set.
In this embodiment, the steps of obtaining the Mel spectrum are: performing a short-time Fourier transform (STFT) on the signal, and multiplying the generated spectrum signal by a Mel filter bank to obtain the Mel spectrum (see the sketch after this paragraph). Further, after the Mel spectrum is obtained, a CRNN network model is built, the Mel spectrum is input into the convolutional recurrent neural network with the emotion category as the output of the network, the network training is completed, and the input feature vector after preprocessing of the voice sample set is determined. In the present invention, feature extraction is performed on the Mel spectrogram through the CRNN, and a model training scheme using multiple dialects is adopted, which effectively improves the robustness of the voice quality inspection system and achieves the technical effect of supporting automatic voice quality inspection for multiple dialects.
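An illustrative sketch of the Mel-spectrum steps above using librosa; the toolkit and the FFT and filter-bank parameters are assumptions, not taken from the patent:

import numpy as np
import librosa

y, sr = librosa.load("clip.wav", sr=16000)                           # denoised audio file
stft = librosa.stft(y, n_fft=512, hop_length=160, win_length=400)    # short-time Fourier transform
power_spec = np.abs(stft) ** 2                                       # processed spectrum signal
mel_fb = librosa.filters.mel(sr=sr, n_fft=512, n_mels=64)            # Mel filter bank
mel_spec = mel_fb @ power_spec                                       # Mel spectrum = filter bank x spectrum
log_mel = librosa.power_to_db(mel_spec)                              # input to the CRNN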
As shown in fig. 2, an alternative embodiment of the present invention further provides a voice quality inspection apparatus, including:
the voice collecting and transmitting module 01 is used for collecting and transmitting voice data of the voice customer service telephone;
the voice storage module 02 is used for receiving the voice data transmitted by the voice transmission module and storing the voice data into the server;
the voice preprocessing module 03 performs voice preprocessing on voice data to be subjected to quality inspection;
the AI (artificial intelligence) algorithm module 04 obtains the multi-dialect emotion recognition result through the AI algorithm model on the data which has been subjected to the preprocessing.
The result feedback module 05 receives the recognition result of the AI algorithm module, records the emotion in each piece of data, and reports the dialog containing the non-compliant emotion.
As shown in fig. 3, in particular, the voice quality inspection device may include: an audio collection module, a communication bus, a preprocessing module, a feature extraction module, a trained AI model, and a result output module.
In this embodiment, the audio collection module is mainly configured to collect the voice to be recognized through a front-end hardware device, such as a mobile phone, a wired phone, and the like.
In this embodiment, the result output module is a server for storing customer service voice recognition results, and a database for recording non-compliance emotion dialogues, and is used for querying and displaying emotion recognition results.
In this embodiment, the communication bus is mainly used for communication between modules, including a Serial Peripheral Interface (SPI) communication bus and an I2C (Inter-Integrated Circuit, two-wire serial bus) communication bus, and is mainly used for uploading the collected audio data to a cloud server and sending the recognition result to the result output module.
In this embodiment, the preprocessing module mainly performs the voice noise reduction enhancement and the preprocessing operation on the received voice signal, so that the following functional modules can smoothly perform.
In this embodiment, the feature extraction module mainly performs feature extraction on the preprocessed voice data, chiefly generating the Mel spectrum and extracting the GeMAPS feature set described herein.
In this embodiment, the trained AI model is mainly used for identifying and encoding different dialects based on the features obtained by the feature extraction module, weighting the emotion of mandarin with dialects, and identifying emotion of speech by using the model obtained by training data.
As shown in fig. 4, an alternative embodiment of the present invention further provides a voice quality inspection apparatus, including:
an acquisition module 10, configured to acquire voice data to be detected; the voice data to be detected comprises a voice sample set with dialect characteristics and a voice signal set to be detected, which are input by a user;
A first determining module 20, configured to perform data preprocessing on the voice sample set and determine the multi-modal features of the processed voice sample set; the multi-modal features include: an input feature vector, an embedding output feature and a preset feature set; the embedding output feature is determined by encoding with a dialect embedding model and is used for indicating each dialect in the voice sample set;
a second determining module 30, configured to determine a speech emotion recognition model according to the multi-modal feature;
a third determining module 40, configured to input the set of to-be-detected voice signals to the voice emotion recognition model, and determine an emotion state of the user.
Optionally, the third determining module 40 includes:
the first determining unit is used for determining the embedding output feature of the first target dialect in the voice sample set according to a preset embedding algorithm and the input feature vector; the first target dialect is one dialect in the set of speech samples;
the second determining unit is used for determining the dialect type of the voice signal set to be detected according to the voice signal set to be detected and the embedding output feature of the first target dialect;
A third determining unit, configured to determine, according to the dialect category and the speech emotion recognition model, an emotion probability vector of a second target dialect in the speech signal set to be detected; wherein the speech emotion recognition model comprises a mixed dialect emotion model and a mandarin emotion model; the second target dialect is one dialect in the voice signal set to be detected;
and the fourth determining unit is used for determining the emotion result of the voice signal set to be detected according to the emotion probability vector of the second target dialect.
Specifically, the third determining unit is configured to:
according to the formula:
[Formula image in the original filing: the emotion probability vector P is calculated from P_i, P_0, L_0 and L_i defined below.]
determining an emotion probability vector of a second target dialect in the voice signal set to be detected;
wherein P is the emotion probability vector of the second target dialect; P_i is the emotion probability vector obtained from the mixed dialect emotion model; P_0 is the emotion probability vector obtained from the Mandarin emotion model; L_0 is the Mandarin centroid distance of the voice sample set; and L_i is the dialect centroid distance of the voice sample set.
Optionally, the second determining unit includes:
the first determining subunit is used for determining a first centroid vector of the first target dialect and a discrimination threshold of the first target dialect according to the embedding output feature of the first target dialect;
The second determining subunit is configured to determine a second centroid vector of a second target dialect in the to-be-detected voice signal set according to the to-be-detected voice signal set and the preset embedding algorithm;
a third determining subunit, configured to determine that the set of to-be-detected speech signals is a mixed dialect type if a distance between the second centroid vector and the corresponding first centroid vector is smaller than the discrimination threshold; otherwise, determining the voice signal set to be detected as the Mandarin type.
Optionally, the first determining unit includes:
a fourth determining subunit, configured to determine an encoded feature vector according to the dialect embedding model and the preset embedding algorithm;
and a fifth determining subunit, configured to perform feature stitching on the encoded feature vector and the input feature vector, and determine an embedding output feature of the first target dialect in the speech sample set.
Optionally, the apparatus further includes:
a model is built and is used for building the dialect casting model; the input value of the dialect empdding model is the input feature vector, and the output value of the dialect empdding model is the single-heat code corresponding to the dialect to which the voice sample set belongs;
And training a model, namely taking a triple loss function as a loss function, and training the dialect casting model according to the voice sample set.
Optionally, the second determining module 30 includes:
a fifth determining unit, configured to perform feature splicing on the input feature vector, the embedding output feature and a preset feature set, and determine a spliced feature vector;
a sixth determining unit, configured to determine a speech emotion fusion vector according to the spliced feature vector and a preset recurrent neural network;
and a seventh determining unit, configured to input the speech emotion fusion vector to two full connection layers, and calculate confidence level of emotion type by using a preset softmax function, so as to determine a speech emotion recognition model.
Optionally, the acquiring module 10 includes:
the collecting unit is used for collecting dialect voice dialogue data and removing voice data without emotion in the dialect voice dialogue data;
an eighth determining unit, configured to unify the audio format of the filtered dialect voice dialogue data and determine the unified audio data;
a ninth determining unit, configured to divide the unified audio data and determine the unified audio data as audio files with multiple segments limited within a preset duration;
The marking unit is used for marking the category of the emotion according to all the audio files within the preset duration.
Optionally, the first determining module 20 includes:
the obtaining unit is used for carrying out noise reduction treatment on all the marked audio files within the preset time length in a preset filtering mode to obtain noise-reduced audio files;
and a tenth determining unit, configured to determine, for the denoised audio file, an input feature vector after preprocessing the speech sample set through a convolutional recurrent neural network.
Optionally, the tenth determining unit includes:
a sixth determining subunit, configured to perform short-time fourier transform on the noise-reduced audio file, and determine a processed spectrum signal;
a seventh determining subunit, configured to determine a mel spectrum according to multiplication of the spectrum signal and a mel filter;
and an eighth determining subunit, configured to input the mel spectrum into the convolutional recurrent neural network, and determine an input feature vector after preprocessing the speech sample set.
An optional embodiment of the present invention further provides a voice quality inspection apparatus, including: a processor, a memory and a program stored on the memory and executable on the processor, which when executed by the processor implements the voice quality inspection method as claimed in any one of the preceding claims.
An optional embodiment of the present invention further provides a readable storage medium, where a program is stored, where the program, when executed by a processor, implements each process of the foregoing any one of the embodiments of the voice quality inspection method, and the same technical effect can be achieved, and in order to avoid repetition, a description is omitted herein. Wherein the readable storage medium is selected from Read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the foregoing is directed to the preferred embodiments of the present application, it should be noted that modifications and adaptations to those embodiments may be made by one of ordinary skill in the art without departing from the principles set forth herein and are intended to be within the scope of the present application.

Claims (13)

1. A method for voice quality testing, comprising:
acquiring voice data to be detected; the voice data to be detected comprises a voice sample set with dialect characteristics and a voice signal set to be detected, which are input by a user;
performing data preprocessing on the voice sample set, and determining multi-modal features of the processed voice sample set; the multi-modal features include: an input feature vector, an embedding output feature and a preset feature set; the embedding output feature is determined by encoding with a dialect embedding model and is used for indicating each dialect in the voice sample set;
determining a speech emotion recognition model according to the multi-modal features;
and inputting the voice signal set to be detected into the voice emotion recognition model, and determining the emotion state of the user.
2. The method of claim 1, wherein inputting the set of speech signals to be detected to the speech emotion recognition model, determining the emotional state of the user comprises:
According to a preset embedding algorithm and the input feature vector, determining an embedding output feature of a first target dialect in the voice sample set; the first target dialect is one dialect in the set of speech samples;
determining the dialect type of the voice signal set to be detected according to the voice signal set to be detected and the embedding output feature of the first target dialect;
determining an emotion probability vector of a second target dialect in the voice signal set to be detected according to the dialect type and the voice emotion recognition model; wherein the speech emotion recognition model comprises a mixed dialect emotion model and a mandarin emotion model; the second target dialect is one dialect in the voice signal set to be detected;
and determining the emotion result of the voice signal set to be detected according to the emotion probability vector of the second target dialect.
3. The method of claim 2, wherein determining the emotion probability vector for the second target dialect in the set of speech signals to be detected based on the dialect class and the speech emotion recognition model comprises:
according to the formula:
[Formula image in the original filing: the emotion probability vector P is calculated from P_i, P_0, L_0 and L_i defined below.]
Determining an emotion probability vector of a second target dialect in the voice signal set to be detected;
wherein P is the emotion probability vector of the second target dialect; P_i is the emotion probability vector obtained from the mixed dialect emotion model; P_0 is the emotion probability vector obtained from the Mandarin emotion model; L_0 is the Mandarin centroid distance of the voice sample set; and L_i is the dialect centroid distance of the voice sample set.
4. The method of claim 2, wherein determining the dialect class of the set of speech signals to be detected based on the set of speech signals to be detected and the embedding output feature of the first target dialect comprises:
determining a first centroid vector of the first target dialect and a discrimination threshold of the first target dialect according to the embedding output feature of the first target dialect;
determining a second centroid vector of a second target dialect in the to-be-detected voice signal set according to the to-be-detected voice signal set and the preset embedding algorithm;
if the distance between the second centroid vector and the corresponding first centroid vector is smaller than the discrimination threshold, determining that the voice signal set to be detected is a mixed dialect type; otherwise, determining the voice signal set to be detected as the Mandarin type.
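The decision rule of claim 4 reduces to a centroid-distance comparison against a discrimination threshold. A minimal Python sketch, with the embeddings, reference centroid and threshold treated as given inputs (all names and values are illustrative):

import numpy as np

def classify_dialect(test_embeddings, dialect_centroid, threshold):
    """Mixed-dialect vs. Mandarin decision in the spirit of claim 4."""
    test_centroid = np.mean(test_embeddings, axis=0)          # second centroid vector
    distance = np.linalg.norm(test_centroid - dialect_centroid)
    return "mixed_dialect" if distance < threshold else "mandarin"

dialect_centroid = np.array([0.2, 0.8, 0.1])
test_embeddings = np.array([[0.25, 0.75, 0.15], [0.18, 0.82, 0.05]])
print(classify_dialect(test_embeddings, dialect_centroid, threshold=0.3))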
5. The method of claim 2, wherein the determining the embedding output feature of the first target dialect in the voice sample set according to a preset embedding algorithm and the input feature vector comprises:
determining an encoded feature vector according to the dialect embedding model and the preset embedding algorithm;
and performing feature stitching on the encoded feature vector and the input feature vector to determine the embedding output feature of the first target dialect in the voice sample set.
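Assuming that "feature stitching" denotes vector concatenation, the construction of the embedding output feature in claim 5 can be sketched as follows (function name is illustrative):

import numpy as np

def stitch_embedding_output(encoded_vec, input_vec):
    """Concatenate the vector produced by the dialect embedding model
    with the input feature vector (one reading of 'feature stitching')."""
    return np.concatenate([np.ravel(encoded_vec), np.ravel(input_vec)])

print(stitch_embedding_output(np.ones(4), np.zeros(6)).shape)   # (10,)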
6. The method of claim 1, wherein constructing and training the dialect embedding model comprises:
constructing the dialect embedding model; the input value of the dialect embedding model is the input feature vector, and the output value of the dialect embedding model is the one-hot code corresponding to the dialect to which the voice sample set belongs;
and taking a triplet loss function as the loss function, and training the dialect embedding model according to the voice sample set.
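Claim 6 trains the dialect embedding model with a triplet loss. The PyTorch sketch below shows only that training step; the layer sizes, the optimizer and the omission of the one-hot output head are assumptions of the sketch, not details from the application.

import torch
import torch.nn as nn

# Hypothetical dialect embedding network: input feature vector in,
# fixed-size embedding out.
embed = nn.Sequential(nn.Linear(40, 64), nn.ReLU(), nn.Linear(64, 16))
triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)
opt = torch.optim.Adam(embed.parameters(), lr=1e-3)

# Anchor and positive share a dialect; the negative comes from another dialect.
anchor, positive, negative = (torch.randn(8, 40) for _ in range(3))
loss = triplet_loss(embed(anchor), embed(positive), embed(negative))
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))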
7. The method of claim 1, wherein the determining a speech emotion recognition model according to the multi-modal features comprises:
performing feature stitching on the input feature vector, the embedding output feature and the preset feature set, and determining a stitched feature vector;
determining a voice emotion fusion vector according to the stitched feature vector and a preset recurrent neural network;
and inputting the voice emotion fusion vector into two fully-connected layers, calculating confidence levels of emotion categories by using a preset softmax function, and determining the speech emotion recognition model.
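Claim 7 stitches the features, passes them through a recurrent network to obtain a fusion vector, and applies two fully-connected layers with a softmax. A compact PyTorch sketch of such a head, with all layer sizes chosen arbitrarily for illustration:

import torch
import torch.nn as nn

class EmotionHead(nn.Module):
    """Stitched features -> recurrent layer -> fusion vector ->
    two fully-connected layers -> softmax confidences (illustrative sizes)."""
    def __init__(self, feat_dim=80, hidden=64, n_emotions=4):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.fc1 = nn.Linear(hidden, 32)
        self.fc2 = nn.Linear(32, n_emotions)

    def forward(self, stitched_seq):                 # (batch, time, feat_dim)
        _, h = self.rnn(stitched_seq)                # fusion vector = last hidden state
        fusion = h[-1]
        return torch.softmax(self.fc2(torch.relu(self.fc1(fusion))), dim=-1)

probs = EmotionHead()(torch.randn(2, 50, 80))
print(probs.shape, probs.sum(dim=-1))               # (2, 4), each row sums to 1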
8. The method according to claim 1, wherein obtaining the voice sample set with dialect characteristics in the acquiring voice data to be detected comprises:
collecting dialect voice dialogue data, and removing voice data without emotion in the dialect voice dialogue data;
unifying the audio format of the dialect voice dialogue data after the removal, and determining unified audio data;
segmenting the unified audio data, and determining a plurality of audio files each limited to within a preset duration;
and labeling the emotion category for all the audio files within the preset duration.
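Of the preparation steps in claim 8, the segmentation into clips bounded by a preset duration is the easiest to illustrate; format unification and the removal of emotion-free speech are not shown. A minimal sketch in which the sample rate and the 10-second limit are assumptions:

import numpy as np

def segment_audio(wave, sr=16000, max_seconds=10.0):
    """Split a mono waveform into clips no longer than the preset duration;
    each clip's emotion category is labeled afterwards."""
    clip_len = int(sr * max_seconds)
    return [wave[i:i + clip_len] for i in range(0, len(wave), clip_len)]

clips = segment_audio(np.zeros(16000 * 25))
print([len(c) / 16000 for c in clips])   # [10.0, 10.0, 5.0]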
9. The method of claim 1, wherein determining the input feature vector in the multi-modal features of the processed voice sample set comprises:
denoising all the labeled audio files within the preset duration by a preset filtering mode to obtain denoised audio files;
and determining, for the denoised audio files, the input feature vector of the preprocessed voice sample set by a convolutional recurrent neural network.
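The claim leaves the "preset filtering mode" open. As one plausible example only, a Butterworth band-pass over the main speech band could serve as the denoising filter; the band edges below are assumptions, not values from the application.

import numpy as np
from scipy.signal import butter, filtfilt

def denoise(wave, sr=16000, low=80.0, high=3800.0):
    """Example 'preset filtering mode': a 4th-order Butterworth band-pass
    keeping roughly the speech band (assumed parameters)."""
    b, a = butter(4, [low, high], btype="band", fs=sr)
    return filtfilt(b, a, wave)

noisy = np.random.default_rng(0).standard_normal(16000)
print(denoise(noisy).shape)   # (16000,)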
10. The method of claim 9, wherein determining, for the denoised audio files, the input feature vector of the preprocessed voice sample set by the convolutional recurrent neural network comprises:
performing a short-time Fourier transform on the denoised audio files, and determining a processed spectrum signal;
determining a Mel spectrum by multiplying the spectrum signal by a Mel filter bank;
and inputting the Mel spectrum into the convolutional recurrent neural network, and determining the input feature vector of the preprocessed voice sample set.
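The preprocessing chain of claim 10 (short-time Fourier transform, then multiplication by a Mel filter bank) corresponds to computing a Mel spectrogram, which would then feed the convolutional recurrent neural network. A sketch using librosa, with illustrative frame and filter sizes:

import numpy as np
import librosa

def mel_spectrogram(wave, sr=16000, n_fft=400, hop=160, n_mels=40):
    """STFT followed by multiplication with a Mel filter bank
    (frame and filter sizes are assumptions of this sketch)."""
    spec = np.abs(librosa.stft(wave, n_fft=n_fft, hop_length=hop)) ** 2   # power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    return mel_fb @ spec                                                  # (n_mels, frames)

mel = mel_spectrogram(np.random.default_rng(0).standard_normal(16000))
print(mel.shape)   # (40, 101)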
11. A voice quality inspection device, comprising:
an acquisition module, configured to acquire voice data to be detected; the voice data to be detected comprises a voice sample set with dialect characteristics and a voice signal set to be detected, both input by a user;
a first determining module, configured to perform data preprocessing on the voice sample set and determine multi-modal features of the processed voice sample set; the multi-modal features comprise: an input feature vector, an embedding output feature and a preset feature set; the embedding output feature is determined by encoding with a dialect embedding model and is used for indicating each dialect in the voice sample set;
a second determining module, configured to determine a speech emotion recognition model according to the multi-modal features;
and a third determining module, configured to input the voice signal set to be detected into the speech emotion recognition model and determine an emotional state of the user.
12. A voice quality inspection apparatus, comprising: a processor, a memory, and a program stored on the memory and executable on the processor, wherein the program, when executed by the processor, implements the voice quality inspection method of any one of claims 1 to 10.
13. A readable storage medium, characterized in that the readable storage medium has stored thereon a program which, when executed by a processor, implements the voice quality inspection method according to any one of claims 1 to 10.
CN202111453798.1A 2021-12-01 2021-12-01 Voice quality inspection method, device and equipment Pending CN116206593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111453798.1A CN116206593A (en) 2021-12-01 2021-12-01 Voice quality inspection method, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111453798.1A CN116206593A (en) 2021-12-01 2021-12-01 Voice quality inspection method, device and equipment

Publications (1)

Publication Number Publication Date
CN116206593A true CN116206593A (en) 2023-06-02

Family

ID=86511782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111453798.1A Pending CN116206593A (en) 2021-12-01 2021-12-01 Voice quality inspection method, device and equipment

Country Status (1)

Country Link
CN (1) CN116206593A (en)

Similar Documents

Publication Publication Date Title
CN109817213B (en) Method, device and equipment for performing voice recognition on self-adaptive language
CN109741732B (en) Named entity recognition method, named entity recognition device, equipment and medium
WO2021128741A1 (en) Voice emotion fluctuation analysis method and apparatus, and computer device and storage medium
CN111916111B (en) Intelligent voice outbound method and device with emotion, server and storage medium
CN110473566A (en) Audio separation method, device, electronic equipment and computer readable storage medium
CN110782872A (en) Language identification method and device based on deep convolutional recurrent neural network
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN110827801A (en) Automatic voice recognition method and system based on artificial intelligence
CN107731233A (en) A kind of method for recognizing sound-groove based on RNN
CN113707125B (en) Training method and device for multi-language speech synthesis model
CN108899047A (en) The masking threshold estimation method, apparatus and storage medium of audio signal
CN112581963B (en) Voice intention recognition method and system
Ismail et al. MFCC-VQ approach for qalqalah tajweed rule checking
CN110600014A (en) Model training method and device, storage medium and electronic equipment
CN110782902A (en) Audio data determination method, apparatus, device and medium
Verma et al. Indian language identification using k-means clustering and support vector machine (SVM)
CN111091809A (en) Regional accent recognition method and device based on depth feature fusion
CN115171731A (en) Emotion category determination method, device and equipment and readable storage medium
CN106297769B (en) A kind of distinctive feature extracting method applied to languages identification
CN114627896A (en) Voice evaluation method, device, equipment and storage medium
CN114927126A (en) Scheme output method, device and equipment based on semantic analysis and storage medium
Dave et al. Speech recognition: A review
Birla A robust unsupervised pattern discovery and clustering of speech signals
CN112885379A (en) Customer service voice evaluation method, system, device and storage medium
CN112231440A (en) Voice search method based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination