CN116110437A - Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics
- Publication number: CN116110437A
- Application number: CN202310395720.1A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G10L25/66 - Speech or voice analysis techniques specially adapted for extracting parameters related to health condition
- A61B5/4803 - Speech analysis specially adapted for diagnostic purposes
- G06N3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G10L15/02 - Feature extraction for speech recognition; selection of recognition unit
- G10L15/16 - Speech classification or search using artificial neural networks
- G10L25/24 - Speech or voice analysis techniques characterised by the extracted parameters being the cepstrum
- G10L25/30 - Speech or voice analysis techniques characterised by the analysis technique using neural networks
- Y02P90/30 - Computing systems specially adapted for manufacturing
Abstract
The invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features. Pathological voice is input, two voice features, the spectrogram and mel-frequency cepstral coefficients, are extracted, and feature fusion is performed; the fused voice features serve as input for extracting temporal information and predicting frame-level scores; the mel-frequency cepstral coefficient features serve as input for extracting speaker features; the voice features obtained after the temporal-information extraction and the speaker features then serve as inputs for feature fusion, yielding the prediction of the speech-level quality score. By extracting voice features and speaker features from pathological voice, fusing them, and finally performing score prediction, the invention finds the mapping between pathological voice and its corresponding subjective quality score, realizing objective, quantitative evaluation of pathological voice quality.
Description
Technical Field
The invention belongs to the technical field of pathological voice quality evaluation, and particularly relates to a pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics.
Background
With the accelerating pace of modern life, poor vocal habits, lifestyle habits, and vocal abuse have made voice disorders increasingly common. Voice disorders impede communication, isolate individuals from society, and can even lead to conditions such as depression, with substantial physiological and psychological consequences. Effective diagnosis and treatment of voice disorders is therefore receiving growing attention, and accurate, quantitative evaluation of pathological voice quality plays an important role in that diagnosis and treatment.
Pathological voice quality evaluation is a new direction in voice-disorder research that derives a quality score from analysis of the pathological voice signal. Its main methods divide into subjective perceptual evaluation and objective acoustic analysis. Clinically, pathological voice diagnosis currently relies mainly on subjective perceptual evaluation, in which several physicians listen to a patient's voice and assign a mean opinion score (MOS) against a subjective rating standard to measure pathological voice quality. This mode of evaluation is subjective and inconsistent, however, because physicians differ in experience and auditory perception, and it is poorly repeatable and expensive.
Objective acoustic evaluation can be further divided into referenced and non-referenced modes. Referenced objective evaluation requires both the original voice and the distorted voice, strictly aligned, which limits its use in practice, whereas non-referenced objective evaluation obtains the quality score from the distorted voice alone. In recent years, the development of deep learning has made accurate, end-to-end non-referenced evaluation feasible, but work in this direction has so far targeted normal voices.
Current research on pathological voice focuses mainly on detecting healthy versus pathological voice and on classifying pathological symptoms; research on the perceived quality of pathological voice remains scarce. Although some studies have explored the relationship between objective parameters and pathological voice quality, such indicators support only qualitative analysis. Few researchers have used deep learning to evaluate pathological voice quality quantitatively.
Exploring the mapping between pathological voice quality and subjective MOS scores with deep learning, and building a deep-learning-based pathological voice quality evaluation model, therefore has significant research and practical value.
Disclosure of Invention
In view of the above, the invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features, to overcome the poor repeatability and high cost of subjective quality evaluation and to fill the research gap in objective, quantitative evaluation of pathological voice quality.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics comprises the following steps:
step 1: inputting pathological voice, extracting two voice features of a spectrogram and a Mel frequency cepstrum coefficient, and carrying out feature fusion;
step 2: taking the fused voice characteristics as input, extracting time information and predicting frame-level scores;
step 3: taking the voice characteristics of the mel frequency cepstrum coefficient as input to extract the characteristics of a speaker;
step 4: and (3) taking the voice characteristics obtained after the time information extraction in the step (2) and the speaker characteristics obtained in the step (3) as inputs, and carrying out characteristic fusion to obtain the prediction of the speech-level quality score.
Further, in the step 1,
extracting spectrogram voice characteristics comprises framing pathological voice, adding a hamming window and performing short-time Fourier transform to obtain an amplitude spectrogram of voice signals;
extracting the mel-frequency cepstral coefficient voice characteristics comprises applying p mel filters to the pathological voice, obtaining m mel-frequency cepstral coefficients per frame.
Further, in the step 1, feature fusion of the two voice features comprises:
inputting the obtained spectrogram into a convolution module formed by stacking k convolution layers, finally obtaining a feature vector for each frame; the process is expressed as
$$h_1 = \sigma(W_1 * U + b_1), \qquad h_l = \sigma(W_l * h_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein U is the input spectrogram of the pathological voice, σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is h_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, h_{l-1} is the output of the (l-1)-th convolution layer, and h_l is the output of the l-th convolution layer;
concatenating the per-frame output of the convolution module with the m-dimensional mel-frequency cepstral coefficients of the corresponding frame to obtain the preliminarily fused frame-level features F.
Further, the step 2 specifically includes:
step 201: taking F as input and passing it through the input embedding layer to obtain E, which has three dimensions [n, u, v]: the batch size n, the number u of values per vector, and the embedding dimension v of each value;
step 202: performing position encoding on E to obtain PE, the formula being
$$PE(pos, 2t) = \sin\big(pos/10000^{2t/v}\big), \qquad PE(pos, 2t+1) = \cos\big(pos/10000^{2t/v}\big)$$
wherein pos is the position of each value in the input vector and 2t refers to the encoded dimension of the word vector of each value; E and PE are then added dimension-wise to obtain a new vector X with position-encoding information;
step 203: sending X to an M-head self-attention mechanism module to obtain MultiHead(X), whose dimensions are the same as those of X:
$$head_t = \mathrm{softmax}\Big(\frac{Q_t K_t^{\top}}{\sqrt{d_k}}\Big) V_t, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, \dots, head_M)\,W^{O}$$
wherein Q_t = X W_t^Q, K_t = X W_t^K and V_t = X W_t^V, with W_t^Q, W_t^K, W_t^V the weight matrices of the t-th head of the M-head attention mechanism, d_k the dimension of the resulting K vector, head_t the output of the attention mechanism module of the t-th head, and MultiHead(X) the total output of the M-head attention mechanism;
step 204: residually connecting X and MultiHead(X) to obtain X′ = X + MultiHead(X), then performing batch normalization to obtain X″ = BN(X′);
step 205: passing X″ through the feed-forward computation to obtain FFN(X″) = σ(X″W_{f1} + b_{f1})W_{f2} + b_{f2}, wherein W_{f1}, b_{f1}, W_{f2}, b_{f2} are the weight and bias parameters of the feed-forward layers;
step 206: residually connecting X″ and FFN(X″) and performing batch normalization to obtain the encoder output Y = BN(X″ + FFN(X″)).
Further, the step 3 includes:
inputting the m-dimensional mel-frequency cepstral coefficient features of each frame into a multi-layer time-delay neural network, which aggregates the frame-level features into one speech-level feature while considering the context information between adjacent frames; the mean and variance of the outputs of the final time-delay-neural-network layer are computed and concatenated, and the result is passed through a further layer to obtain the speaker feature vector x.
Further, in the step 4, performing feature fusion with the voice features obtained after the time information extraction in step 2 and the speaker features obtained in step 3 as inputs comprises:
subjecting the encoder output Y in sequence to dimension conversion, adaptive average pooling, and dimension conversion to obtain G:
$$G = T\big(P(T(Y))\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation;
concatenating G with the speaker feature vector x and feeding the result into a convolution module formed by stacking k convolution layers to obtain the high-dimensional speech-level feature G′, the process being expressed as
$$g_1 = \sigma(W_1 * [G; x] + b_1), \qquad g_l = \sigma(W_l * g_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is g_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, g_{l-1} is the output of the (l-1)-th convolution layer, and g_l is the output of the l-th convolution layer.
Further, in the step 4, predicting the speech-level quality score comprises:
subjecting G′ in sequence to dimension conversion and adaptive average pooling to obtain the final score s:
$$s = P\big(T(G')\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
Compared with the prior art, the pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics has the following advantages: voice features and speaker features are extracted from the pathological voice and fused, and score prediction is then performed, so that the mapping between pathological voice and its corresponding subjective quality score is found, realizing objective, quantitative evaluation of pathological voice quality.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a schematic diagram of the overall process of the present invention;
FIG. 2 is a schematic diagram of an X-Vector extraction according to the present invention;
FIG. 3 is a schematic diagram of an encoder network architecture according to the present invention;
FIG. 4 is a graph comparing the fitting results of the present invention and MOSNet against subjective MOS scores;
fig. 5 is a graph of the results of the fitting of the present invention to subjective quality scores.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", etc. may explicitly or implicitly include one or more such feature. In the description of the present invention, unless otherwise indicated, the meaning of "a plurality" is two or more.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art in a specific case.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
As shown in Figure 1, the invention provides a pathological voice quality evaluation method based on the fusion of voice features and speaker features. Compared with previous methods, voice features and speaker features are extracted from the pathological voice and fused, so that the network can better extract effective information from the pathological voice. The invention takes pathological voice as input and outputs the corresponding quality score, i.e., the MOS score; it is end-to-end and accurate, and of great practical significance.
The method of the invention generally comprises the following steps:
Step 1, speech feature extraction and fusion. Pathological voice is input, multiple voice features are extracted, and feature fusion is performed. The invention takes the spectrogram and mel-frequency cepstral coefficients (MFCC) as the voice features to be extracted from the original pathological voice, and performs feature fusion taking a multi-layer convolution structure as an example.
Step 2, frame-level score prediction. The fused voice features obtained in step 1 serve as input for extracting temporal information and predicting frame-level scores. The invention uses the encoder as a time-series processing model to extract temporal information from the fused voice features and thereby predict the frame-level scores of pathological voice.
Step 3, speaker feature extraction. The mel-frequency cepstral coefficients (MFCC) extracted in step 1 serve as input for extracting speaker features. The invention uses the X-Vector as the speaker feature to add information representing the speaker's identity.
Step 4, speech-level score prediction. The voice features obtained after the temporal-information extraction in step 2 and the speaker features obtained in step 3 serve as inputs for feature fusion, finally yielding the speech-level quality score. The invention performs feature fusion taking a multi-layer convolution structure as an example, and predicts the speech-level quality score taking the adaptive pooling operation as an example.
In one embodiment, the voice feature extraction and fusion in step 1 of the present invention specifically includes:
1. Extraction of speech features
The following two procedures are applied to the input pathological voice to obtain two voice features: the spectrogram and the mel-frequency cepstral coefficients (MFCC).
1) Spectrogram: the pathological voice is framed, windowed with a Hamming window, and subjected to a short-time Fourier transform to obtain the magnitude spectrogram U of the voice signal.
2) Mel-frequency cepstral coefficients: p mel filters are applied to the pathological voice, yielding m MFCC coefficients per frame.
2. Fusion of speech features
The spectrogram U is input into a convolution module formed by stacking k convolution layers, finally yielding a feature vector for each frame. The process can be expressed as
$$h_1 = \sigma(W_1 * U + b_1), \qquad h_l = \sigma(W_l * h_{l-1} + b_l), \quad l = 2, \dots, k$$
where U is the input spectrogram of the pathological voice, σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is h_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, h_{l-1} is the output of the (l-1)-th convolution layer, and h_l is the output of the l-th convolution layer.
The per-frame output of the convolution module is concatenated with the m-dimensional MFCCs of the corresponding frame to obtain the preliminarily fused frame-level features F.
In one embodiment, in step 2 of the present invention, the prediction of the frame-level score specifically includes:
F is taken as input to a time-series processing module, the encoder, which extracts temporal information from F to obtain the frame-level features Y. The encoder network architecture is shown in Fig. 3. The process is detailed as follows:
1) F is first passed through the input embedding layer to obtain E, which has three dimensions [n, u, v]: the batch size n, the number u of values per vector, and the embedding dimension v of each value.
2) E is position-encoded to obtain PE:
$$PE(pos, 2t) = \sin\big(pos/10000^{2t/v}\big), \qquad PE(pos, 2t+1) = \cos\big(pos/10000^{2t/v}\big)$$
where pos is the position of each value in the input vector and 2t refers to the encoded dimension of the word vector of each value. E and PE are then added dimension-wise to obtain a new vector X with position-encoding information, as in the sketch below.
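As a concrete illustration, this standard sinusoidal encoding can be sketched as follows. PyTorch is an assumed tool here (the patent names no framework), and the frame count and embedding size are placeholder values.

```python
import math

import torch

def sinusoidal_position_encoding(n_pos: int, dim: int) -> torch.Tensor:
    # PE(pos, 2t) = sin(pos / 10000^(2t/dim)), PE(pos, 2t+1) = cos(...),
    # matching the formula in step 2).
    pe = torch.zeros(n_pos, dim)
    pos = torch.arange(n_pos, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions
    return pe

embedded = torch.randn(300, 512)                       # E: (frames, dim), dummy
x = embedded + sinusoidal_position_encoding(300, 512)  # X = E + PE
```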
3) X is sent to an M-head self-attention mechanism module to obtain MultiHead(X), whose dimensions are the same as those of X:
$$head_t = \mathrm{softmax}\Big(\frac{Q_t K_t^{\top}}{\sqrt{d_k}}\Big) V_t, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, \dots, head_M)\,W^{O}$$
where Q_t = X W_t^Q, K_t = X W_t^K and V_t = X W_t^V, with W_t^Q, W_t^K, W_t^V the weight matrices of the t-th head; d_k is the dimension of the resulting K vector, head_t is the output of the attention mechanism module of the t-th head, and MultiHead(X) is the total output of the M-head attention mechanism.
4) X and MultiHead(X) are residually connected to give X′ = X + MultiHead(X), which is then batch-normalized to give X″ = BN(X′).
5) X″ passes through the feed-forward computation: FFN(X″) = σ(X″W_{f1} + b_{f1})W_{f2} + b_{f2}.
6) X″ and FFN(X″) are residually connected and batch-normalized to give the encoder output Y = BN(X″ + FFN(X″)).
In one embodiment, in step 3 of the present invention, the extracting of the speaker characteristic includes:
The speaker embedding X-Vector is obtained from the input pathological voice as follows (see Fig. 2).
The m-dimensional MFCC features of each frame are input into a multi-layer time-delay neural network (TDNN), which aggregates the frame-level features into one speech-level feature while considering the context information between adjacent frames; the mean and variance of the outputs of the final TDNN layer are computed and concatenated, and the result is passed through a further layer to obtain the speaker feature vector x.
In one embodiment, in step 4 of the present invention, the speech-level score prediction includes:
1. Fusion of speech features and speaker features
The encoder output Y is subjected in sequence to dimension conversion, adaptive average pooling, and dimension conversion to obtain G:
$$G = T\big(P(T(Y))\big)$$
where T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
G is concatenated with the speaker feature vector x and fed into a convolution module formed by stacking k convolution layers to obtain the high-dimensional speech-level feature G′. The process can be expressed as
$$g_1 = \sigma(W_1 * [G; x] + b_1), \qquad g_l = \sigma(W_l * g_{l-1} + b_l), \quad l = 2, \dots, k$$
where σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is g_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, g_{l-1} is the output of the (l-1)-th convolution layer, and g_l is the output of the l-th convolution layer.
2. Prediction of speech-level scores
G′ is subjected in sequence to dimension conversion and adaptive average pooling to obtain the final score s:
$$s = P\big(T(G')\big)$$
where T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
The implementation of the invention is illustrated by specific examples.
1. Speech feature extraction and fusion
1. Extraction of speech features
1) Extracting the spectrogram of pathological voice: the pathological voice data is read at a sampling rate of 16 kHz, framed with a frame shift of 256 samples, and windowed with a Hamming window of length 512 samples. A 512-point short-time Fourier transform is applied to each windowed segment to obtain the per-frame magnitude spectrum.
2) Extracting the mel-frequency cepstral coefficients of pathological voice: following the MFCC extraction procedure above, 13 mel-frequency cepstral coefficients are extracted per frame using 40 mel filters. A sketch of both extraction steps follows.
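For illustration, both feature streams can be reproduced with standard audio tooling. The following is a minimal sketch assuming the librosa library (not named by the patent); the parameter values are those of this embodiment, and the file name is a placeholder.

```python
import librosa
import numpy as np

# Read pathological voice at a 16 kHz sampling rate.
y, sr = librosa.load("pathological_voice.wav", sr=16000)

# Magnitude spectrogram: 512-sample Hamming window, frame shift 256,
# 512-point short-time Fourier transform.
stft = librosa.stft(y, n_fft=512, hop_length=256, win_length=512,
                    window="hamming")
magnitude_spectrogram = np.abs(stft)  # shape: (257, n_frames)

# 13 mel-frequency cepstral coefficients per frame from 40 mel filters.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40,
                            n_fft=512, hop_length=256,
                            window="hamming")  # shape: (13, n_frames)
```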
2. Fusion of speech features
1) The magnitude spectrogram is input into a convolution module composed of 4 convolution layers. Layer 1 has 3×3×16 parameters (3×3 kernel, 16 output channels); layer 2 has 3×3×32 parameters (3×3 kernel, 32 output channels); layer 3 has 3×3×64 parameters (3×3 kernel, 64 output channels); layer 4 has 3×3×128 parameters (3×3 kernel, 128 output channels). Each of the 4 convolution layers uses the ReLU activation function.
2) The output of the convolution module is dimension-transformed to obtain a 512-dimensional vector for each frame, which is concatenated with the 13-dimensional MFCC of the corresponding frame to obtain a 525-dimensional vector, as in the sketch below.
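A sketch of this fusion stage follows, again assuming PyTorch. The channel widths (16/32/64/128), 3×3 kernels, and ReLU activations come from the text above; the padding, stride, and the exact frequency pooling used to reach 512 dimensions per frame are assumptions, since the patent does not fix them.

```python
import torch
import torch.nn as nn

class SpectrogramFusion(nn.Module):
    def __init__(self):
        super().__init__()
        chans = [1, 16, 32, 64, 128]
        # Four 3x3 convolution layers with 16/32/64/128 output channels.
        self.convs = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], kernel_size=3, padding=1)
            for i in range(4)
        )
        # Assumed: pool the frequency axis to 4 bins so that the
        # 128 channels flatten to the 512 dims per frame cited above.
        self.freq_pool = nn.AdaptiveAvgPool2d((4, None))

    def forward(self, spec: torch.Tensor, mfcc: torch.Tensor) -> torch.Tensor:
        # spec: (batch, 1, freq, frames); mfcc: (batch, 13, frames)
        h = spec
        for conv in self.convs:
            h = torch.relu(conv(h))
        h = self.freq_pool(h)               # (batch, 128, 4, frames)
        b, c, f, t = h.shape
        h = h.reshape(b, c * f, t)          # (batch, 512, frames)
        return torch.cat([h, mfcc], dim=1)  # (batch, 525, frames)

fused = SpectrogramFusion()(torch.randn(2, 1, 257, 300), torch.randn(2, 13, 300))
```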
2. Extraction of temporal information and prediction of frame-level scores
1. Extraction of time information
The 525-dimensional feature vector of each frame is input into the encoder, and temporal information is extracted to obtain a 512-dimensional vector for each frame. The multi-head attention mechanism in the encoder uses M = 8 heads, and 6 encoder layers are stacked.
2. Prediction of frame level scores
The encoder output, a 512-dimensional feature vector per frame, is mapped through a fully connected layer to a one-dimensional value, i.e., the frame-level MOS score. A sketch of this path follows.
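The sketch below approximates this path with PyTorch's built-in Transformer encoder (M = 8 heads, 6 layers, as above). Two assumptions to note: the patent's encoder applies batch normalization whereas nn.TransformerEncoderLayer uses layer normalization, and the positional encoding of step 202 is omitted here for brevity.

```python
import torch
import torch.nn as nn

embed = nn.Linear(525, 512)  # input embedding: 525-dim fused features -> 512
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)
frame_head = nn.Linear(512, 1)  # 512-dim frame vector -> frame-level MOS score

fused = torch.randn(2, 300, 525)               # (batch, frames, 525) dummy input
frames = encoder(embed(fused))                 # (batch, frames, 512)
frame_scores = frame_head(frames).squeeze(-1)  # (batch, frames)
```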
3. Speaker feature extraction
The 13-dimensional MFCC parameters extracted from each frame are input into a five-layer TDNN for feature extraction: the first TDNN layer takes the 13-dimensional vectors and outputs 512-dimensional vectors; the mean and variance of the outputs of the fifth TDNN layer are computed and spliced into a 1024-dimensional vector, which is mapped into the 512-dimensional X-Vector through a fully connected layer, as in the sketch below.
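A sketch of this extractor follows. The widths (13 to 512 per layer, statistics pooling to 1024, fully connected layer to 512) come from the text; the kernel sizes and dilations are assumptions in the spirit of common x-vector recipes, which the patent does not specify.

```python
import torch
import torch.nn as nn

class XVectorSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Five TDNN layers realized as dilated 1-D convolutions.
        self.tdnn = nn.Sequential(
            nn.Conv1d(13, 512, kernel_size=5, dilation=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=2), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=3, dilation=3), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
            nn.Conv1d(512, 512, kernel_size=1), nn.ReLU(),
        )
        self.fc = nn.Linear(1024, 512)  # 1024-dim statistics -> 512-dim X-Vector

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, 13, frames)
        h = self.tdnn(mfcc)  # (batch, 512, frames')
        # Mean and variance over time, spliced into a 1024-dim vector.
        stats = torch.cat([h.mean(dim=2), h.var(dim=2)], dim=1)
        return self.fc(stats)  # (batch, 512)

xvec = XVectorSketch()(torch.randn(2, 13, 300))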
4. Fusion of speech and speaker features and speech score prediction
1. Fusion of speech features and speaker features
Will beSequentially performing dimension transformation, self-adaptive average pooling operation, and aggregating frame-level features as speech-level features to obtain 128-dimensional feature vectors of each pathological voice>。
Will be 128-dimensionalAnd 512-dimensional X-vectors are spliced together to obtain a preliminarily fused 640-dimensional speech-level feature Vector. And then the deep-level characteristic extraction is carried out by a convolution module. The convolution module is divided into 4 layers, the parameter number of the 1 st layer is 3 multiplied by 8, wherein the size of the 1 st layer convolution kernel is 3 multiplied by 03, and the number of output channels is 8; the parameter amount of the layer 2 is 3×13×16, wherein the size of the convolution kernel of the layer 2 is 3×3, and the number of output channels is 16; the number of parameters of the 3 rd layer is 3×3×32, wherein the size of the 3 rd layer convolution kernel is 3×3, and the number of output channels is 32; the number of parameters of the 4 th layer is 3×3×64, wherein the size of the 4 th layer convolution kernel is 3×3, and the number of output channels is 64. Wherein the activation function of each of the 4 convolutional layers is a ReLU function. And then, carrying out dimension transformation processing to obtain 512-dimensional feature vectors of each pathological voice.
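The sketch below follows this stage in PyTorch. The 128-dim pooling, 640-dim concatenation, 8/16/32/64-channel convolution stack, and final 512-dim transformation come from the text; how the 640 values are arranged in two dimensions for the 3×3 kernels is an assumption, as the patent leaves it unspecified.

```python
import torch
import torch.nn as nn

convs = nn.Sequential(
    nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
)

frames = torch.randn(2, 300, 512)                # encoder output (batch, T, 512)
xvec = torch.randn(2, 512)                       # speaker feature, dummy input

utt = frames.mean(dim=1, keepdim=True)           # (batch, 1, 512) frame aggregation
utt = nn.AdaptiveAvgPool1d(128)(utt).squeeze(1)  # (batch, 128) utterance feature
fused = torch.cat([utt, xvec], dim=1)            # (batch, 640)

feat_map = fused.view(-1, 1, 32, 20)             # assumed 32x20 layout of 640 values
deep = convs(feat_map).flatten(1)                # (batch, 64*32*20)
utt512 = nn.AdaptiveAvgPool1d(512)(deep.unsqueeze(1)).squeeze(1)  # (batch, 512)
```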
2. Prediction of speech-level scores
The 512-dimensional feature vector is mapped to a one-dimensional MOS score by adaptive average pooling, giving the speech-level score, as in the sketch below.
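A minimal sketch of this final step, with a dummy input tensor:

```python
import torch
import torch.nn as nn

utt_feature = torch.randn(2, 512)  # (batch, 512) speech-level feature
# Adaptive average pooling maps the 512-dim feature to a single MOS value.
mos = nn.AdaptiveAvgPool1d(1)(utt_feature.unsqueeze(1)).flatten()  # (batch,)
```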
5. Loss function
1. Frame-level score network loss function:
To ensure the accuracy of the frame-level scores output by the network, the frame-level score prediction network is optimized with the mean square loss between the predicted frame scores $\hat{y}_t$ and the true frame-level scores $y_t$:
$$L_{frame} = \frac{1}{T}\sum_{t=1}^{T}\big(\hat{y}_t - y_t\big)^2$$
2. Speech-level score network loss function:
To ensure the accuracy of the speech-level score output by the network, the speech-level score prediction network is optimized with the mean square loss between the predicted speech-level score $\hat{q}$ and the true speech-level score $q$:
$$L_{speech} = \big(\hat{q} - q\big)^2$$
3. Total loss function
The total loss for objective pathological voice quality evaluation combines the frame-level and speech-level mean square losses; a sketch follows.
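A sketch of the training objective follows. Because the loss formulas themselves did not survive extraction, two details are assumptions: each frame is supervised with its utterance's MOS label (the usual MOSNet-style choice), and the two terms are summed without weighting.

```python
import torch
import torch.nn.functional as F

def total_loss(pred_frame: torch.Tensor,  # (batch, frames) predicted frame scores
               pred_utt: torch.Tensor,    # (batch,) predicted speech-level scores
               true_utt: torch.Tensor) -> torch.Tensor:  # (batch,) true MOS labels
    # Frame-level MSE: broadcast the utterance label over frames (assumed).
    frame_target = true_utt.unsqueeze(1).expand_as(pred_frame)
    l_frame = F.mse_loss(pred_frame, frame_target)
    # Speech-level MSE.
    l_speech = F.mse_loss(pred_utt, true_utt)
    return l_frame + l_speech  # assumed unweighted sum

loss = total_loss(torch.rand(2, 300), torch.rand(2), torch.rand(2))
```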
Figure 4 compares the performance of the MOSNet model and the proposed method. As Fig. 4 shows, the predicted and true scores of both methods are distributed around the identity line, i.e., the predictions correlate strongly with the ground truth, but the predictions of the present invention cluster more tightly around that line and fit better. This shows that the method based on fusing voice features and speaker features can effectively represent and extract features from pathological voice.
Fig. 5 shows how well the invention fits the true MOS labels. The predicted pathological voice quality scores are drawn as a curve, and the MOS truth data as a histogram. As Fig. 5 shows, although the truth values of the voice data are dispersed over a wide interval, the invention still makes reliable quality-score predictions: the prediction curve can be regarded as a fitting curve of the histogram of the truth data. This demonstrates that the invention, as an objective, automatic evaluation method, can fit human subjective evaluation and make reliable score predictions for pathological voice.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.
Claims (7)
1. A pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics is characterized in that: the method comprises the following steps:
step 1: inputting pathological voice, extracting two voice features of a spectrogram and a Mel frequency cepstrum coefficient, and carrying out feature fusion;
step 2: taking the fused voice characteristics as input, extracting time information and predicting frame-level scores;
step 3: taking the voice characteristics of the mel frequency cepstrum coefficient as input to extract the characteristics of a speaker;
step 4: and (3) taking the voice characteristics obtained after the time information extraction in the step (2) and the speaker characteristics obtained in the step (3) as inputs, and carrying out characteristic fusion to obtain the prediction of the speech-level quality score.
2. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 1, wherein in the step 1:
extracting spectrogram voice characteristics comprises framing the pathological voice, applying a Hamming window, and performing a short-time Fourier transform to obtain the magnitude spectrogram of the voice signal;
extracting mel-frequency cepstral coefficient voice characteristics comprises applying p mel filters to the pathological voice to obtain m mel-frequency cepstral coefficients per frame.
3. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 2, wherein in the step 1, feature fusion of the two voice features comprises:
inputting the obtained spectrogram into a convolution module formed by stacking k convolution layers, finally obtaining a feature vector for each frame, the process being expressed as
$$h_1 = \sigma(W_1 * U + b_1), \qquad h_l = \sigma(W_l * h_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein U is the input spectrogram of the pathological voice, σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is h_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, h_{l-1} is the output of the (l-1)-th convolution layer, and h_l is the output of the l-th convolution layer; and
concatenating the per-frame output of the convolution module with the m-dimensional mel-frequency cepstral coefficients of the corresponding frame to obtain the preliminarily fused frame-level features F.
4. The pathological voice quality evaluation method based on the fusion of voice characteristics and speaker characteristics according to claim 3, wherein the step 2 specifically comprises:
step 201: taking F as input and passing it through the input embedding layer to obtain E, which has three dimensions [n, u, v]: the batch size n, the number u of values per vector, and the embedding dimension v of each value;
step 202: performing position encoding on E to obtain PE, the formula for position encoding being
$$PE(pos, 2t) = \sin\big(pos/10000^{2t/v}\big), \qquad PE(pos, 2t+1) = \cos\big(pos/10000^{2t/v}\big)$$
wherein pos is the position of each value in the input vector and 2t refers to the encoded dimension of the word vector of each value; E and PE are then added dimension-wise to obtain a new vector X with position-encoding information;
step 203: sending X to an M-head self-attention mechanism module to obtain MultiHead(X), whose dimensions are the same as those of X:
$$head_t = \mathrm{softmax}\Big(\frac{Q_t K_t^{\top}}{\sqrt{d_k}}\Big) V_t, \qquad \mathrm{MultiHead}(X) = \mathrm{Concat}(head_1, \dots, head_M)\,W^{O}$$
wherein Q_t = X W_t^Q, K_t = X W_t^K and V_t = X W_t^V, with W_t^Q, W_t^K, W_t^V the weight matrices of the t-th head of the M-head attention mechanism, d_k the dimension of the resulting K vector, head_t the output of the attention mechanism module of the t-th head, and MultiHead(X) the total output of the M-head attention mechanism;
step 204: residually connecting X and MultiHead(X) to obtain X′ = X + MultiHead(X), then performing batch normalization to obtain X″ = BN(X′);
step 205: passing X″ through the feed-forward computation to obtain FFN(X″) = σ(X″W_{f1} + b_{f1})W_{f2} + b_{f2};
step 206: residually connecting X″ and FFN(X″) and performing batch normalization to obtain the encoder output Y = BN(X″ + FFN(X″)).
5. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 4, wherein the step 3 comprises:
inputting the m-dimensional mel-frequency cepstral coefficient features of each frame into a multi-layer time-delay neural network, which aggregates the frame-level features into one speech-level feature while considering the context information between adjacent frames; the mean and variance of the outputs of the final time-delay-neural-network layer are computed and concatenated, and the result is passed through a further layer to obtain the speaker feature vector x.
6. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 5, wherein in the step 4, feature fusion performed with the voice features obtained after the time information extraction in step 2 and the speaker features obtained in step 3 as inputs comprises:
subjecting the encoder output Y in sequence to dimension conversion, adaptive average pooling, and dimension conversion to obtain G, the calculation being
$$G = T\big(P(T(Y))\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation; and
concatenating G with the speaker feature vector x and feeding the result into a convolution module formed by stacking k convolution layers to obtain the high-dimensional speech-level feature G′, the process being expressed as
$$g_1 = \sigma(W_1 * [G; x] + b_1), \qquad g_l = \sigma(W_l * g_{l-1} + b_l), \quad l = 2, \dots, k$$
wherein σ(·) denotes the ReLU activation function, W_1 and b_1 are the weight and bias parameters of the first convolution layer, whose output is g_1, W_l and b_l are the weight and bias parameters of the l-th convolution layer, g_{l-1} is the output of the (l-1)-th convolution layer, and g_l is the output of the l-th convolution layer.
7. The pathological voice quality evaluation method based on the fusion of voice features and speaker features according to claim 6, wherein in the step 4, predicting the speech-level quality score comprises:
subjecting G′ in sequence to dimension conversion and adaptive average pooling to obtain the final score s, the calculation being
$$s = P\big(T(G')\big)$$
wherein T(·) denotes the exchange of dimensions and P(·) denotes the adaptive average pooling operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310395720.1A CN116110437B (en) | 2023-04-14 | 2023-04-14 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310395720.1A CN116110437B (en) | 2023-04-14 | 2023-04-14 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116110437A true CN116110437A (en) | 2023-05-12 |
CN116110437B CN116110437B (en) | 2023-06-13 |
Family
ID=86260107
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310395720.1A Active CN116110437B (en) | 2023-04-14 | 2023-04-14 | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116110437B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103093759A (en) * | 2013-01-16 | 2013-05-08 | 东北大学 | Device and method of voice detection and evaluation based on mobile terminal |
CN103730130A (en) * | 2013-12-20 | 2014-04-16 | 中国科学院深圳先进技术研究院 | Detection method and system for pathological voice |
CN107068167A (en) * | 2017-03-13 | 2017-08-18 | 广东顺德中山大学卡内基梅隆大学国际联合研究院 | Merge speaker's cold symptoms recognition methods of a variety of end-to-end neural network structures |
CN109727608A (en) * | 2017-10-25 | 2019-05-07 | 香港中文大学深圳研究院 | A kind of ill voice appraisal procedure based on Chinese speech |
US20210319804A1 (en) * | 2020-04-01 | 2021-10-14 | University Of Washington | Systems and methods using neural networks to identify producers of health sounds |
AU2020102516A4 (en) * | 2020-09-30 | 2020-11-19 | Du, Jiahui Mr | Health status monitoring system based on speech analysis |
CN112820279A (en) * | 2021-03-12 | 2021-05-18 | 深圳市臻络科技有限公司 | Parkinson disease detection method based on voice context dynamic characteristics |
CN114724589A (en) * | 2022-04-14 | 2022-07-08 | 标贝(北京)科技有限公司 | Voice quality inspection method and device, electronic equipment and storage medium |
CN115294970A (en) * | 2022-10-09 | 2022-11-04 | 苏州大学 | Voice conversion method, device and storage medium for pathological voice |
Non-Patent Citations (2)
Title |
---|
A. Gómez et al.: "Acoustic to kinematic projection in Parkinson's disease dysarthria", Biomedical Signal Processing and Control, pages 1-13 *
Zou Jiacheng (邹佳成): "Research and application of deep learning-based auscultation for respiratory lung diseases", China Master's Theses Full-text Database (Medicine & Health Sciences), No. 01, pages 1-65 *
Also Published As
Publication number | Publication date |
---|---|
CN116110437B (en) | 2023-06-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP4002362A1 (en) | Method and apparatus for training speech separation model, storage medium, and computer device | |
Moran et al. | Telephony-based voice pathology assessment using automated speech analysis | |
Jahangir et al. | Deep learning approaches for speech emotion recognition: state of the art and research challenges | |
Muhammad et al. | Convergence of artificial intelligence and internet of things in smart healthcare: a case study of voice pathology detection | |
CN112818892A (en) | Multi-modal depression detection method and system based on time convolution neural network | |
US20040002853A1 (en) | Method and device for speech analysis | |
Seneviratne et al. | Multi-Corpus Acoustic-to-Articulatory Speech Inversion. | |
WO2023139559A1 (en) | Multi-modal systems and methods for voice-based mental health assessment with emotion stimulation | |
CN110946554A (en) | Cough type identification method, device and system | |
CN112397074A (en) | Voiceprint recognition method based on MFCC (Mel frequency cepstrum coefficient) and vector element learning | |
Tripathi et al. | A novel approach for intelligibility assessment in dysarthric subjects | |
CN111489763A (en) | Adaptive method for speaker recognition in complex environment based on GMM model | |
US20240057936A1 (en) | Speech-analysis based automated physiological and pathological assessment | |
Avila et al. | Automatic speaker verification from affective speech using Gaussian mixture model based estimation of neutral speech characteristics | |
Dibazar et al. | A system for automatic detection of pathological speech | |
CN115910097A (en) | Audible signal identification method and system for latent fault of high-voltage circuit breaker | |
CN116110437B (en) | Pathological voice quality evaluation method based on fusion of voice characteristics and speaker characteristics | |
Lee et al. | Assessment of dysarthria using one-word speech recognition with hidden markov models | |
Ribeiro et al. | Towards the prediction of the vocal tract shape from the sequence of phonemes to be articulated | |
Debnath et al. | Study of speech enabled healthcare technology | |
CN116013371A (en) | Neurodegenerative disease monitoring method, system, device and storage medium | |
Xu et al. | Voiceprint recognition of Parkinson patients based on deep learning | |
Amami et al. | A robust voice pathology detection system based on the combined bilstm–cnn architecture | |
Suwannakhun et al. | Characterizing Depressive Related Speech with MFCC | |
Naikare et al. | Classification of voice disorders using i-vector analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |